SAMUELS Project Portal


The SAMUELS project (Semantic Annotation and Mark-Up for Enhancing Lexical Searches) developed a new semantic annotation tool, the Historical Thesaurus Semantic Tagger (HTST), and produced semantically-annotated versions of two major text corpora, the Semantic Hansard Corpus and the Semantic EEBO (Early English Books Online) Corpus. Annotating large textual datasets such as linguistic corpora with semantic tags opens up powerful new ways of exploring their data. Users can search Semantic Hansard and Semantic EEBO not only for a word but for a concept, and can rapidly and accurately explore the ways in which these concepts relate to one another, a process which can be slow and painstaking using previously available resources. Additionally, semantic annotation allows users to search for a desired meaning of a word with multiple senses (such as bank, which may mean ‘river bank’, ‘financial institution’, or ‘piggy bank’, amongst other things) without having to laboriously eliminate irrelevant hits from their results.

The Historical Thesaurus Semantic Tagger integrates elements of the gold-standard USAS and CLAWS taggers and expands their capabilities in two main ways: it utilises an extensive, fine-grained set of meaning classifications in its tagging pipeline, and it can be used on historical forms of the language as well as on present-day English. These advancements are made possible through the use of data from the Historical Thesaurus of English, the only thesaurus thus far created with full coverage of a language in its modern and historical forms. The Historical Thesaurus also provides a link to the Oxford English Dictionary, whose enormous and complex database of words' variant spellings is integrated into a tagger here for the first time.

The research team included experts in natural language processing at Lancaster University’s University Centre for Computer Corpus Research on Language (UCREL), including the developers of the original UCREL Semantic Analysis System (USAS) semantic tagger and the creator of the Variant Detector (VARD) system for normalising word spelling in historical text. Semanticists and corpus linguists at the University of Glasgow ran the project, provided knowledge of meaning relationships, and worked to tailor a version of the Historical Thesaurus hierarchy to the tagger’s needs. Colleagues at the University of Huddersfield and University of Central Lancashire tested the utility of the tagger’s output on pilot projects, both of which have led to further research and funding.

The SAMUELS project was funded by the Arts and Humanities Research Council in conjunction with the Economic and Social Research Council (grant reference AH/L010062/1) from January 2014 to April 2015.

The SAMUELS consortium consisted of the University of Glasgow (lead institution), Lancaster University, the University of Huddersfield, the University of Central Lancashire, the University of Strathclyde, and Oxford University Press. Our international partners were Brigham Young University (Utah), Åbo Akademi University (Finland), and the University of Oulu (Finland).

 


Semantically-Annotated Corpora

The two corpora which were annotated using the HTST are available through the website english-corpora.org, created and maintained by corpus specialist Professor Mark Davies. Semantic Hansard contains approximately 1.6 billion words, consisting of a record of spoken contributions in the Houses of Commons and Lords in the UK parliament between 1803 and 2005. Semantic EEBO contains 755 million words and represents a selection of material printed in English or in English-speaking countries between roughly 1470 and 1700. Together, these two represent some of the largest and most complex corpora of historical English currently available, which made them ideal for exploration through semantic annotation.

The english-corpora.org interface allows researchers to search the corpora either by word or by semantic category. A user may, for example, wish to search for the word happiness or for all words tagged with the semantic label ‘AU11: Happiness’. The results of such searches are statistics for frequency of occurrence and concordance lines showing the context in which the search items appear. Users can also search for collocation information both by word and by semantic category. ‘Collocates’ are words which regularly appear in the vicinity of a search word or, where the search term is a semantic category, appear in the vicinity of one or more words tagged as belonging to that category. Collocation information is important in the study of relationships between concepts, as well as the way in which speakers understand distinctions in the meaning of related words.
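As a rough illustration of how collocates of a semantic category might be computed (a minimal Python sketch over invented tokens and made-up tag labels, not the english-corpora.org interface or real HTST output), the code below counts the words that occur within a small window of any token carrying a chosen tag:

    from collections import Counter

    # Toy tagged text: each token is a (word, semantic_tag) pair.
    # The tag labels are invented for illustration only.
    tagged = [
        ("great", "Quality"), ("happiness", "Happiness"),
        ("and", "Grammar"), ("joy", "Happiness"),
        ("filled", "Action"), ("the", "Grammar"),
        ("house", "Dwelling"),
    ]

    def collocates_by_tag(tokens, target_tag, window=2):
        """Count words appearing within `window` tokens of any token
        carrying `target_tag` (the matching token itself is excluded)."""
        counts = Counter()
        for i, (_, tag) in enumerate(tokens):
            if tag != target_tag:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j][0]] += 1
        return counts

    print(collocates_by_tag(tagged, "Happiness"))
    # -> Counter({'and': 2, 'great': 1, 'joy': 1, ...})

A real collocation search would of course work over millions of tokens and report association statistics rather than raw counts, but the underlying windowed lookup by tag is the same idea.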

Reliable semantic annotation of large text corpora opens up new possibilities for researching large-scale patterns in the relationships between ideas, as well as the existence of repeated word or semantic field pairings which act to distinguish components of meaning. Word relationships are increasingly used in teaching computers to understand ‘meaning’ in language, and it may be the case that the concept relationships encoded in a thesaurus hierarchy are crucial to taking the step from recognising relationships between words (as strings of characters) to recognising the relationships between ideas.

Project Methodology

The project teams at Lancaster and Glasgow worked together to create a semantic annotation tool which draws on the data and structure of the Historical Thesaurus of English, building on the earlier USAS tagger. The USAS software had to be adapted to utilise the branching, tree-like hierarchy of the Historical Thesaurus, calculating a ‘distance’ measure for possible word meanings which helped the system to select the most likely meaning of a word in context. The system also had to filter possible word meanings using date information contained within the Historical Thesaurus data, creating the first semantic annotation system to distinguish between word senses based on the date of the text to be annotated. Details on the technical development of the tagger can be found in the journal Computer Speech and Language 46 (2017), 113-135.
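The published paper gives the actual scoring method; purely as an illustrative sketch (in Python, with invented category codes and dates rather than real Historical Thesaurus data or the HTST algorithm), the two ideas mentioned above, a hierarchy ‘distance’ between candidate meanings and date-based filtering of senses, might look something like this:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Sense:
        category: str             # dotted hierarchy code, e.g. "01.05.03" (invented)
        first_year: int           # first recorded use of this sense
        last_year: Optional[int]  # None means the sense is still current

    def tree_distance(cat_a, cat_b):
        """Number of steps between two category codes via their
        deepest shared ancestor in the hierarchy."""
        a, b = cat_a.split("."), cat_b.split(".")
        shared = 0
        for x, y in zip(a, b):
            if x != y:
                break
            shared += 1
        return (len(a) - shared) + (len(b) - shared)

    def active_senses(senses, text_year):
        """Drop senses that were not in use at the date of the input text."""
        return [s for s in senses
                if s.first_year <= text_year <= (s.last_year or 9999)]

    # Two hypothetical senses of one word form.
    senses = [Sense("01.05.03", 1400, 1750), Sense("02.01.07", 1820, None)]

    print(active_senses(senses, 1650))            # only the older sense survives
    print(tree_distance("01.05.03", "01.05.09"))  # 2: siblings under "01.05"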

A further requirement of the analysis of historical text was the incorporation of spelling normalisation in the tagging process. Texts from the Early Modern period and earlier may exhibit multiple spellings for many words, since they predate the establishment of widespread standardised spelling in English. Such spelling complexity was smoothed out by integrating the VARD 2 (VARiant Detector) system, also developed at Lancaster University's UCREL, into the HTST workflow, and by utilising the Oxford English Dictionary's variant spelling database in corpus linguistics research for the first time.
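In spirit (though not in sophistication; VARD 2 and the OED's variant database draw on context, frequency, and edit-distance evidence rather than a small fixed table), the normalisation step amounts to mapping historical spellings onto modern forms before dictionary lookup, as in this minimal Python sketch with a handful of invented table entries:

    # Toy variant-spelling table; entries are illustrative only.
    VARIANTS = {
        "loue": "love",
        "vertue": "virtue",
        "bloud": "blood",
        "onely": "only",
    }

    def normalise(tokens):
        """Replace known historical variants with modern forms so that
        downstream lexicon lookup and semantic tagging can proceed."""
        return [VARIANTS.get(t.lower(), t) for t in tokens]

    print(normalise("The onely vertue of his bloud".split()))
    # ['The', 'only', 'virtue', 'of', 'his', 'blood']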

In order to improve the user experience, the Glasgow team created a new version of the Historical Thesaurus hierarchy which could be overlaid on the full hierarchical structure. This streamlined ‘thematic category set’ collapses the most fine-grained meaning categories without removing the words contained within them, allowing users to find important concepts more quickly than would be possible browsing the complete listing of over 200,000 categories. The resulting thematic categories also proved useful for the project as a level at which tagging results can be reliably aggregated. In addition, new codes were created for grammatical items not normally found in a thesaurus (such as the and or).
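Conceptually (the real thematic category set was curated by hand rather than generated mechanically, and the codes below are invented), collapsing fine-grained categories into broader themes can be pictured as a longest-prefix lookup over the hierarchy codes:

    # Invented mapping from hierarchy-code prefixes to broad thematic labels.
    THEMATIC = {
        "01.02": "Life",
        "01.05": "Food and drink",
        "02.01": "Emotion",
    }

    def thematic_label(fine_code):
        """Map a fine-grained category code to the broader thematic
        category containing it, by longest matching code prefix."""
        parts = fine_code.split(".")
        while parts:
            key = ".".join(parts)
            if key in THEMATIC:
                return THEMATIC[key]
            parts.pop()
        return "unclassified"

    print(thematic_label("02.01.07.03"))  # -> 'Emotion'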

As the project proceeded, sub-projects at partner institutions worked with the data. Research associates at Glasgow, Huddersfield, and the University of Central Lancashire (UCLan) manually annotated a selection of texts with semantic tags, creating test data against which the tagger's accuracy could be evaluated. From the point at which a prototype tagger existed, the sub-projects began conducting research on tagged samples of the Hansard and EEBO corpora. The main sub-projects reported on progress at the end-of-project meeting; since then, work led by Professor Dawn Archer (MMU, formerly UCLan) has been published, and a follow-on project exploring the semantically annotated Hansard data has been developed by the Huddersfield team, led by Professor Lesley Jeffries.

Why Use the Historical Thesaurus of English for Semantic Annotation?

There are two main benefits which accrue from using the Historical Thesaurus of English as a source of data in semantic annotation. Firstly, it has an unrivalled classification of the senses of each word in the language and, secondly, it includes words from the entire history of the language.

 

Sense Disambiguation

The most significant issue in dealing with large textual datasets is that our primary methodology for searching them, then aggregating and analysing the results, relies not on concepts or meanings but rather on word forms. These forms – effectively strings of letters, potentially including punctuation and spaces – are imperfect and evasive proxies for the meanings to which they refer; 60% of word forms in English refer to more than one meaning, and some word forms refer to close to two hundred meanings. The word spring, for example, has 150 possible meanings. The "noise" which appears when searching using word forms grows with the size of the texts being searched – a traditional search for ‘spring’ would return every use of the word to denote a season of the year, as well as every use for a coiled piece of metal, and (potentially) uses for a type of salmon and a name for egg-yolk, amongst others.

In big data contexts, this problem confounds research, with analyses becoming intractable or requiring impractical amounts of manual intervention. Semantic taggers approach this problem by automatically labelling every word in an input text with its most likely meaning in context. The highly successful UCREL Semantic Analysis System (USAS), used as a foundation on which to create the HTST, was already capable of tagging texts with semantic labels based on a modern thesaurus containing over 45,000 words and almost 19,000 multi-word expressions within 232 meaning categories. The Historical Thesaurus of English dataset is considerably larger, at almost 800,000 entries within 235,000 meaning categories, and the use of such a vast thesaurus makes it possible to draw much finer-grained distinctions in word meanings.
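To make the idea concrete (a deliberately naive Python sketch with an invented three-word lexicon; the HTST's actual disambiguation draws on the full Historical Thesaurus, grammatical information, and dates), a tagger assigns each word one of its candidate senses, here preferring a sense whose broad field also appears among nearby unambiguous words:

    # Invented lexicon: each word form maps to its candidate sense labels,
    # written as "Field:Sense".
    LEXICON = {
        "bank":  ["Money:Institution", "Landscape:Riverbank"],
        "river": ["Landscape:River"],
        "loan":  ["Money:Loan"],
    }

    def tag(tokens, window=2):
        """Label each token with one candidate sense, preferring a sense
        whose top-level field matches a nearby unambiguous word's field."""
        out = []
        lowered = [t.lower() for t in tokens]
        for i, tok in enumerate(lowered):
            candidates = LEXICON.get(tok, ["Unmatched"])
            nearby_fields = set()
            for j in range(max(0, i - window), min(len(lowered), i + window + 1)):
                if j != i:
                    senses = LEXICON.get(lowered[j], [])
                    if len(senses) == 1:                       # unambiguous neighbour
                        nearby_fields.add(senses[0].split(":")[0])
            chosen = next((c for c in candidates if c.split(":")[0] in nearby_fields),
                          candidates[0])
            out.append((tokens[i], chosen))
        return out

    print(tag("the river bank".split()))  # bank -> Landscape:Riverbank
    print(tag("a bank loan".split()))     # bank -> Money:Institution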

As well as allowing users to search for a specific meaning of a polysemous word form, semantic annotation makes possible concept-based rather than word-based searching of texts, using a semantic category as a search term rather than a word form. The HTST therefore permits researchers to perform comprehensive searches for concepts (such as power, morality, disease, faith, emotions, war, food, or the bodily senses) rather than expend effort on building lists of words to search for in corpora, and it facilitates new techniques for exploring, searching for, and investigating large-scale phenomena in big humanities datasets. Examples of this type of work can be found in the materials from the end-of-project meeting and in the publications available on the Project Outputs page.
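The contrast between the two kinds of search can be shown with a small Python sketch over an invented word/TAG text (the tags here are made up and much cruder than HTST output): the word-list search finds only the forms the researcher thought to list, while the tag search retrieves every token annotated with the target category.

    import re

    # Invented tagged text in a simple word/TAG format.
    text = ("plague/Disease swept/Movement the/Grammar city/Place ; "
            "pox/Disease and/Grammar agues/Disease raged/Action")

    # Word-based search: only catches the forms we thought to include.
    word_hits = re.findall(r"\b(plague|fever|pestilence)/\S+", text)

    # Concept-based search: one semantic tag retrieves them all.
    concept_hits = re.findall(r"(\S+)/Disease\b", text)

    print(word_hits)     # ['plague']
    print(concept_hits)  # ['plague', 'pox', 'agues']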

 

Historical Lexis

Previous English semantic annotation software has been based on thesauri of the present-day language, which are excellent for text produced within roughly the last century. However, there is an extensive and ever-growing volume of historical text data being digitised and made accessible by archives and libraries around the world, much of which cannot be adequately annotated by a semantic tagger using only present-day data. Problems arise partly because words in older texts may have dropped out of use and so not be recorded in present-day thesauri, meaning these words go unrecognised by the tagger. Alternatively, the meanings of words can shift over time so that, for example, a present-day thesaurus would not record that the word sailor could refer to a sailing ship in the 18th century; words of this kind would be incorrectly labelled by a tagger which only recognises current word meanings.

Faced with these limitations, much historical text data would either be excluded from any analysis performed using semantic annotation, or the results would have to be accepted as error-ridden. This is a far from ideal situation, as historical writing is important not only to academic researchers in linguistics and history, but also to authors, journalists, family historians and a host of other users. As the amount of digitised historical text grows, so too do the needs of these users for more effective ways of finding the information they desire. Semantic taggers, therefore, need to be able to handle past forms of the language if they are to address the complexities of historical texts and archive documents.

The Historical Thesaurus of English is an ideal source of historical semantic data. It is based on historical dictionaries, primarily the Oxford English Dictionary, and therefore includes words from the entire history of the English language (although Old English presents a unique set of challenges and was not included in the development of the HTST). The Historical Thesaurus thus contains the most complete listing of historical English words, as well as the most comprehensive division of those words into senses, of any thesaurus presently available for any language. The inclusion of dates with word meanings feeds into the sense disambiguation process, allowing the tagger to exclude senses of polysemous words which were not active at the time an input text was written.