Thesis
Jon Mills "Computer-assisted Lemmatisation of a Cornish Text Corpus for Lexicographical Purposes"
Abstract
This project sets out to discover and develop techniques for the lemmatisation
of a historical corpus of the Cornish language in order that a lemmatised
dictionary macrostructure can be generated from the corpus. The system
should be capable of uniquely identifying every lexical item that is attested in
the corpus. A survey of published and unpublished Cornish dictionaries,
glossaries and lexicographical notes was carried out. A corpus was compiled
incorporating specially prepared new critical editions. An investigation into
the history of Cornish lemmatisation was undertaken. A systemic description
of Cornish inflection was written. Three methods of corpus lemmatisation
were trialled. Findings were as follows. Lexicographical history shapes current
Cornish lexicographical practice. Lexicon-based tokenisation has advantages
over character-based tokenisation. System networks provide the means to
generate base forms from attested word types. Grammatical difference is the
most reliable way of disambiguating homographs. A lemma that contains three
fields, the canonical form, the part-of-speech and a semantic field label,
provides a unique code for every lexeme attested in the corpus. Programs
which involve human interaction during the lemmatisation process allow
bootstrapping of the lemmatisation database. Computerised morphological
processing may be used at least to partially create the lemmatisation database.
Disambiguation of at least some of the most common homographs may be
automated by the use of computer programs.
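
The three-field lemma described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the class name, the "|" separator, the field labels, and the small database are all assumptions made for illustration, and the entries are English placeholders rather than forms attested in the Cornish corpus. The sketch shows how grammatical category narrows the candidates for a homograph, and how the semantic field label keeps apart lemmas that share both canonical form and part-of-speech.

```python
from typing import List, NamedTuple, Optional


class Lemma(NamedTuple):
    """A lemma with three fields, following the abstract's description."""
    canonical_form: str
    part_of_speech: str
    semantic_field: str

    def code(self) -> str:
        # Hypothetical encoding: the "|" separator is an assumption.
        return f"{self.canonical_form}|{self.part_of_speech}|{self.semantic_field}"


# Hypothetical lemmatisation database mapping word types to candidate lemmas.
# English placeholders only; none of these entries come from the corpus.
LEMMATISATION_DB = {
    "bank": [
        Lemma("bank", "noun", "finance"),
        Lemma("bank", "noun", "landscape"),
        Lemma("bank", "verb", "finance"),
    ],
    "banks": [
        Lemma("bank", "noun", "finance"),
        Lemma("bank", "noun", "landscape"),
        Lemma("bank", "verb", "finance"),
    ],
}


def candidates(word_type: str, part_of_speech: Optional[str] = None) -> List[Lemma]:
    """Return candidate lemmas for a word type, optionally filtered by part-of-speech."""
    lemmas = LEMMATISATION_DB.get(word_type.lower(), [])
    if part_of_speech is None:
        return list(lemmas)
    return [lemma for lemma in lemmas if lemma.part_of_speech == part_of_speech]
```

In this sketch, `candidates("bank", "verb")` leaves a single lemma, so grammatical information alone resolves that reading, whereas `candidates("bank", "noun")` still returns two lemmas that only the semantic field label distinguishes.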