Unfortunately the camera battery ran out just before the last slide of this presentation!
Christoph Ringlstetter gave the context for his institution’s work on linguistic tools, looking at the specific challenges of historical material and of composing a corpus of such material.
Within IMPACT, CIS is helping to improve text recognition by adapting OCR software to historical material through the provision of specialist dictionaries, and by developing new types of user interface. Historical texts contain spelling variants and even unknown words. These cause current OCR engines to break down: the engine either tries to make sense of the text and introduces the wrong word, or produces gibberish.
Even if the OCR is good, a user can fail to retrieve results if the modern spelling they enter is not mapped to historical variants. Dealing with this requires lexica, language models (contemporary corpora), statistical information, and normalisation that maps between modern and historical spelling. These need to be based on a corpus, but there are not many keyed materials on the web. Many institutions are not willing to give their material away, which means approaching institutions one by one to gather a corpus. Another approach would be to create a corpus through keying or correcting OCR. At the moment CIS has obtained a development corpus of material. Two variants of lexica have been created: a hypothetical lexicon, based on historical transformation patterns, and a witnessed lexicon, collected manually.
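To make the retrieval problem concrete, here is a minimal sketch of how a normalisation lexicon could be used to expand a modern query term into historical spellings before searching. It is only an illustration: the word pairs, the function names and the data layout are assumptions made for this example, not the CIS lexica or tools.

```python
from collections import defaultdict

# Hypothetical "witnessed" pairs (historical spelling, modern spelling).
# Illustrative German examples only; not taken from the CIS lexica.
WITNESSED_PAIRS = [
    ("theil", "teil"),
    ("thun", "tun"),
    ("seyn", "sein"),
    ("vnd", "und"),
]


def build_normalisation_lexicon(pairs):
    """Map each modern spelling to the set of historical variants seen for it."""
    modern_to_historical = defaultdict(set)
    for historical, modern in pairs:
        modern_to_historical[modern].add(historical)
    return modern_to_historical


def expand_query(term, lexicon):
    """Return the modern term followed by its known historical variants."""
    term = term.lower()
    return [term] + sorted(lexicon.get(term, set()))


lexicon = build_normalisation_lexicon(WITNESSED_PAIRS)
print(expand_query("Teil", lexicon))  # ['teil', 'theil']
```

A search for the modern spelling would then be run against every term in the expanded list, so that pages containing only the historical form are still found.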
The hypothetical approach requires no manual work, but does result in some mismatches. The witnessed lexicon is safer, but very expensive to build. In a test, use of these historical lexica significantly reduced the OCR error rate for material from the 18th century onwards, although earlier material needs a different approach, with a special lexicon.
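As a rough sketch of the hypothetical approach, the code below generates candidate historical spellings by applying rewrite patterns to modern words. The three patterns and all names are assumptions chosen for illustration, not the transformation patterns CIS uses. It also hints at where mismatches can come from: the rules produce every form they allow, whether or not that form ever occurred in a real text.

```python
# Hypothetical rewrite patterns (modern substring -> historical substring).
# Illustrative only; not the pattern set used by CIS.
PATTERNS = [
    ("t", "th"),
    ("ei", "ey"),
    ("k", "ck"),
]


def hypothetical_variants(word, patterns=PATTERNS):
    """Apply each pattern once at each position where it matches."""
    variants = set()
    for modern, historical in patterns:
        start = word.find(modern)
        while start != -1:
            variants.add(word[:start] + historical + word[start + len(modern):])
            start = word.find(modern, start + 1)
    return sorted(variants)


def build_hypothetical_lexicon(modern_words):
    """Map each generated variant back to the modern word(s) it came from."""
    lexicon = {}
    for word in modern_words:
        for variant in hypothetical_variants(word):
            lexicon.setdefault(variant, set()).add(word)
    return lexicon


print(hypothetical_variants("teil"))  # ['teyl', 'theil']
print(hypothetical_variants("zeit"))  # ['zeith', 'zeyt']
```

No manual work is needed beyond writing the patterns, but nothing in the procedure checks the output against real texts, which is what a witnessed lexicon adds at much greater cost.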
Working with the BSB, CIS is developing a special lexicon for books on theology written in the 16th century. This first requires creating a corpus, as none existed. The hope is to improve OCR with the special lexicon, and to improve information retrieval with a normalisation lexicon.
Notes by Emma Huber, UKOLN.