Guenter Muehlberger summarised his institution's experiences with OCR and went on to present a few use cases. The first was the Digitised Card Catalogue, created in 2004, which allows two types of search: of entry points and of full text. There was some debate about whether to provide a full-text search at all, since the recognised text was incomplete. The solution was to display the OCR output alongside the image, which lets users see where OCR errors have occurred. Users were also given the opportunity to correct the OCR in a simple structured form; 20–30 cards are corrected in this way every day, with a control mechanism to check for misuse or mistakes. The catalogue has been running for five years with good results.
Problems with OCR arise from poor image quality, handwritten notes, ink stamps, special characters etc. Overall, 85% of words were recognised correctly, which means that about 70% of cards can be retrieved through full-text searching. The OCR costs were relatively low, and in the future it may be possible to match records automatically with existing WorldCat records, so there have been no regrets about the decision to OCR.
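The notes do not explain how 85% word accuracy translates into roughly 70% of cards being retrievable. One plausible back-of-envelope reading, assuming a typical search must match about two words on a card and that word errors are independent (both assumptions mine, not the speaker's), is sketched below:

```python
# Back-of-envelope sketch: probability that a full-text search retrieves
# a card, assuming each query word is recognised correctly with
# independent probability word_accuracy, and a typical query must match
# words_per_query words on the card. Both figures are illustrative
# assumptions, not numbers from the talk.

def retrieval_rate(word_accuracy: float, words_per_query: int) -> float:
    return word_accuracy ** words_per_query

print(retrieval_rate(0.85, 2))  # ~0.72, close to the ~70% reported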
One company dominates the OCR field: ABBYY. Free software is available – Ocropus – which can sometimes produce good results but is a long way from being usable in production, and is definitely not what Google is using!
What do we expect from OCR accuracy rates? Researchers don’t want any errors at all in a critical edition. Publishers accept an error rate of perhaps 1 error in 200,000 characters. A recently purchased commercial ebook had 2 errors per page, but this was acceptable for that material. It depends on the purpose!
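To put the two tolerances on a common scale, assume roughly 2,000 characters per page (an assumption of these notes, not a figure from the talk):

```python
# Hedged arithmetic comparing the error tolerances mentioned above.
# The characters-per-page figure is an assumption for illustration.
CHARS_PER_PAGE = 2_000

publisher_rate = 1 / 200_000       # 1 error per 200,000 characters
ebook_rate = 2 / CHARS_PER_PAGE    # 2 errors per page

print(publisher_rate)              # 5e-06 errors per character
print(ebook_rate)                  # 0.001, ~200x the publisher tolerance
```

Under that assumption the ebook runs roughly 200 times above the publisher tolerance, which underlines the point that acceptability depends on purpose.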
If time had allowed, Guenter would have shown the importance of choosing the correct language when running an OCR engine, but he went straight on to how to store the OCR output: either as PDF, usually with the text behind the image, or as XML, which can hold rich information about the object.
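The talk does not name a specific XML schema; ALTO is one widely used format for word-level OCR output. The simplified sketch below (namespaces omitted, sample content invented) shows the kind of rich information, such as coordinates and per-word confidence, that XML can carry and a text-behind-image PDF cannot:

```python
# Minimal sketch of reading word-level OCR output from ALTO-style XML.
# ALTO is one widely used schema; this example is simplified (no
# namespaces) and the sample content is invented for illustration.
import xml.etree.ElementTree as ET

sample = """
<alto>
  <TextBlock>
    <TextLine>
      <String CONTENT="Katalog" WC="0.93" HPOS="120" VPOS="80"/>
      <String CONTENT="1904" WC="0.61" HPOS="310" VPOS="80"/>
    </TextLine>
  </TextBlock>
</alto>
"""

root = ET.fromstring(sample)
for word in root.iter("String"):
    # Each word carries its text, a confidence score, and coordinates,
    # none of which survive in plain extracted text.
    print(word.get("CONTENT"), word.get("WC"), word.get("HPOS"), word.get("VPOS"))
```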
There are difficulties in measuring OCR accuracy (by characters, words, blocks of text, layout etc.). Guenter recommends measuring by word, as this is easy to measure and to understand. As a rule of thumb, good results can be achieved for books printed from 1800 onwards, for newspapers, and for electric typewriters. Mechanical typewriters have inconsistencies which cause problems.
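A minimal sketch of word-level accuracy measurement, following the recommendation to count by word; the use of word-level edit distance against a hand-corrected ground truth is an assumption of these notes rather than a method prescribed in the talk:

```python
# Hedged sketch of word-level accuracy: 1 minus the word-level edit
# distance between a hand-corrected ground truth and the OCR output,
# normalised by the number of ground-truth words.

def word_accuracy(ground_truth: str, ocr_output: str) -> float:
    ref, hyp = ground_truth.split(), ocr_output.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 1 - d[len(ref)][len(hyp)] / len(ref)

print(word_accuracy("the quick brown fox", "the quiek brown fox"))  # 0.75
```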
Many people do not analyse what they want to digitise: which are the important pieces, are there sections which are repeated, is there metadata available, are dictionaries available? The more that is known, the more precisely the OCR engine can be configured.
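As a generic illustration of feeding known facts about the material into engine configuration: the talk centres on ABBYY FineReader, whose API is not shown in these notes, so the sketch below substitutes the open-source Tesseract engine via pytesseract; the file name and language mix are invented:

```python
# Illustrative only: this uses Tesseract via pytesseract, not the
# FineReader engine discussed in the talk, to show the general idea
# that prior knowledge about the material feeds directly into the
# engine configuration.
from PIL import Image
import pytesseract

# A catalogue known to mix German and Latin entries can declare the
# language mix up front rather than leave it to autodetection.
text = pytesseract.image_to_string(Image.open("card_0001.png"), lang="deu+lat")
print(text)
```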
Select test material randomly! Do not hide difficult material in order to exaggerate the potential of the resource, nor pick out only the difficult material in order to prove that OCR will never work anyway!
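A minimal sketch of unbiased test selection, drawing a fixed-size random sample from the full set of scans; the directory layout and sample size are assumptions for illustration:

```python
# Draw a reproducible random sample of pages for OCR evaluation, so
# that neither easy nor difficult material is cherry-picked.
import random
from pathlib import Path

pages = sorted(Path("scans").glob("*.tif"))
random.seed(42)  # fixed seed so the sample can be reproduced later
test_set = random.sample(pages, k=min(50, len(pages)))
for page in test_set:
    print(page)
```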
Use dictionaries, and train the engine to recognise particular characters – but only for frequently occurring special cases – and look into buying characters. Follow developments from the IMPACT project, which will improve FineReader, develop adaptive OCR, and produce language tools.
Question: will the technology be relevant in the future? There will always be a danger that texts will have to be re-OCRed as technology develops, but retaining high-quality images should make this possible. Rescanning the originals remains available as a last resort.
Notes by Emma Huber, UKOLN.