Aly Conteh of the British Library opened the day, explaining the context of the workshop.
Speaking from his own experience, he said that before he joined the British Library, OCR was simply something that was done, and it was thought to do a pretty good job. When digitising historical texts, however, you start running into problems. The question then is what can be done to improve the quality of the text, and that is what today is about: looking at the technology being developed to improve OCR.
An important aspect is the digitisation workflow:
- Selection of material
- Capture of images
- Process – this is where OCR and other enhancement processes come in
- Access – typically through the web, with full text searching
- Preservation – particularly for cultural heritage institutions
This workshop focuses on the process stage, which includes image enhancement, binarisation and segmentation; applying the OCR, which usually runs the text through an internal dictionary to improve accuracy; and providing metadata to match the OCR text to the image from which it was taken.
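To make the binarisation step a little more concrete, here is a minimal sketch in Python. It assumes a greyscale page held as a list of rows of integers from 0 (black) to 255 (white), and it uses Otsu's method to pick the threshold. Otsu's method is only one common choice and is not necessarily what the workshop speakers used; the function names `otsu_threshold` and `binarise` are made up for this illustration.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0-255) that best separates dark from light.

    Pixels with value <= t form the dark class. The chosen t maximises the
    between-class variance, which is the criterion used by Otsu's method.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of grey levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # no dark pixels yet, nothing to compare
        w1 = total - w0
        if w1 == 0:
            break             # every pixel is now in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarise(image, threshold):
    """Map each pixel to 1 (ink) if it is at or below the threshold, else 0."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]


# Tiny hypothetical "page": mostly light background with two dark ink pixels.
page = [
    [220, 200, 30],
    [20, 210, 215],
]
flat = [p for row in page for p in row]
t = otsu_threshold(flat)
bitmap = binarise(page, t)
```

A single global threshold like this is the simplest case. Historical material with stains, show-through or uneven lighting often needs a locally adaptive threshold instead, which is part of why the process stage is worth a whole day's discussion.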
This is a lot to discuss in just one day! Delegates were asked to think of questions throughout the day to ask at the panel session at the end.
Notes by Emma Huber, UKOLN.