Apostolos Antonacopoulos spoke about the processes undertaken before OCR is performed, looking at the documents themselves, the issues they present (problems, or perhaps opportunities), and the overall workflow.
Digitised documents range from manuscripts to newspapers, from old, high-quality texts to mass-produced, low-quality material. There are many different languages, often a combination within one text. There are different typefaces, again with variation within a single text, and layout varies greatly, with multiple columns, decorative borders, etc.
Some issues are inherent in the document itself: bleed-through, smear-over from the opposite page, typesetting peculiarities, paper texture, etc. Some issues arise from use: folds, tears, annotations, stains, repairs, holes, etc. Yet more issues arise from storage conditions: warping, discolouration, mould, shrinkage, fading, etc. Some issues also arise during scanning, due to uneven illumination, for example. A number of examples of these phenomena were shown.
In a digitisation workflow, libraries (or other similar institutions) prepare documents, scan them, examine the original documents and the quality of the scans, and sometimes perform validation and correction of OCR. The results are then hosted. The cost of OCR can work out at about the same as everything else put together – it is not a cheap extra!
The main steps for doing OCR are: scanning, image enhancement, layout analysis, OCR and post-processing.
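The steps above can be sketched as a simple pipeline. This is a minimal illustration only: all function names are hypothetical placeholders standing in for real scanning, enhancement, layout-analysis and recognition components, not an actual OCR API.

```python
# A minimal sketch of the OCR workflow described above.
# Every function here is a hypothetical placeholder, not a real library call.

def scan(document):
    """Produce a raw page image (stand-in value for illustration)."""
    return f"raw-image({document})"

def enhance(image):
    """Deskew, dewarp, remove noise, binarise, etc."""
    return f"enhanced({image})"

def analyse_layout(image):
    """Segment the page into blocks, lines, words and characters."""
    return [f"block({image})"]

def recognise(blocks):
    """Run character recognition on each segmented block."""
    return " ".join(f"text-of({b})" for b in blocks)

def post_process(text):
    """Correct and normalise the recognised text."""
    return text.strip()

def ocr_pipeline(document):
    image = enhance(scan(document))
    return post_process(recognise(analyse_layout(image)))
```

The point of the sketch is the ordering: each stage consumes the previous stage's output, so errors made early (a bad scan, a wrong binarisation) propagate into everything downstream.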
Scanning quality varies – there are issues of resolution, colour depth and compression. Some institutions choose bitonal over compressed greyscale, even though compressed greyscale would contain more information about the original. Overhead scanners capture page curl, while book scanners such as Treventus largely eliminate this.
Image enhancement makes the page easier for automated processes to handle and the text more readable for humans. Processes typically include page splitting (if a double-page spread), border removal, dewarping, deskewing, geometrical correction, removal of “noise” (stains etc.) and binarisation (necessary for Document Image Analysis methods).
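To make the binarisation step concrete, here is a pure-Python sketch of Otsu's method, a classic global thresholding technique: it picks the grey level that best separates ink from paper by maximising between-class variance. This is an illustrative assumption on my part – the talk does not say which binarisation method was used, and degraded documents typically need adaptive, local methods rather than a single global threshold.

```python
# Sketch of Otsu's global threshold for binarisation (illustrative only;
# the talk does not specify a method, and real degraded documents
# usually need adaptive/local binarisation instead).

def otsu_threshold(pixels):
    """Return the grey level (0-255) maximising between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_b = 0       # background (below-threshold) pixel count
    sum_b = 0.0   # background intensity sum
    for t in range(256):
        w_b += hist[t]
        if w_b == 0:
            continue
        w_f = total - w_b
        if w_f == 0:
            break
        sum_b += t * hist[t]
        mu_b = sum_b / w_b                 # background mean
        mu_f = (total_sum - sum_b) / w_f   # foreground mean
        var = w_b * w_f * (mu_b - mu_f) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def binarise(pixels):
    """Map each grey pixel to 0 (ink) or 255 (paper)."""
    t = otsu_threshold(pixels)
    return [0 if p <= t else 255 for p in pixels]
```

On a clean bimodal page (dark ink on light paper) this works well; the bleed-through, stains and uneven illumination mentioned above are exactly what breaks a single global threshold.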
Layout analysis includes the segmentation and classification of blocks, text lines, words and characters. It tends to work by analysing spacing, but spaces between blocks are not always greater than spaces between words, which makes it extremely problematic.
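Spacing-based segmentation is commonly done with projection profiles: sum the ink along each row (or column) and split wherever a run of blank rows exceeds a gap threshold. The toy sketch below assumes a binary image given as a list of rows of 0/1 pixels; it is my illustration of the general technique, not the speaker's specific method. The fixed `min_gap` parameter is precisely the weak point mentioned above: if word gaps are as wide as block gaps, no single threshold segments correctly.

```python
# Toy spacing-based segmentation via a horizontal projection profile.
# Illustrative only: real layout analysis is far more sophisticated,
# and a fixed gap threshold mis-segments pages where block spacing
# is not reliably wider than word spacing.

def segment_rows(image, min_gap=2):
    """Split a binary image (rows of 0/1, 1 = ink) into horizontal bands.

    Returns (start, end) row-index pairs (end exclusive) for each band
    of inked rows, treating runs of fewer than `min_gap` blank rows as
    part of the same band.
    """
    profile = [sum(row) for row in image]   # ink count per row
    bands, start, gap = [], None, 0
    for i, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:              # gap wide enough: close the band
                bands.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:                   # close any trailing band
        bands.append((start, len(profile) - gap))
    return bands
```

Raising `min_gap` merges bands that a smaller threshold would split, which is exactly the block-gap vs. word-gap ambiguity: the same page segments differently depending on one parameter.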
Some issues still need further work: correct binarisation remains a problem, as do dewarping of difficult cases and the correction of folds, local warping, etc. Further research is being done on colour analysis to correct stains, on dealing with noise and wide/narrow spacing, on identifying advertisements when they occur next to the main text, and on article tracking in newspapers.
Notes by Emma Huber, UKOLN.