Aly Conteh started by referring back to the question posed before the break – he felt that suitability for digitisation should be only one factor in selecting material to be digitised. For newspapers, there is a historical period for which OCR quality is poor, but the desire to digitise complete runs overrode this concern. OCR is worthwhile, and a great deal of work is being done to continually improve OCR results and overcome the “issues”.
There are 825 million pages to digitise in the British Library!
The newspaper project was funded by the JISC back in 2003. From the beginning, the BL wanted to do article zoning and OCR. The aim was to provide free access to the academic community, to digitise out-of-copyright material, and to achieve UK-wide coverage. 160 million pages were eligible, so the BL worked with the community to decide which 4 million pages to digitise. This worked out at 48 titles, with great variety in size and layout.
Apostolos had gone through many of the problems with the source material earlier in the day – an additional one was animals, with a cat’s footprints shown across one newspaper page. In 2003, when the project started, it was not possible to come up with a figure for acceptable OCR accuracy levels – it would have been no more than a gut feeling. Microfilm was created initially, and digital images were created from that. Metadata, XML encoding and images then came together to create the resource.
The question of bitonal versus greyscale was really decided by the content, which included many illustrations, making greyscale preferable. The OCR results ranged from exceptionally good to worthless. The BL really wanted to know how to measure how successful the OCR process was, bearing in mind it was an automated process in which images could not be tweaked individually. What do you measure? Character accuracy? Word accuracy? Accuracy of significant words? In fact, there is a very strong correlation between all of these measures, so it may not matter which one you choose.
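To make the distinction between these measures concrete, here is a minimal sketch of how character-level and word-level accuracy can each be computed as 1 minus the normalised edit distance against a ground-truth transcription. The function names and the sample strings are illustrative assumptions, not the BL's actual evaluation tooling.

```python
def edit_distance(truth, ocr):
    """Levenshtein distance between two sequences (strings or word lists),
    computed by dynamic programming with a rolling row."""
    prev = list(range(len(ocr) + 1))
    for i, t in enumerate(truth, 1):
        cur = [i]
        for j, o in enumerate(ocr, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (t != o)))    # substitution
        prev = cur
    return prev[-1]

def accuracy(truth_units, ocr_units):
    """Accuracy = 1 - (edit distance / length of ground truth)."""
    dist = edit_distance(truth_units, ocr_units)
    return max(0.0, 1.0 - dist / len(truth_units))

truth = "the quick brown fox"
ocr = "tbe quick brovvn fox"   # typical OCR confusions: h->b, w->vv

char_acc = accuracy(truth, ocr)                    # character-level
word_acc = accuracy(truth.split(), ocr.split())    # word-level
```

Note that the same two misread words yield a much lower word accuracy (0.5 here) than character accuracy, which is one reason the choice of measure looks like it should matter – even though, in practice, the measures track each other closely across a large corpus.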
OCR can come up with some unusual results, such as “internet” and “dvd” in 19th century newspapers – highlighting the need for historical dictionaries.
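One way a historical dictionary could help is as a post-OCR filter: any token absent from a period-appropriate lexicon is flagged as a likely misrecognition. The tiny `period_lexicon` below is an illustrative stand-in for a real historical dictionary, and this sketch is an assumption about how such a check might work, not the BL's actual pipeline.

```python
import re

# Illustrative fragment of a 19th-century lexicon; a real historical
# dictionary would contain the full period vocabulary.
period_lexicon = {"the", "steam", "railway", "telegraph", "parliament"}

def suspicious_tokens(ocr_text, lexicon):
    """Return lowercase alphabetic tokens not found in the period lexicon."""
    tokens = re.findall(r"[a-z]+", ocr_text.lower())
    return [t for t in tokens if t not in lexicon]

flags = suspicious_tokens("the telegraph internet dvd railway", period_lexicon)
# flags anachronisms such as "internet" and "dvd" for review
```

The same idea can be turned around during recognition itself: an OCR engine that ranks candidate readings against a period lexicon, rather than a modern one, is less likely to produce anachronisms in the first place.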
Why does good quality OCR matter? 70%–80% accuracy is acceptable for retrieval, particularly with fuzzy search. So why try to improve it? One reason is that you need good quality text to be able to use text mining tools effectively. Progress with OCR is necessary to make this type of activity possible.
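Fuzzy search tolerates OCR noise by matching on similarity rather than exact spelling. A minimal sketch using `difflib` from the Python standard library – an illustrative stand-in, not the search engine used by the BL service:

```python
import difflib

def fuzzy_find(query, ocr_words, cutoff=0.7):
    """Return OCR words whose similarity ratio to the query
    exceeds the cutoff, best matches first."""
    return difflib.get_close_matches(query, ocr_words, n=5, cutoff=cutoff)

page = "tbe quick brovvn fox jumped over the lazy dog".split()
matches = fuzzy_find("brown", page)   # matches "brovvn" despite the OCR error
```

This is why moderate accuracy is good enough for search: a query still lands on the misread word. Text mining is less forgiving, because tools that count, compare, or extract terms operate on the literal (wrong) strings.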
Storage has been an issue, and JPEG 2000 has been used to reduce the storage requirements. Of the total cost of the project, OCR accounted for about 5%. Effort was also made to improve the quality of the article headings, and if those costs are included, the proportion spent on OCR rises.
Notes by Emma Huber, UKOLN.