Question: What can smaller institutions do to get OCR content online, once it has been created?
When putting in a bid for funding, delivery to end-users should be a key consideration, and the funding application should include money for putting the material online. Funding bodies are very open to this.
Question: Wouldn’t it be in the JISC’s interest to create a simple out-of-the-box front end for OCRed text that could be used by smaller institutions?
Quite possibly! It is worth noting that once content is online, it is indexed by search engines, and a lot of users come to the text via Google, rather than the official interface. However, there may be a gap in the market for developing a front-end solution such as that proposed.
Question: Was re-digitising the microfilm of newspapers worthwhile? [directed at Aly Conteh of the British Library]
The British Library is now looking at digitising from the original, marking a shift in policy about the preferred medium for creating a surrogate. The questioner was interested in hearing about the BL’s experience of digitising from the original rather than microfilm. One of the case studies in the IMPACT project’s decision support tools will deal with the question of digitising from microfilm. Microfilm created for preservation purposes can contain duplicated images, which creates challenges for digitisation.
Question: Does anyone have any experience of OCRing material in scripts other than roman?
A delegate had experience of OCRing Arabic with Sakhr software, which claims a 95% accuracy rate, although accuracy seemed to be significantly lower in practice. The library in Alexandria is advanced in digitising Arabic material.
Chinese, Japanese and Thai are included in ABBYY software, with Arabic following shortly. Chinese software is also available.
Question: The workshop did not discuss the indexing, processing and display of OCRed text once it has been created. Have issues cropped up in these areas?
The British Library doesn’t provide access to the text itself, as it works with a publisher. The BL decided not to present the raw OCR to the user, as it didn’t help resource discovery, and research has shown that it is not accurate enough to support other applications, such as screen readers. Google do display the raw OCR, possibly to allow it to be indexed by other search engines, and because the quality of OCR is better for books than for newspapers.
Question: Can manuscripts be OCRed?
The BL’s work with manuscripts has mainly involved image capture rather than OCR. Handwriting recognition is very difficult. It is possible if the handwriting is very even, but cursive handwriting is next to impossible at the moment, although it may become possible in the future. The BL is looking into it, and into providing other tools, such as allowing users to annotate the image.
Collaborative correction is useful for this, and can help OCR engines to learn. The Australian Newspapers project doesn’t review user input, relying instead on the community. It is possible to rank the reliability of correctors by asking several of them to correct the same portion of text and comparing the results.
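As a rough illustration of ranking correctors on a shared passage, the sketch below scores each corrector by how often their line agrees with the majority reading. The function name `rank_correctors`, the data layout and the majority-vote rule are assumptions made for this example, not the method used by any project mentioned in the workshop.

```python
from collections import Counter


def rank_correctors(submissions):
    """Rank correctors by agreement with the majority reading.

    `submissions` maps a corrector's name to their corrected lines.
    Every corrector is assumed to have worked on the same passage,
    so all lists have the same length (a hypothetical layout).
    """
    names = list(submissions)
    n_lines = len(submissions[names[0]])

    # Take the most common version of each line as the consensus reading.
    consensus = []
    for i in range(n_lines):
        votes = Counter(submissions[name][i] for name in names)
        consensus.append(votes.most_common(1)[0][0])

    # Score = fraction of lines on which the corrector matches the consensus.
    scores = {
        name: sum(a == b for a, b in zip(submissions[name], consensus)) / n_lines
        for name in names
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


submissions = {
    "ann": ["The quick brown fox", "jumps over"],
    "bob": ["The quick brown fox", "jumps ovcr"],
    "cat": ["Tbe quick brown fox", "jumps over"],
}
ranking = rank_correctors(submissions)
```

Majority voting only works when at least three people correct the same text; with fewer, a trusted reference transcription would be needed instead.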
Question: How do you deal with malicious input in collaborative correction?
This isn’t difficult to detect, as there would be a mismatch between the input and the characters that the OCR engine recognised with confidence. However, the Australian Newspapers project has not encountered any malicious use. The previous version is always retained, so changes can be rolled back if necessary.
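The mismatch check described above could be sketched as follows, assuming the OCR engine supplies a confidence value for each character. The function name, the 0.9 threshold and the input layout are hypothetical; the alignment uses `difflib` from the Python standard library.

```python
from difflib import SequenceMatcher


def confident_chars_changed(ocr_chars, corrected, threshold=0.9):
    """Return the fraction of high-confidence OCR characters that a
    user's correction altered or removed.

    `ocr_chars` is a list of (character, confidence) pairs; this layout
    and the default threshold are assumptions for the example.
    """
    ocr_text = "".join(ch for ch, _ in ocr_chars)
    matcher = SequenceMatcher(None, ocr_text, corrected, autojunk=False)

    # Collect the positions in the OCR text that survive in the correction.
    kept = set()
    for block in matcher.get_matching_blocks():
        kept.update(range(block.a, block.a + block.size))

    confident = [i for i, (_, conf) in enumerate(ocr_chars) if conf >= threshold]
    if not confident:
        return 0.0
    changed = sum(1 for i in confident if i not in kept)
    return changed / len(confident)


# The OCR engine read "tbe cat" and was unsure only about the "b".
ocr = [("t", 0.99), ("b", 0.40), ("e", 0.95), (" ", 0.99),
       ("c", 0.99), ("a", 0.99), ("t", 0.99)]

genuine = confident_chars_changed(ocr, "the cat")
vandal = confident_chars_changed(ocr, "xxxxxxx")
```

A genuine correction mostly touches low-confidence characters, so its score stays near zero, while an edit that overwrites confidently recognised text scores close to one and could be queued for review.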
Question: Does the British Library back up its data, or is it cheaper to rescan if necessary?
The BL replicates its content in a number of sites, including Boston Spa and the National Libraries of Scotland and Wales. It also has a dark archive.
Question: Is OCR created as a preservation object or a surrogate?
In the BL it is created as a preservation object using the METS/ALTO standard. METS is a structured container for metadata. ALTO is a standard for describing the OCRed text, along with coordinates and structural information. The resulting METS/ALTO file goes into the preservation store.
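To give a feel for what ALTO records, the sketch below reads the words, coordinates and word confidence from a small hand-written fragment. The fragment is simplified and has not been validated against the ALTO schema, and it is not taken from the BL's files. `CONTENT`, `HPOS`, `VPOS`, `WIDTH`, `HEIGHT` and `WC` are attributes of the ALTO `String` element; the function name is made up for the example.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written ALTO fragment (illustrative only).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1"><PrintSpace>
    <TextBlock ID="TB1">
      <TextLine ID="TL1">
        <String CONTENT="Daily" HPOS="100" VPOS="200" WIDTH="80" HEIGHT="30" WC="0.98"/>
        <SP/>
        <String CONTENT="Ncws" HPOS="190" VPOS="200" WIDTH="75" HEIGHT="30" WC="0.61"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def alto_words(xml_text):
    """Extract each recognised word with its position and confidence."""
    root = ET.fromstring(xml_text)
    words = []
    for el in root.iter():
        # Compare local names so the code works across ALTO namespace versions.
        if el.tag.rsplit("}", 1)[-1] != "String":
            continue
        wc = el.get("WC")  # word confidence is optional in ALTO
        words.append({
            "text": el.get("CONTENT"),
            "x": float(el.get("HPOS")),
            "y": float(el.get("VPOS")),
            "width": float(el.get("WIDTH")),
            "height": float(el.get("HEIGHT")),
            "wc": float(wc) if wc is not None else None,
        })
    return words


words = alto_words(SAMPLE_ALTO)
```

The coordinates are what allow a viewer to highlight a search hit on the page image, and the confidence values can feed checks on correction quality.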
Question: Is it possible to correctly identify footnotes?
This is a matter of layout analysis. It would be possible to link superscripted text with footnotes if required.
Question: Do OCR engines have difficulties with a mixture of font families and sizes in one document?
This depends on whether each line is uniform. If not, then the OCR engine is likely to have difficulties!
Question: Can OCR engines cope with images which are largely pictorial, but which contain a small amount of text?
Yes, but there are various options. You can ignore any gibberish resulting from attempts to read the picture, you can manipulate the image to remove the picture information, or you can use tools provided by ABBYY to help. The best thing is to experiment with different methods and see what works best.
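The first option, ignoring gibberish, can be approximated with a simple filter on the OCR output. The sketch below drops tokens that are mostly non-alphabetic; the function names and the 0.7 ratio are assumptions chosen for illustration, and real material would need the threshold tuned by experiment.

```python
def looks_like_text(token, min_alpha_ratio=0.7):
    """Heuristic: treat a token as real text if most of its
    characters are letters. The ratio is an assumed default."""
    if not token:
        return False
    alpha = sum(ch.isalpha() for ch in token)
    return alpha / len(token) >= min_alpha_ratio


def drop_gibberish(tokens):
    """Keep only the tokens that pass the heuristic above."""
    return [t for t in tokens if looks_like_text(t)]


# Tokens as they might come out of an image that is mostly picture.
raw = ["Figure", "|~_", "3", ";;:'", "harbour", "x#%@"]
cleaned = drop_gibberish(raw)
```

This heuristic also discards numbers and dates, so it suits captions better than tables or figures with numeric labels.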
Notes by Emma Huber, UKOLN.