Panel Session

Question: What can smaller institutions do to get OCR content online, once it has been created?

When putting in a bid for funding, delivery to end-users should be a key consideration, and the funding application should include money for putting the material online. Funding bodies are very open to this.

Wouldn’t it be in the JISC’s interest to create a simple out-of-the-box front end for OCRed text that could be used by smaller institutions?

Quite possibly! It is worth noting that once content is online, it is indexed by search engines, and a lot of users come to the text via Google, rather than the official interface. However, there may be a gap in the market for developing a front-end solution such as that proposed.

Question: Was re-digitising the microfilm of newspapers worthwhile? [directed at Aly Conteh of the British Library]

The British Library is now looking at digitising from the original, marking a shift in policy about the preferred medium for creating a surrogate; the questioner was particularly interested in the BL’s experience of digitising from the original rather than from microfilm. One of the case studies in the IMPACT project’s decision support tools will deal with the question of digitising from microfilm. Microfilm created for preservation purposes often contains duplicated images, which creates challenges for digitisation.

Question: Does anyone have any experience of OCRing material in scripts other than roman?

A delegate had experience of OCRing Arabic with Sakhr software, which claims a 95% accuracy rate, although accuracy seemed significantly lower in practice. The library in Alexandria is well advanced in digitising Arabic material.

Chinese, Japanese and Thai are supported by ABBYY software, with Arabic support following shortly. Dedicated Chinese OCR software is also available.

Question: The workshop did not discuss the indexing, processing and display of OCRed text once it has been created. Have issues cropped up in these areas?

The British Library doesn’t provide access to the text itself, as it works with a publisher. The BL also made a decision not to present the raw OCR to the user: it didn’t help resource discovery, and research has shown that it is not accurate enough to support other applications, such as screen readers. Google does display the raw OCR, possibly to allow it to be indexed by other search engines, and because OCR quality is better for books than for newspapers.

Question: Can manuscripts be OCRed?

The BL’s work with manuscripts has mainly involved image capture rather than OCR. Handwriting recognition is very difficult: it is possible if the handwriting is very even, but cursive handwriting is next to impossible at the moment, though it may become possible in the future. The BL is looking into it, as well as providing other tools, such as allowing users to annotate the image.

Collaborative correction is useful for this, and can help OCR engines to learn. The Australian Newspapers project doesn’t review the user input, instead relying on the community. It is possible to rank the reliability of correctors, by setting them to correct the same portion of text.
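
A minimal sketch of how such agreement-based ranking could work, assuming hypothetical volunteer submissions (this illustrates the idea only; it is not how the Australian Newspapers project or any other system actually operates):

```python
from collections import Counter

# Hypothetical submissions: several volunteers correct the same line of OCR.
submissions = {
    "alice": "The Mayor opened the new bridge on Tuesday",
    "bob":   "The Mayor opened the new bridge on Tuesday",
    "carol": "The Mayor opened the new fridge on Tuesday",
}

def reliability_scores(submissions: dict[str, str]) -> dict[str, float]:
    """Score each corrector by how often their words agree with the
    majority choice at each word position."""
    word_lists = {name: text.split() for name, text in submissions.items()}
    length = min(len(words) for words in word_lists.values())
    consensus = [
        Counter(words[i] for words in word_lists.values()).most_common(1)[0][0]
        for i in range(length)
    ]
    return {
        name: sum(w == c for w, c in zip(words, consensus)) / length
        for name, words in word_lists.items()
    }

print(reliability_scores(submissions))  # carol scores lower: she disagrees on one word
```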

Question: How do you deal with malicious input in collaborative correction?

This is not difficult to detect, as malicious input would mismatch characters that were recognised with high confidence by the OCR engine. However, the Australian Newspapers project has not encountered any malicious use. The previous version is always retained, allowing roll-back if necessary.
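
A minimal sketch of that kind of check, assuming hypothetical word-level confidence values from the OCR engine (the data, thresholds and function names are illustrative, not from any particular system):

```python
from difflib import SequenceMatcher

# Hypothetical structure: each OCR word carries the engine's confidence (0.0-1.0).
ocr_words = [
    {"text": "Parliament", "confidence": 0.97},
    {"text": "xnet", "confidence": 0.31},       # low confidence: likely an OCR error
    {"text": "yesterday", "confidence": 0.95},
]

def suspicious(correction: list[str], recognised: list[dict],
               conf_threshold: float = 0.9, similarity_floor: float = 0.6) -> bool:
    """Flag a submitted correction if it rewrites words the OCR engine
    was already confident about, which is a hint of malicious input."""
    for submitted, original in zip(correction, recognised):
        if original["confidence"] >= conf_threshold:
            similarity = SequenceMatcher(None, submitted, original["text"]).ratio()
            if similarity < similarity_floor:
                return True  # a high-confidence word was replaced wholesale
    return False

print(suspicious(["Parliament", "met", "yesterday"], ocr_words))  # False: only the low-confidence word changed
print(suspicious(["Bananas", "met", "yesterday"], ocr_words))     # True: a high-confidence word was rewritten
```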

Question: Does the British Library back up its data, or is it cheaper to rescan if necessary?

The BL replicates its content in a number of sites, including Boston Spa and the National Libraries of Scotland and Wales.  It also has a dark archive.

Question: Is OCR created as a preservation object or a surrogate?

At the BL it is created as a preservation object using the METS/ALTO standards. METS is a structured container for metadata; ALTO is a standard for describing the OCRed text, along with coordinates and structural information. The resulting METS/ALTO file goes into the preservation store.
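
As a rough illustration of the kind of information ALTO carries, here is a minimal sketch that pulls words and their coordinates out of a simplified ALTO-style fragment (the sample XML is illustrative only; real ALTO files use a versioned namespace and a fuller Layout/Page/TextBlock hierarchy):

```python
import xml.etree.ElementTree as ET

# A deliberately simplified ALTO-style fragment: each String element carries
# the recognised word (CONTENT), its position (HPOS/VPOS), its size and a
# word confidence (WC).
sample = """
<alto>
  <TextLine>
    <String CONTENT="LONDON" HPOS="120" VPOS="340" WIDTH="310" HEIGHT="42" WC="0.98"/>
    <String CONTENT="GAZETTE" HPOS="450" VPOS="340" WIDTH="360" HEIGHT="42" WC="0.91"/>
  </TextLine>
</alto>
"""

root = ET.fromstring(sample)
for string_el in root.iter("String"):
    word = string_el.get("CONTENT")
    x, y = string_el.get("HPOS"), string_el.get("VPOS")
    confidence = string_el.get("WC")
    print(f"{word!r} at ({x}, {y}), confidence {confidence}")
```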

Question: Is it possible to correctly identify footnotes?

This is a matter of layout analysis. It would be possible to link superscripted text with footnotes if required.

Question: Do OCR engines have difficulties with a mixture of font families and sizes in one document?

This depends on whether each line is uniform. If not, then the OCR engine is likely to have difficulties!

Question: Can OCR engines cope with images which are largely pictorial, but which contain a small amount of text?

Yes, but there are various options. You can ignore any gibberish resulting from attempts to read the picture, you can manipulate the image to remove the picture information, or you can use tools provided by ABBYY to help. The best thing is to experiment with different methods and see what works best.

Notes by Emma Huber, UKOLN.

Session 6iii – Case Study – Using digitised text collections in research and learning

Online digitised text collections are normally presented in an interface with two or three levels of differing complexity.

Little research has been done in this particular area. The talk was based on 15 semi-structured interviews with EEBO users. The main uses were:

  • Using catalogue information for retrieval
  • Discovering new material
  • Analysing material

Users didn’t always fully understand the functions available. There were links between the IT skills of users and the way they chose to interact with the resource.

The impact on research included:

  • More texts can be accessed more quickly – which could mean that obscure texts take on too much importance.
  • Different texts are consulted – manuscript texts could be being ignored, although some users reported seeking out hard-to-find texts in order to gain brownie points with examiners. Keyword searching was sometimes used to filter out boring reading!
  • Different methods are used – some types of research are possible for the first time, with wider possibilities for investigation due to the ability to work across a whole corpus. This could be linked to a trend towards interdisciplinary research.

There is an ambivalent attitude to the use of electronic resources – “I don’t cite EEBO” – if a text is electronic it doesn’t seem to be equivalent to the original! Is it better to look at the original?

Resources are having an impact on research, but there may be pitfalls:

  • Risks of keyword searching – it has been suggested that it could impede human learning…
  • Lack of understanding of how the text was created – this may impact on method design; more knowledge is required.

Greater transparency is required from resource creators, and help should be offered to end users. Training should have more of a research slant to benefit use of the resource.

Feedback from the room: users don’t read supplied information! Proper site testing is recommended!

Notes by Neil Fitzgerald, British Library.

Session 6ii – Case Study – Targeted Language Resources for the Digitisation of Historical Collections

Unfortunately the camera battery ran out just before the last slide of this presentation!

Christoph Ringlstetter gave the context for his institution’s work on linguistic tools, looking at the specific challenges of historical material and composing a corpus of historical material.

Within IMPACT, CIS is helping to improve text recognition by adapting OCR software to historical material through the provision of specialist dictionaries, and by developing new types of user interface. Historical texts contain spelling variants and even words unknown to modern dictionaries. This causes current OCR engines to break down: in trying to make sense of the text they introduce the wrong word, or simply produce gibberish.

A user can fail to retrieve results, even if the OCR is good, if the modern spelling input by the user is not mapped to historical variants. In order to deal with this, we need lexica, language models (contemporary corpora), statistical information and normalisation with a mapping between modern and historical spelling. These need to be based on a corpus, but there are not many keyed materials on the web. Many institutions are not willing to give their material away, and this means approaching institutions one by one to gather a corpus. Another approach would be to create a corpus through keying or correcting OCR. At the moment CIS has obtained a development corpus of material. Two variants of lexica have been created: a hypothetical lexicon, based on historical transformation patterns, and a witnessed lexicon, collected manually.

The hypothetical approach requires no manual work, but does result in some mismatches. The witnessed lexicon is safer, but very expensive. In a test, use of these historical lexica significantly reduced the OCR error rate for 18th-century and later material, although earlier material needs a different approach, with a special lexicon.
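
A minimal sketch of the idea behind the hypothetical (pattern-based) lexicon, assuming a few invented transformation patterns rather than ones actually derived from CIS’s corpus:

```python
from itertools import product

# Illustrative transformation patterns (modern -> historical); in a real
# hypothetical lexicon these would be derived from a historical corpus.
PATTERNS = [
    ("i", "y"),    # e.g. "wit" -> "wyt"
    ("u", "v"),    # e.g. "upon" -> "vpon"
]

def hypothetical_variants(modern_word: str) -> set[str]:
    """Generate candidate historical spellings by applying every subset
    of the transformation patterns to the modern form."""
    variants = {modern_word}
    for choices in product([False, True], repeat=len(PATTERNS)):
        candidate = modern_word
        for (old, new), apply_pattern in zip(PATTERNS, choices):
            if apply_pattern:
                candidate = candidate.replace(old, new)
        variants.add(candidate)
    return variants

# A search for the modern spelling can then be expanded to its variants,
# so that "upon" also matches "vpon" in the OCRed text.
print(hypothetical_variants("upon"))
```

Because such pattern expansion over-generates, it produces the mismatches mentioned above; the witnessed lexicon avoids this at the cost of manual collection.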

In collaboration with the BSB, a special lexicon is being developed for books on theology written in the 16th century. This first requires creating a corpus, as none was in existence. The hope is to improve OCR with a special lexicon, and to improve information retrieval with a normalisation lexicon.

Notes by Emma Huber, UKOLN.

Session 6i – Case Study – British Library/JISC Newspaper project

Aly Conteh started off by referring back to the question posed before the break – he felt that suitability for digitisation should only be one factor in the selection of material to be digitised. For newspapers, there is a historical period when the OCR quality is not good, but the desire to digitise complete runs outweighed this. OCR is worthwhile, and a lot of work is being done to continually improve OCR results and overcome the “issues”.

There are 825 million pages to digitise in the British Library!

The newspaper project was funded by the JISC back in 2003. From the beginning, the BL wanted to do article zoning and OCR. The aim was to provide free access to the academic community, and to do out-of-copyright material, with UK-wide coverage. 160 million pages were eligible, so the BL worked with the community to decide which 4 million pages to digitise. It worked out at 48 titles, with great variety in size and layout.

Apostolos had gone through many of the problems with the source material earlier in the day – an additional one was animals, with a cat’s footprints shown across one newspaper page. In 2003, when the project started, it was not possible to come up with a figure for acceptable OCR accuracy levels – it would have been no more than a gut feeling. Microfilm was created initially, and digital images were created from that. Metadata, XML encoding and images then came together to create the resource.

The question of bitonal or greyscale was really decided by the content, which included many illustrations, making greyscale preferable.  The OCR results ranged from exceptionally good to worthless.  The BL really wanted to know how to measure how successful the OCR process was, bearing in mind it was an automated process, where images could not be tweaked individually. What do you measure? Character accuracy? Word accuracy? Accuracy of significant words? In fact, there is a very strong correlation between all of these factors, so it may not matter which measure you choose.
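
As an illustration of one of those measures, here is a minimal sketch of word accuracy against a manually keyed ground truth (the alignment is deliberately simple and the sample text invented; real evaluations use more careful alignment tooling):

```python
from difflib import SequenceMatcher

def word_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Proportion of ground-truth words that the OCR output reproduces,
    using a longest-common-subsequence style alignment of the word lists."""
    truth_words = ground_truth.split()
    recognised = ocr_output.split()
    matcher = SequenceMatcher(None, truth_words, recognised)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth_words) if truth_words else 1.0

truth = "The Morning Chronicle reported the debate in full"
ocr   = "Tbe Morning Chroniclc reported the debate in fuII"
print(f"word accuracy: {word_accuracy(truth, ocr):.0%}")  # 5 of 8 words correct
```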

OCR can come up with some unusual results, such as “internet” and “dvd” in 19th century newspapers – highlighting the need for historical dictionaries.

Why does good quality OCR matter? 70%-80% accuracy is ok, particularly with fuzzy search. Why try to improve it? One reason is that you need good quality text to be able to use text mining tools effectively. Progress with OCR is necessary to make this type of activity possible.
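
As a small illustration of why fuzzy search tolerates imperfect OCR, this sketch matches a query against OCR output containing typical recognition errors (the words and cutoff are illustrative):

```python
from difflib import get_close_matches

# Hypothetical OCR output containing recognition errors ("rn" for "m", "cl" for "d").
ocr_words = ["Parliarnent", "debated", "the", "Corm", "Laws", "yesterclay"]

# Fuzzy search: the user's query still finds the mangled word because
# matching tolerates a degree of character-level error.
print(get_close_matches("parliament", [w.lower() for w in ocr_words], cutoff=0.8))
# ['parliarnent']
```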

Storage has been an issue, and JPEG 2000 has been used to reduce the storage requirements.  Out of the total cost of the project, OCR took up about 5%. Effort was made to improve the quality of the article headings, and if the costs of this are added, the proportion spent on OCR goes up.

Notes by Emma Huber, UKOLN.

Session 5 – Improving and adding value to OCR results – the IMPACT project

As it was just 5 minutes before lunch, this session turned into a very brief overview of the IMPACT project.  Attention was drawn to the IMPACT project flyer, included in the delegate pack.

The EC has been supporting the creation of content for some time, e.g. with the i2010 Digital Library initiative. It is trying to bring together European cultural heritage, for example through the Europeana portal. It is a challenge to digitise all the material available, with only about 1% having been digitised so far.

OCR allows users to get to the texts, and supports the use of digitised images, but current OCR tools are not always satisfactory for historical documents. Collaborative correction offers a way forward here.

The IMPACT project is a large EU-funded project, led by the National Library of the Netherlands, and many of the partners are national or other large libraries that have already been involved in digitisation. The aim of the project is to improve the means of digitising historical material. There are further aims of sharing expertise and building capacity across Europe. These are challenging objectives, and cover the whole digitisation workflow. The work of the technical developers of OCR tools must be grounded in the real needs of libraries, and it is sometimes challenging to ensure that this focus is maintained.

IMPACT is a large project with 22 work packages split into 4 sub-projects. Some of the things the project is engaged in include: improving the information that can be extracted from images; adaptive OCR and experimental tools; enhancing and enriching the OCR text through collaborative correction, linguistic tools and the automatic creation of structural metadata; and strategic tools and services, such as decision support tools, learning resources and a Helpdesk. These will all become part of a developing Centre of Competence. Please look at the project website for further information!

Question: Is the project developing its own OCR engine?

It is working with ABBYY and IBM to develop their software, but there will also be some additional elements.

Question: The issues of OCR have an impact on the selection of material for digitisation. Do institutions currently only select items suitable for digitisation with current technology?

It is true that collections may be chosen if they are felt to have fewer problems, but work is ongoing to ensure more problematic items can be included.

The Natural History Museum has a pragmatic approach. The focus is on scanning materials at the best possible quality and on capturing particular keywords within the text for the museum’s current needs, but the aim is to OCR everything at a later date, when technology has improved sufficiently.

Notes by Emma Huber, UKOLN.

Session 4 – Document Image Analysis for Text Recognition

Apostolos Antonacopoulos spoke about the processes that are undertaken before OCR is done, looking at documents themselves, issues (problems… opportunities??) and workflow.

Digitised documents range from manuscripts to newspapers, from old, high-quality texts, to mass produced low-quality material. There are many different languages, often a combination within one text. There are different typefaces, again with variation within a single text, and layout varies greatly, with multi-columns, decorative borders, etc.

Some issues are inherent in the document itself: bleed-through, smear-over from the opposite page, typesetting peculiarities, paper texture etc. Some issues arise from use: folds, tears, annotations, stains, repairs, holes etc. Yet more issues arise from storage conditions: warping, discolouration, mould, shrinkage, fading etc. Some issues also arise during scanning, due to uneven illumination, for example. A number of examples of these phenomena were shown.

In a digitisation workflow, libraries (or other similar institutions) prepare documents, scan them, examine the original documents and the quality of the scans, and sometimes perform validation and correction of OCR. The results are then hosted. The costs of OCR can work out about the same as everything else put together – it is not a cheap extra!

The main steps for doing OCR are: scanning, image enhancement, layout analysis, OCR and post-processing.

Scanning quality varies – there are issues of resolution, colour depth and compression. Some institutions choose bitonal over compressed greyscale, even though compressed greyscale would contain more information about the original. Overhead scanners capture page curl, while book scanners such as Treventus largely eliminate this.

Image enhancement makes it easier for automated processes and makes the text more readable for humans. Processes typically include page splitting (if a double-page spread), border removal, dewarping, deskewing, geometrical correction, removal of “noise” (stains etc.) and binarisation (necessary for Document Image Analysis methods).
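
A minimal sketch of two of these enhancement steps, noise removal and binarisation, using OpenCV (the file name and parameters are illustrative; a production pipeline would also handle page splitting, border removal, dewarping and deskewing):

```python
import cv2

def enhance_page(path: str):
    """Read a page image, remove light speckle noise, and binarise it with
    Otsu's threshold so that later layout analysis works on clean black/white."""
    grey = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    denoised = cv2.medianBlur(grey, 3)   # light speckle ("noise") removal
    _, binary = cv2.threshold(denoised, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary

# binary = enhance_page("scanned_page.tif")       # hypothetical input file
# cv2.imwrite("scanned_page_bw.png", binary)
```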

Layout analysis includes the segmentation and classification of blocks, text lines, words and characters. It tends to work by analysing the spacing, but spaces between blocks are not always greater than spaces between words, so it is extremely problematic.
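
A toy sketch of the spacing-based idea for one sub-task, finding text lines from the horizontal projection profile of a binarised page (the synthetic “page” and gap threshold are illustrative; real layout analysis is far more sophisticated, for the reasons just given):

```python
import numpy as np

def find_text_lines(binary: np.ndarray, min_gap: int = 2) -> list[tuple[int, int]]:
    """Segment text lines by the projection profile: count ink pixels per row
    and treat sufficiently long runs of empty rows as gaps between lines.
    Returns (start_row, end_row_exclusive) pairs."""
    profile = binary.sum(axis=1)          # ink pixels in each row
    lines, start, blank_run = [], None, 0
    for y, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = y
            blank_run = 0
        elif start is not None:
            blank_run += 1
            if blank_run >= min_gap:      # enough blank rows: the line has ended
                lines.append((start, y - blank_run + 1))
                start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines

# A tiny synthetic "page": two 3-row bands of ink separated by blank rows.
page = np.zeros((12, 20), dtype=np.uint8)
page[1:4, :] = 1
page[7:10, :] = 1
print(find_text_lines(page))   # [(1, 4), (7, 10)]
```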

Some issues still need further work. Correct binarisation is still an issue, as are dewarping of difficult cases and correction of folds, local warping etc. Further research is being done on colour analysis to correct stains etc., dealing with noise, wide/narrow spacing, identifying advertisements when they occur next to the main text and article tracking in newspapers.

Notes by Emma Huber, UKOLN.

Session 3 – Introduction to OCR

Guenter Muehlberger gave a summary of the experiences of his institution in using OCR, and went on to present a few use cases. The first was the Digitised Card Catalogue, made in 2004, which allows two types of search: of entry points and of full text. There was some debate about whether to provide a full-text search, since the text was incomplete. A solution was to display the OCR output alongside the image, which allows users to see where errors in the OCR have occurred. Users were also given the opportunity to correct the OCR in a simple structured form. 20-30 cards are corrected in this way every day. There is a control mechanism to check for misuse or mistakes. It has been running for five years with good results.

Problems with OCR arise with poor image quality, handwritten notes, ink stamps, special characters etc. – 85% of words were recognised correctly overall, which means that about 70% of cards can be retrieved through full-text searching. The OCR costs were relatively low, and in the future it may be possible to automatically match records with existing WorldCat records, so there have been no regrets about the decision to OCR.

One company dominates the OCR field: ABBYY. Free software is available – Ocropus – which can sometimes produce good results, but it is a long way from being usable productively, and is definitely not being used by Google!

What do we expect from OCR accuracy rates? Researchers don’t want any errors at all in a critical edition. Publishers accept an error rate of maybe 1 error in 200,000 characters. A recent purchase of a commercial ebook had 2 errors per page – but this was acceptable for the material. It depends on the purpose!

If time had allowed, Guenter would have shown the importance of choosing the correct language when running OCR, but he went straight on to how to store the OCR output: either PDF, usually with the text behind the image, or XML, which can hold rich information about the object.

There are difficulties in measuring OCR accuracy (by characters, words, blocks of text, layout etc.). Guenter recommends measuring by word, as this is easy to measure and understand. As a rule of thumb, good results can be achieved for books printed from 1800 onwards, for newspapers, and for text produced on electric typewriters. Mechanical typewriters have inconsistencies which cause problems.

Top tips:

Many people do not analyse what they want to digitise: which are the important pieces, are there sections which are repeated, is there metadata available, are dictionaries available? The more that is known, the more precisely the OCR engine can be configured.

Select test material randomly! Do not hide difficult material in order to exaggerate the potential of the resource, nor select it in order to prove OCR will never work anyway!

Use dictionaries, train the engine to recognise particular characters – but only for frequently occurring special cases, and look into buying characters. Follow the developments from the IMPACT project, which will be improving FineReader, developing adaptive OCR, and producing language tools.

Question: Will the technology be relevant in the future?

There will always be a danger that texts will have to be re-OCRed as technology develops, but retaining high-quality images should make this possible. Rescanning remains available as a last resort.

Notes by Emma Huber, UKOLN.


