As it was just 5 minutes before lunch, this session turned into a very brief overview of the IMPACT project. Attention was drawn to the IMPACT project flyer, included in the delegate pack.
The EC has been supporting the creation of content for some time, e.g. with the i2010 Digital Library initiative. It is trying to bring together European cultural heritage, for example through the Europeana portal. Digitising all the available material is a challenge: only about 1% has been digitised so far.
OCR allows users to get to the texts, and supports the use of digitised images, but current OCR tools are not always satisfactory for historical documents. Collaborative correction offers a way forward here.
The IMPACT project is a large EU-funded project, led by the National Library of the Netherlands, and many of the partners are national or large libraries that have already been involved in digitisation. The aim of the project is to improve the means of digitising historical material. Further aims are to share expertise and build capacity across Europe. These are challenging objectives, and cover the whole digitisation workflow. The work of the technical developers of OCR tools must be grounded in the real needs of libraries, and it is sometimes a challenge to ensure that this focus is maintained.
IMPACT is a large project with 22 work packages split into 4 sub-projects. Some of the things the project is engaged in include: improving the information that can be extracted from images; adaptive OCR and experimental tools; enhancing and enriching the OCR text through collaborative correction, linguistic tools, and automatic creation of structural metadata; and strategic tools and services, such as decision support tools, learning resources and a Helpdesk. These will all become part of a developing Centre of Competence. Please look at the project website for further information!
Question: Is the project developing its own OCR engine?
The project is working with ABBYY and IBM to develop their software, but there will also be some additional elements.
Question: The issues of OCR have an impact on the selection of material for digitisation. Do institutions currently only select items suitable for digitisation with current technology?
It is true that collections may be chosen because they are felt to present fewer problems, but work is ongoing to ensure that more problematic items can be included.
The Natural History Museum has a pragmatic approach. The focus is on scanning materials at the best possible quality and on capturing particular keywords within the text for the museum’s current needs, with the aim of applying OCR to everything at a later date, when the technology has improved sufficiently.
Notes by Emma Huber, UKOLN.