Neil Fitzgerald of the British Library gave an overview of the digitisation process, starting with the key challenges: deciding on technical standards for image capture; developing workflow tools to reduce the costs of pre- and post-processing; operationalising project-driven digitisation; selecting material; creating metadata; and providing accurate OCR data.
The main focus of digitisation at the BL has previously been boutique digitisation of particular heritage items, often with private sponsorship. More recently, with Google’s entry into the market, things have changed. Competitors have come and gone, and national governments, as well as the EU, have taken the initiative. The equipment available has multiplied, allowing different approaches for different purposes. The next big thing will be the large-scale digitisation of historical material, such as manuscripts.
Google has industrialised the scanning process, with small adjustments throughout the workflow allowing increased efficiency and throughput. Further R&D is needed in this area, which is being undertaken by projects like IMPACT at the moment.
An example of a workflow allowing 100,000 images to be captured per day was shown, with quality assurance being done in a combination of automated and manual processes. Workflow steps have also been developed to feed back information about the collection to the library, such as OCR accuracy rates mapped to shelf mark, allowing future improved processes.
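A feedback step like the one described above, mapping OCR accuracy rates to shelf marks, could be sketched roughly as follows. This is an illustrative sketch only; the record format and field names are assumptions, not the BL's actual workflow:

```python
from collections import defaultdict

def accuracy_by_shelfmark(records):
    """Aggregate per-page OCR accuracy under each shelf mark.

    `records` is an iterable of (shelf_mark, accuracy) pairs, where
    accuracy is a float in [0.0, 1.0] reported by the QA step.
    Returns a dict mapping shelf mark to mean accuracy.
    """
    totals = defaultdict(lambda: [0.0, 0])  # shelf_mark -> [sum, count]
    for shelf_mark, accuracy in records:
        totals[shelf_mark][0] += accuracy
        totals[shelf_mark][1] += 1
    return {mark: s / n for mark, (s, n) in totals.items()}

# Hypothetical QA output: two shelf marks with differing OCR quality
report = accuracy_by_shelfmark([
    ("General Reference 1234.a", 0.95),
    ("General Reference 1234.a", 0.91),
    ("Newspapers 88.b", 0.62),
])
```

A report like this would let the library spot which parts of the collection (here, the hypothetical newspaper shelf mark) need improved processing in future passes.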
Metadata issues for digitisation include the lack of language information in traditional catalogue records – something which is of critical importance to OCR software, which uses dictionaries to improve accuracy.
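To illustrate why language metadata matters to OCR, here is a minimal sketch of selecting a dictionary from a language guess. The stopword lists and dictionary file names are invented for illustration; real OCR pipelines use proper language identification rather than this toy scoring:

```python
# Illustrative function-word lists per language (assumption, not a real model)
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in"},
    "de": {"der", "und", "die", "das", "ist"},
    "fr": {"le", "la", "et", "les", "des"},
}
# Hypothetical dictionary files an OCR engine might load
DICTIONARIES = {"en": "english.dic", "de": "german.dic", "fr": "french.dic"}

def guess_language(text):
    """Score each language by how many of its function words appear."""
    words = text.lower().split()
    scores = {
        lang: sum(w in stops for w in words)
        for lang, stops in STOPWORDS.items()
    }
    return max(scores, key=scores.get)

def dictionary_for(text):
    """Pick the OCR dictionary matching the guessed language."""
    return DICTIONARIES[guess_language(text)]
```

When the catalogue record carries no language field, some step like this has to recover it, and a wrong guess means the OCR engine corrects against the wrong dictionary.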
Back in 2006, there were no tools on the market for improving workflows. The BL developed a number of tools in house, such as one for filtering out items that could not be scanned for copyright reasons. Today there are two alternative solutions: the ARROW project and a working beta from OCLC.
The deliverables of a digitisation project must be suitable for all future requirements. There is currently a demand for e-books and print-on-demand services, which will have an impact on the way texts are digitised in the future and the workflows required.
Services ensuring permanent access to collections include HathiTrust, Planets, the Digital Preservation Coalition and LIFE³. These disparate services will have to come together to allow an efficient process.
Collaborative correction – there have been interesting developments from the Australian Newspapers programme, which has been much more successful than anticipated, with top correctors putting in more than 40 hours a week. There is a real demand for collaborative correction, especially if you can target particular user groups, such as family history enthusiasts, to deal with specific issues. The code for the collaborative correction software is now available for download. There is competition among correctors, building a real community around the resource. The IMPACT project is looking at improving and expanding this functionality.
The future? The UK needs a more coordinated approach – see the Digital Britain report. A number of smaller countries have done this to promote their cultural heritage, but the UK has a very complex structure, and the English language is so widespread that it is hard to find focus. Hopefully this will be developed in future. Europeana is available and stable, but it is still early days: it needs more tools, functionality and promotion.
Notes by Emma Huber, UKOLN.