Sample pages & books for OCR analysis
As previously discussed, BHL has uncorrected text generated by Optical Character Recognition (OCR) software for each of its scanned volumes, and that uncorrected text has implications for data mining and accurate search. OCR results are notoriously poor because the technology hasn’t improved much since it was “solved” for forms processing in the mid-1980s. That is little help to BHL, whose challenge is getting accurate text for a heterogeneous digital library spanning more than 500 years of printed publications.
Luckily, some new work on OCR and text correction is happening in various projects, including:
- Rod Page’s examples with Wikisource and a Google-like text selector
- National Library of Australia’s Trove project (with their code)
- EU-funded Improving Access to Text (IMPACT) project
The BHL development team and other stakeholders recently highlighted “improvements to OCR text” as an emerging priority for BHL to pursue. Now that we have a sizable repository, we’d like to turn development efforts towards enhancing our texts and making them as accurate as possible for our users and data consumers. We need a subset of BHL for various tests and experiments, and a list has begun forming at:
BHL Sample Texts for Retrieval and Evaluation
If you encounter other texts that pose a unique challenge, please indicate them either via BHL’s Feedback form or in a comment below.
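For anyone who wants to compare OCR output against a hand-corrected transcription of one of the sample pages, one common measure is the character error rate: the edit distance between the two texts divided by the length of the corrected text. The following is only a minimal sketch of that calculation. The function names and the sample strings are illustrative assumptions, not part of any BHL tool or dataset.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, reference: str) -> float:
    """Edit distance divided by the length of the corrected reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(ocr_text, reference) / len(reference)


if __name__ == "__main__":
    # Hypothetical strings for illustration; not taken from a BHL volume.
    reference = "The quick brown fox"
    ocr_text = "Tbe quick hrown fox"
    print(f"CER: {character_error_rate(ocr_text, reference):.3f}")
```

A lower value means the OCR text is closer to the corrected transcription. This pure-Python version is quadratic in text length, so it suits single pages rather than whole volumes.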
Paul, any updates on the "Online Volunteers Portal" work?
Are you referring to OCR technology in general or the OCR engine in DjVu? I don't have any experience with DjVu (yet), but I've run tests on the online images in BHL using another OCR application and achieved greater accuracy with default settings. If it's possible to construct a DjVu file from images and XML, it should theoretically be possible to separate the OCR process from the output file creation and use the most appropriate application for each.
Hi Chris,
Last week Ely Wallis and I held discussions around an Atlas of Living Australia project to further develop the Trove OCR correction application and integrate it with BHL Australia. I am leading the project, called "Online Volunteers Portal"; it is scheduled for 2011.