As previously discussed, BHL has uncorrected text generated by Optical Character Recognition (OCR) software for each of its scanned volumes, and that uncorrected text has implications for data mining and accurate search. OCR results are notoriously poor because the technology hasn’t improved much since it was “solved” for forms processing in the mid-1980s. That doesn’t help BHL with our challenge of getting accurate text for a heterogeneous digital library spanning more than 500 years of printed publications.
Luckily, several projects are doing new work on OCR and text correction, including:
- Rod Page’s examples with Wikisource and a Google-like text selector
- National Library of Australia’s Trove project (with their code)
- EU-funded Improving Access to Text (IMPACT) project
The BHL development team and other stakeholders recently highlighted “improvements to OCR text” as an emerging priority for BHL to pursue. Now that we have a sizable repository, we’d like to turn development efforts towards enhancing our texts and making them as accurate as possible for our users and data consumers. We need a subset of BHL for various tests and experiments, and a list of candidate texts has begun forming at:
BHL Sample Texts for Retrieval and Evaluation
If you encounter other texts that pose a unique challenge, please indicate them either via BHL’s Feedback form or in a comment below.