Sample pages & books for OCR analysis

As previously discussed, BHL has uncorrected text generated by Optical Character Recognition (OCR) software for each of its scanned volumes, and that uncorrected text has implications for data mining and accurate search. OCR results are notoriously poor because the technology hasn’t improved much since it was “solved” for forms processing in the mid-1980s. That does little to help BHL with the challenge of getting accurate text for a heterogeneous digital library spanning more than 500 years of printed publications.

Luckily, some new work on OCR and text correction is coming out of various projects, including:

The BHL development team and other stakeholders recently highlighted “improvements to OCR text” as an emerging priority for BHL to pursue. Now that we have a sizable repository, we’d like to turn development efforts towards enhancing our texts and making them as accurate as possible for our users and data consumers. Various tests and experiments will require a subset of BHL, and a list of candidate texts has begun forming at:
BHL Sample Texts for Retrieval and Evaluation

If you encounter other texts that pose a unique challenge, please point them out either via BHL’s Feedback form or in a comment below.

Written by

Chris Freeland served as the BHL Technical Director from 2006-2012. He is currently the Director of the Open Libraries program at Internet Archive. In this capacity he works with libraries & publishers to digitize their collections, working towards the Archive’s mission of providing “universal access to all knowledge.”