As previously discussed, BHL has uncorrected text generated by Optical Character Recognition (OCR) software for each of its scanned volumes, and that uncorrected text has implications for data mining and accurate search. OCR results are notoriously poor because the technology hasn’t improved much since it was “solved” for forms processing in the mid-1980s. That doesn’t help BHL with our challenge of getting accurate text for a heterogeneous digital library spanning more than 500 years of printed publications.