On Name Finding in the BHL

An important feature of the Biodiversity Heritage Library that sets it apart from other mass digitization projects is our incorporation of algorithms and services to mine taxonomically relevant data from the 2.9 million pages (as of the date of this posting) digitized through our partnership with the Internet Archive. These services, including TaxonFinder, developed by partners at uBio.org, allow BHL to identify words in digitized literature that match the characteristics of Latin-based scientific names, then verify that the word or words are a scientific name by comparing them to NameBank, uBio.org’s repository of more than 10.7 million recorded scientific names and their variants. The resulting index of names found throughout these historic texts is an incredibly valuable dataset whose richness we have only begun to explore.
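To make the two-step idea concrete (spot candidate strings, then verify them against a name repository), here is a minimal sketch in Python. It is not TaxonFinder's actual algorithm and it does not call any uBio service: the regular expression, the `find_names` function, and the tiny `KNOWN_NAMES` set standing in for NameBank are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for NameBank: a tiny in-memory set of known names.
KNOWN_NAMES = {"Carcharodon carcharias", "Hymenoptera", "Ovarium"}

# Candidate pattern (an assumption): a capitalized word, optionally
# followed by one lowercase word, as in "Genus species".
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?: [a-z]+)?\b")


def find_names(text, known=KNOWN_NAMES):
    """Return (name_string, verified) pairs for each candidate in text."""
    found = []
    for match in CANDIDATE.finditer(text):
        candidate = match.group(0)
        first_word = candidate.split(" ")[0]
        if candidate in known:
            # The full string (one or two words) is a known name.
            found.append((candidate, True))
        elif first_word in known:
            # Only the capitalized word is known; drop the trailing word.
            found.append((first_word, True))
        else:
            # Looks name-like but could not be verified.
            found.append((candidate, False))
    return found


sample = ("The shark Carcharodon carcharias was described. "
          "Order Hymenoptera includes wasps.")
for name, verified in find_names(sample):
    print(name, "verified" if verified else "unverified")
```

Ordinary capitalized words such as "The" are picked up as candidates too, which is why the verification step against a repository matters.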

The massive index and interfaces to it are new (from development to production within 8 weeks), so the BHL Development Team has been gathering feedback from users, evaluating usage statistics, and working with both librarians and scientists to determine what is working with the interface and what needs refinement. The following issues have been identified:

1. Volume and scalability
BHL currently manages 2.9 million pages in its database, with each page equating to an image and its derivatives stored on a filesystem at the Internet Archive. Using uBio’s services, we’ve located a total of 14.7 million name strings across texts, with 10.4 million of those verified against an entry in NameBank.

Scalability quickly becomes an issue as BHL expects to digitize 60 million pages within 5 years. Faced with hundreds of millions of name occurrences, the challenge becomes how to efficiently store and query this dataset. BHL data are currently stored in SQL Server 2005, which can scale to expected volumes and contains tools for load balancing and clustering. Ultimately, though, these issues of volume and scalability are resolvable as the dataset is not excessively complicated in structure. With enterprise-level hardware, optimized code and data access layers, and intelligent caching (all of which are currently in use), BHL can efficiently store and provide access to the vast index of scientific names identified through algorithmic means.
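The "hundreds of millions" figure can be checked with back-of-the-envelope arithmetic from the numbers above. The sketch below assumes, purely for illustration, that the current density of name strings per page and the current verification rate both hold as the collection grows; the variable names are made up for this example.

```python
# Figures quoted in this post.
pages_now = 2_900_000
name_strings = 14_700_000
verified = 10_400_000
pages_target = 60_000_000

# Assumption: name density and verification rate stay constant as BHL grows.
names_per_page = name_strings / pages_now            # roughly 5 per page
verified_share = verified / name_strings             # roughly 71%
projected_strings = names_per_page * pages_target    # roughly 304 million
projected_verified = projected_strings * verified_share

print(f"{names_per_page:.2f} name strings per page")
print(f"{verified_share:.0%} verified against NameBank")
print(f"{projected_strings / 1e6:.0f} million name strings at 60 million pages")
print(f"{projected_verified / 1e6:.0f} million of those verified")
```

Under those assumptions the index would hold on the order of 300 million name strings, which is what drives the storage and query concerns described here.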

2. OCR
Commercial Optical Character Recognition (OCR) programs, such as ABBYY FineReader or PrimeOCR, work very well for texts printed after the advent of industrialized and standardized printing techniques (loosely since the late 1800s). Unfortunately, these programs are considerably less accurate on texts that match the characteristics of much of what BHL is scanning, including texts printed with irregular typeface and typesetting, and texts printed in multiple languages, including Latin.

The impact here is that if the texts are not accurately recognized, the names contained within can’t be identified. The accuracy of the OCRed text is therefore incredibly important, and unfortunately nearly impossible to improve through automated means, as OCR technology has not really changed much since the mid-1980s. Alternatives such as offshore rekeying or volunteer text conversion through Distributed Proofreaders or other crowdsourcing projects are either prohibitively expensive or would require enormous effort above and beyond what could be volunteered given BHL’s estimated page count. BHL is not alone in facing this problem; every initiative that OCRs historic texts has encountered this unfortunate gap in accuracy. If you are aware of any new efforts to improve OCR, please use the comment form below.

3. False positives
As BHL was indexing botanical texts, repeated occurrences of “Ovarium” were being located. This was an unusual result, as Ovarium is both the name of an echinoderm (marine invertebrate) and a term used in botany to describe the lower part of the pistil, or female organ, of the flower. After reviewing the page occurrences, it became clear that the TaxonFinder algorithm was accurately identifying a word and matching it to an entry in NameBank, but in this case the context was off. In nearly every occurrence, the word “ovarium” was not used to describe the marine invertebrate, but rather to describe the form of a flower in a taxonomic description. Similar false positives exist, such as Capsula and Fructus.

Upon further review, the problem is most prevalent with names used at higher classification levels; results for “Genus species”, such as Carcharodon carcharias (Great white shark), are much less likely to be false positives. Clearly more evaluation is needed to understand the true magnitude of the problem, hopefully resulting in refinement of the TaxonFinder algorithm.
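One simple way to act on that observation would be to flag single-word matches for closer review while letting two-word “Genus species” matches through. The sketch below is only an illustration of that idea, not a planned TaxonFinder change; the `needs_review` function and the sample list are hypothetical.

```python
def needs_review(name_string):
    """Flag single-word matches, which appear more prone to false positives."""
    return len(name_string.split()) == 1


# Sample matches taken from the examples in this post.
matches = ["Ovarium", "Capsula", "Fructus", "Carcharodon carcharias"]

flagged = [m for m in matches if needs_review(m)]
trusted = [m for m in matches if not needs_review(m)]

print("Flag for review:", flagged)
print("Likely reliable:", trusted)
```

A rule this blunt would also flag legitimate higher-level names such as Hymenoptera, so it could only serve as a first filter ahead of any context-aware check.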

4. Usability
Gregory Crane of Tufts University asked, in an oft-cited paper, “What Do You Do With a Million Books?” The challenge facing BHL Developers (and users) is more along the lines of “What do you do with 19,000 pages containing Hymenoptera?”

Because the BHL names index is growing rapidly, viewing and filtering results in a meaningful way becomes challenging. It’s clear that a user isn’t going to manually sift through and review every one of those pages. We can facilitate downloading the results in standard forms for reference management software, such as Zotero or EndNote, but how does BHL introduce relevancy rankings or other metrics for refining results? What exactly defines relevancy for occurrences of a name throughout scientific literature?

5. Accuracy and completeness
And now for a reality check. BHL text will never be 100% accurate, and our names index will never be 100% complete. We’re using automated software and services to process the millions of pages in the BHL collection because anything but an automated analysis simply won’t scale. The names index and the services that support its creation and display are modular: should radically new character or word recognition software come along, the scanned images can be reprocessed and reindexed using TaxonFinder. And should a better taxonomic name finding algorithm emerge, it can replace TaxonFinder in our application. As technologies emerge to improve text transcription and indexing, BHL will evaluate them and deploy them with our app if they prove effective.

Future work
It’s clear that we’ve identified enhancements needed in TaxonFinder to reduce the number of false positives. How best to implement those enhancements is yet to be determined, but at least we have data to guide us. We also plan to enhance the interface used for the discovered bibliographies, as the current implementation is not performant for large result sets. Further, we expect to facilitate downloading of the results in a standard format, such as BibTeX.
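As a rough idea of what a BibTeX download might involve, the sketch below serializes one bibliographic record into a BibTeX entry. The `to_bibtex` function, the choice of the `@book` entry type, and the sample record are all assumptions for illustration; the record is made-up placeholder data, not an item from the BHL collection, and real titles would need escaping of characters that are special in BibTeX.

```python
def to_bibtex(key, record):
    """Format a dict of field names to values as a BibTeX @book entry."""
    lines = [f"@book{{{key},"]
    for field, value in record.items():
        # Each field becomes: name = {value},
        lines.append(f"  {field} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


# Placeholder record, invented for this example.
example = {
    "title": "An Example Flora",
    "publisher": "Example Press",
    "year": "1900",
}

print(to_bibtex("example1900", example))
```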

In closing, BHL is currently employing emerging technologies to transcribe and index a large collection of digitized scientific literature, and providing innovative interfaces into the data mined from it. These interfaces are rapidly evolving to meet user needs, based on user feedback, so if you have a suggestion for improvement, please provide it via our Feedback form or in the comments below.

Written by

Chris Freeland served as the BHL Technical Director from 2006 to 2012. He is currently the Director of the Open Libraries program at Internet Archive. In this capacity he works with libraries & publishers to digitize their collections, working towards the Archive’s mission of providing “universal access to all knowledge.”