‘Discovered Bibliographies’ through Natural Language Processing algorithms

“Names, especially those ascribed to organisms, serve as a primary entry point into the scientific, medical, and technical literature…”
– Garrity, Lyons, 2003, Future Proofing Biological Nomenclature

A characteristic of the Biodiversity Heritage Library (BHL) that distinguishes it from other mass digitization projects is the incorporation of service-based algorithms to identify scientific name strings throughout digitized content. These ‘taxonomically intelligent’ services, powered by uBio.org’s TaxonFinder and NameBank, have been incorporated into the BHL Portal to provide names-based interfaces into taxonomic literature.

To begin a search, visit http://www.biodiversitylibrary.org/NameSearch.aspx, or view an example ‘discovered bibliography’ for Tapirus bairdi (Baird’s Tapir), including an illustration, at http://www.biodiversitylibrary.org/name/Tapirus_bairdi. The ability to generate these ‘discovered bibliographies’ for taxa will enable users to data mine taxonomic literature for references and resources in ways not previously possible.

How it works
Each digitized page image in BHL has an accompanying OCR text file. As users navigate to a page, the uncorrected OCR file is sent to uBio’s TaxonFinder, which identifies text strings that match the characteristics of Latin binomials. Those potential name strings are then compared to the more than 10.7 million names in uBio’s NameBank, and the results, both matched and unmatched, are stored in the BHL database. Because NameBank is a growing repository, BHL also runs automated processes that reindex pages at regular intervals, so that previously unmatched strings can be matched as new names are added.
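To make the flow above concrete, here is a minimal sketch in Python. It is not TaxonFinder’s actual algorithm or NameBank’s interface: the regular expression is a deliberately naive stand-in for name detection, and the `NAMEBANK` dictionary, its identifiers, and the function names are all hypothetical.

```python
import re

# Hypothetical stand-in for NameBank: name string -> identifier.
# The identifiers below are invented for illustration only.
NAMEBANK = {
    "Tapirus bairdi": 1001,
    "Mimosa pudica": 1002,
}

# Naive binomial pattern (an assumption, not TaxonFinder's rules):
# a capitalized word followed by a lowercase word of 3+ letters.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")


def find_candidates(ocr_text):
    """Return candidate name strings in order of appearance."""
    return [f"{m.group(1)} {m.group(2)}" for m in BINOMIAL.finditer(ocr_text)]


def index_page(ocr_text, namebank=NAMEBANK):
    """Split candidates into matched (name, id) pairs and unmatched strings."""
    matched, unmatched = [], []
    for name in find_candidates(ocr_text):
        if name in namebank:
            matched.append((name, namebank[name]))
        else:
            unmatched.append(name)
    return matched, unmatched


page = ("The tapir Tapirus bairdi lives near Mimosa pudica shrubs. "
        "Tapirus bairdl was misread.")
matched, unmatched = index_page(page)
```

On this sample page the two real names are matched, while ‘The tapir’ (a false positive from ordinary prose) and ‘Tapirus bairdl’ (an OCR-style misreading) end up unmatched. Keeping the unmatched strings is what makes later reindexing and error analysis possible.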

What we’ve found
As of 20 Nov 2007, more than 6.8 million potential name strings have been identified throughout the BHL corpus, with more than 3.8 million matched to a corresponding NameBank identifier. There are more than 431,000 unique names within that 3.8 million set. Of those, more than 156,000 occur only once in the corpus. These results will be evaluated more thoroughly in the coming months to identify potential errors, such as false positives, and to determine how the TaxonFinder algorithm can be refined to reduce them.
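As a rough sense of scale, the proportions implied by these counts can be worked out directly. The figures are all reported as ‘more than’, so the sketch below treats them as approximate lower bounds and the resulting percentages as ballpark values only.

```python
# Counts as reported on 20 Nov 2007 (each is a "more than" figure).
candidates = 6_800_000      # potential name strings found
matched = 3_800_000         # strings matched to a NameBank identifier
unique_names = 431_000      # distinct names among the matched strings
single_occurrence = 156_000 # unique names seen only once

# Share of candidate strings that matched NameBank.
match_rate = matched / candidates

# Share of unique names that appear a single time.
singleton_rate = single_occurrence / unique_names

print(f"matched: about {match_rate:.0%} of candidate strings")
print(f"single occurrence: about {singleton_rate:.0%} of unique names")
```

In other words, a little over half of the candidate strings matched a known name, and roughly a third of the unique names appear just once, which is one reason single-occurrence names are a natural place to look for false positives.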

Caveat: These results are generated from uncorrected OCR, which ranges in quality from pretty good (contemporary publications, such as modern issues of Rhodora) to downright terrible (18th century Latin texts, such as Species Plantarum). Again, further evaluation is required to determine the full scope of this problem.

Where we’re headed
To see a simple example of how this can be used from external sites, check out the ‘External Links’ at the bottom of the Wikipedia article for Mimosa pudica L., the sensitive plant:
http://en.wikipedia.org/wiki/Mimosa_pudica
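For anyone wanting to add similar links from their own site, here is a small sketch of building a ‘discovered bibliography’ link from a scientific name. It assumes the URL pattern seen in the Tapirus bairdi example above (spaces replaced with underscores) holds for other names; the function name is made up for illustration.

```python
# Base URL taken from the Tapirus bairdi example link.
BHL_NAME_BASE = "http://www.biodiversitylibrary.org/name/"


def bhl_name_url(scientific_name):
    """Build a name link, assuming words are joined with underscores."""
    return BHL_NAME_BASE + "_".join(scientific_name.split())


print(bhl_name_url("Mimosa pudica"))
```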

Up next is development of a service layer on top of the names index so that other application providers can query and display ‘discovered bibliographies’ within their own applications. This service was originally scheduled for deployment in early 2008. Update: these services are now available for use.

Written by

Chris Freeland served as the BHL Technical Director from 2006-2012. He is currently the Director of the Open Libraries program at Internet Archive. In this capacity he works with libraries & publishers to digitize their collections, working towards the Archive’s mission of providing “universal access to all knowledge.”