Wednesday, November 21, 2007

'Discovered Bibliographies' through Natural Language Processing algorithms

"Names, especially those ascribed to organisms, serve as a primary entry point into the scientific, medical, and technical literature..."
- Garrity, Lyons, 2003, Future Proofing Biological Nomenclature

A characteristic of the Biodiversity Heritage Library (BHL) that distinguishes it from other mass digitization projects is the incorporation of service-based algorithms to identify scientific name strings throughout digitized content. These 'taxonomically intelligent' services, powered by uBio's TaxonFinder and NameBank, have been incorporated into the BHL Portal to provide names-based interfaces into taxonomic literature.

To begin a search, visit the BHL Portal, or view an example 'discovered bibliography' for Tapirus bairdi (Baird's Tapir), which includes an illustration. The ability to generate these 'discovered bibliographies' for taxa will enable users to data mine taxonomic literature for references and resources in ways not previously possible.

How it works
Each digitized page image in BHL has an accompanying OCR text file. As users navigate to a page, the uncorrected OCR file is sent to uBio's TaxonFinder, which identifies text strings that match the characteristics of Latin binomials. Those potential name strings are then compared to the 10.7 million+ names in uBio's NameBank, and the results, both matched and unmatched, are stored in the BHL database. BHL also has automated processes to reindex pages at regular intervals since NameBank is a growing repository.
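The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not uBio's actual TaxonFinder algorithm or the NameBank interface: the regular expression is a deliberately naive stand-in for "strings that match the characteristics of Latin binomials", and the names `find_candidates`, `index_page`, and `NAME_BANK` (with its made-up identifiers) are hypothetical.

```python
import re

# Hypothetical heuristic, NOT the real TaxonFinder rules: a capitalized
# word (genus) followed by a lowercase word (species epithet).
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")


def find_candidates(ocr_text):
    """Return strings in the OCR text that look like Latin binomials."""
    return [f"{genus} {epithet}" for genus, epithet in BINOMIAL.findall(ocr_text)]


def index_page(ocr_text, name_bank):
    """Compare candidates to a name list; keep both matched and unmatched."""
    matched, unmatched = [], []
    for name in find_candidates(ocr_text):
        if name in name_bank:
            matched.append((name, name_bank[name]))
        else:
            unmatched.append(name)
    return matched, unmatched


# Toy stand-in for NameBank; the identifiers are invented for illustration.
NAME_BANK = {"Tapirus bairdi": 1001, "Mimosa pudica": 1002}

page = "The tapir Tapirus bairdi feeds near Mimosa pudica. Another plant grows here."
matched, unmatched = index_page(page, NAME_BANK)
print(matched)    # [('Tapirus bairdi', 1001), ('Mimosa pudica', 1002)]
print(unmatched)  # ['The tapir', 'Another plant']
```

Note that the naive pattern also picks up ordinary phrases such as 'The tapir'. Storing unmatched strings alongside matched ones means a page can be rechecked later against a larger name list without running the finder step again.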

What we've found
As of 20 Nov 2007 more than 6.8 million potential name strings have been identified throughout the BHL corpus, with more than 3.8 million matched to a corresponding NameBank identifier. There are more than 431,000 unique names within that 3.8 million set. Of those, more than 156,000 are known by a single occurrence. These results will be evaluated more thoroughly in the coming months to identify potential errors, such as false positives, and to determine how to refine the TaxonFinder algorithm to reduce them.
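Taken at face value, these rounded figures suggest that roughly 56% of candidate strings matched a NameBank identifier (3.8 of 6.8 million) and that roughly 36% of the unique names occur only once (156,000 of 431,000). Because each figure is reported as "more than", these ratios are approximate. Counts of this kind could be derived from a list of matched occurrences as in the sketch below; the function name `summarize` and the sample data are hypothetical, not part of the BHL codebase.

```python
from collections import Counter


def summarize(matched_occurrences):
    """Count total occurrences, unique names, and names seen exactly once."""
    counts = Counter(matched_occurrences)
    singletons = sum(1 for n in counts.values() if n == 1)
    return {
        "occurrences": len(matched_occurrences),
        "unique": len(counts),
        "singletons": singletons,
    }


# Invented sample: one name per matched occurrence across all pages.
sample = ["Mimosa pudica", "Tapirus bairdi", "Mimosa pudica", "Quercus alba"]
stats = summarize(sample)
print(stats)  # {'occurrences': 4, 'unique': 3, 'singletons': 2}
```

Names seen only once are a natural place to begin looking for false positives, since an OCR error is unlikely to produce the same wrong string repeatedly.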

Caveat: These results are generated from uncorrected OCR, which ranges in quality from pretty good (contemporary publications, such as modern issues of Rhodora) to downright terrible (18th century Latin texts, such as Species Plantarum). Again, further evaluation is required to determine the full scope of this problem.

Where we're headed
To see a simple example of how this can be used from external sites, check out the 'External Links' at the bottom of the Wikipedia article for Mimosa pudica L., the sensitive plant.

Up next is development of a service layer on top of the names index so that other application providers can query & display 'discovered bibliographies' within their own applications. This service will be deployed in early 2008. Update: these services are now available for use.

Chris Freeland
chris.freeland (at) mobot (dot) org

Tuesday, November 13, 2007

Three Hundred Years of Linnaean Taxonomy

The Smithsonian's National Museum of Natural History hosted a day-long symposium to celebrate 300 years of Linnaean taxonomy. In addition to the symposium, the museum featured an exhibition of a 1st Edition of Linnaeus' Systema Naturae. The exhibition, "A Tribute to Carl Linnaeus, 1707-1778" (November 13-14), features the author's own copy of Systema Naturae (courtesy of the Swedish Embassy), with illustrations by Georg Dionysius Ehret. At the evening reception, the Biodiversity Heritage Library displayed the online version of the 1758 edition of Systema (from the Missouri Botanical Garden Library), and there was also an appearance by Linnaeus [as envisioned by Hans Odöö].

The Linnaean Systema was previously on display at the LuEsther T. Mertz Library of the New York Botanical Garden (November 8-10).

Friday, November 9, 2007

Flora, Fauna, and Fine Books

The latest issue of Fine Books & Collections (November/December 2007) includes an article on the Biodiversity Heritage Library. "Flora and Fauna: Creating a Global Library of Life, One Digital Page at a Time" by Rebecca Rego Barry and Scott Brown is an excellent overview of the BHL project.

The article includes interviews with Doug Holland (Director, Missouri Botanical Garden Library), Graham Higley (Chair, BHL board and Head of Library and Information Services, Natural History Museum, London), and Tom Garnett (BHL Director).

Bibliophiles will enjoy the illustration from Herbarius (1484) from the Missouri Botanical Garden (MBG) Library collections.

Doug Holland also steps out from behind his desk to pose with a portion of the MBG rare book collection.

- Martin Kalfatovic

Wednesday, November 7, 2007

BHL Portal, an early review

This site displays an elegantly designed simplicity that the web developer in me finds irresistible. It’s a marketing nightmare, but a researcher’s dream - a system to quickly and easily find the information you want with minimal distractions.

Is this any way to run an archive? You bet it is!

The above quote is from the "Family Matters" blog, a site dedicated to family historians. And no, it's not discussing the latest and greatest genealogy site, but rather the Biodiversity Heritage Library portal.

This is an encouraging notice, not only because it recognizes the outstanding usability and functionality that Chris Freeland and his team at the Missouri Botanical Garden are building into the site, but also because it shows that the BHL literature will have uses beyond the taxonomic community in areas we can't even think of at this time!
- Martin Kalfatovic