Friday, March 14, 2008

Harvesting Process from Internet Archive

NOTE: Internet Archive has changed their query interface and these instructions are no longer valid.

New instructions are available at:

The following steps are taken to download data from Internet Archive and host it on the Biodiversity Heritage Library. Diagrams of the process are available in PDF.
  1. Get item identifiers from Internet Archive for items in the "biodiversity" collection that have been recently added/updated.
  2. For each item identifier:
    • Get the list of files (XML and images) that are available for download.
    • Download the XML and image files.
    • Download the scan data if it is not included with the other downloaded files
    • Extract the item metadata from the XML files and store it in the import database.
    • Extract the OCR text from the XML files and store it on the file system (one file per page).
  3. For each "approved" item, clean up and transform the metadata into an "importable" format and store the results in the import database.
  4. Read all data that is ready for import and insert/update the appropriate data in the production database.
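The OCR extraction in step 2 (one text file per page) can be sketched as follows. This is a minimal illustration, not the BHL import code: it assumes the OCR XML holds one `OBJECT` element per page, with `WORD` elements grouped under `LINE` elements, and the function names and output file naming are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def split_ocr_by_page(ocr_xml: str) -> list[str]:
    """Return the OCR text of each page, in document order.

    Assumption: each page is an OBJECT element whose words are WORD
    elements nested under LINE elements.
    """
    root = ET.fromstring(ocr_xml)
    pages = []
    for page in root.iter("OBJECT"):
        lines = []
        for line in page.iter("LINE"):
            words = [word.text or "" for word in line.iter("WORD")]
            lines.append(" ".join(words))
        pages.append("\n".join(lines))
    return pages


def write_pages(pages: list[str], out_dir: str, item_id: str) -> list[Path]:
    """Write one UTF-8 text file per page; the naming scheme is hypothetical."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, text in enumerate(pages, start=1):
        path = target / f"{item_id}_{number:04d}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths
```

Keeping the parsing separate from the file writing lets the same page list feed both the file system and any later indexing step.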
Internet Archive Metadata Files
The following table lists the key XML files containing metadata for items hosted by Internet Archive. It is possible that one or more of these files may not exist for an item. However, most items that have been "approved" (i.e. marked as "complete" by Internet Archive) do include each of these files.




List of files that exist for the given identifier


Dublin Core metadata. In many cases the data included here overlaps with the data in the _meta.xml file.


Dublin Core metadata, as well as metadata specific to the item on IA (scan date, scanning equipment, creation date, update date, status of the item, etc)


Identifies the source of the item… not much meaningful data here


MARC data for the item.


The OCR for the item, formatted as XML.


Raw data about the scanned pages. In combination with the OCR text (_djvu.xml), the page numbers and page types can be inferred from this data. This file may not exist, though in most cases it does. For the most part, only materials added to IA prior to late summer 2007 are likely to be missing this file.


Raw data about the scanned pages. If there is no _scandata file for an item, we look in (via an IA API) for this file, which contains the same information.

Internet Archive Services
Search for Items
Internet Archive items belong to one or more collections. To search a particular Internet Archive collection for items that have been updated between two dates, use the following query:


{0} = name of the Internet Archive collection; in our case, "collection:biodiversity"
{1} = start date of range of items to retrieve
{2} = end date of range of items to retrieve

To limit the item search to a particular contributing institution, modify the query as follows:
?query={0}+AND+updatedate:[{1}+TO+{2}]+AND+contributor:(MBLWHOI Library)
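As an illustration, the query string with its placeholders can be assembled like this. It is a sketch only: the function name is hypothetical, the date format in the example is an assumption, and the values are inserted exactly as the template shows them, with no URL encoding. The base address of the search service is not included.

```python
from typing import Optional


def build_search_query(collection: str, start_date: str, end_date: str,
                       contributor: Optional[str] = None) -> str:
    """Fill in the {0}, {1} and {2} placeholders of the search query.

    collection  -- e.g. "collection:biodiversity"
    start_date  -- start of the update-date range (format is an assumption)
    end_date    -- end of the update-date range
    contributor -- optional contributing institution to filter on
    """
    query = f"?query={collection}+AND+updatedate:[{start_date}+TO+{end_date}]"
    if contributor:
        # Mirrors the contributor clause shown in the text above.
        query += f"+AND+contributor:({contributor})"
    return query


print(build_search_query("collection:biodiversity", "2008-03-01", "2008-03-14",
                         contributor="MBLWHOI Library"))
```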

To limit the results of the query to a particular number of items, modify the query as follows:


To search for one particular item, use:


{0} = an Internet Archive item identifier

Download Files
To download a particular file for an Internet Archive item, use the following query: {0}/{1}


{0} = an Internet Archive item identifier
{1} = the name of the file to be downloaded

Downloading Files Contained In ZIP Archives
In some cases, a file cannot be downloaded directly, and may instead need to be extracted from a ZIP archive located at Internet Archive. One example of this is the scandata.xml file, which in some cases must be extracted from the file. To do this, two queries must be made. First invoke this query to get the physical file locations (on IA servers) for the given item:


{0} = an Internet Archive item identifier

Then, invoke the second query to extract the scandata.xml file from the file (using the physical file locations returned by the previous query):



{0} = host address for the file
{1} = directory location for the file

Note that the second query can be generalized to extract the contents of other zip files hosted at Internet Archive. The format for the query is:



{0} = host address for the file
{1} = directory location for the file
{2} = name of the zip archive from which to extract a file
{3} = the name of the file to extract from the zip archive
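Where the server-side extraction is not an option, a similar result can be had by downloading the whole ZIP archive and pulling the file out locally. The sketch below shows that alternative using Python's standard zipfile module; it is not the Internet Archive service described above, and the function name and the matching by trailing file name are assumptions.

```python
import io
import zipfile


def extract_from_zip(zip_bytes: bytes, member_name: str) -> bytes:
    """Return the contents of member_name from an in-memory ZIP archive.

    The member is matched either by its full path inside the archive or
    by its trailing file name, since archives may nest files in a folder.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.filename == member_name or info.filename.endswith("/" + member_name):
                return archive.read(info)
    raise KeyError(f"{member_name} not found in archive")
```

The trade-off is bandwidth: the whole archive is transferred even when only one small file, such as scandata.xml, is needed.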

Documentation written by Mike Lichtenberg.

Tuesday, March 4, 2008

On Name Finding in the BHL

An important feature of the Biodiversity Heritage Library that sets it apart from other mass digitization projects is our incorporation of algorithms and services to mine taxonomically-relevant data from the 2.9 million (as of the date of this posting) pages digitized through our partnership with the Internet Archive. These services, including TaxonFinder, developed by partners at, allow BHL to identify words in digitized literature that match the characteristics of Latin-based scientific names, then verify that the word or words are a scientific name by comparing them to NameBank,'s repository of more than 10.7 million recorded scientific names and their variants. The resulting index of names found throughout these historic texts is an incredibly valuable dataset whose richness and uses are only beginning to be explored.

The massive index and interfaces to it are new (from development to production within 8 weeks), so the BHL Development Team has been gathering feedback from users, evaluating usage statistics, and working with both librarians and scientists to determine what is working with the interface and what needs refinement. The following issues have been identified:

1. Volume and scalability
BHL currently manages 2.9 million pages in its database, with each page equating to an image & its derivatives stored on a filesystem at the Internet Archive. Using uBio's services, we've located a total of 14.7 million name strings across texts, with 10.4 million of those verified to an entry in NameBank.

Scalability quickly becomes an issue as BHL expects to digitize 60 million pages within 5 years. Faced with hundreds of millions of name occurrences, the challenge becomes how to efficiently store and query this dataset. BHL data are currently stored in SQL Server 2005, which can scale to expected volumes and contains tools for load balancing and clustering. Ultimately, though, these issues of volume and scalability are resolvable as the dataset is not excessively complicated in structure. With enterprise-level hardware, optimized code and data access layers, and intelligent caching (all of which are currently in use), BHL can efficiently store and provide access to the vast index of scientific names identified through algorithmic means.
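A quick back-of-the-envelope check of the "hundreds of millions" estimate, using only the figures quoted above. The one assumption is that the number of names found per page stays roughly constant as the collection grows; the function name is hypothetical.

```python
def project_name_occurrences(pages_now: float, names_found: float,
                             pages_target: float) -> float:
    """Scale today's name count to a future page count.

    Assumes the density of names per page does not change.
    """
    names_per_page = names_found / pages_now
    return names_per_page * pages_target


PAGES_NOW = 2.9e6        # pages currently in BHL
NAMES_FOUND = 14.7e6     # name strings located
NAMES_VERIFIED = 10.4e6  # name strings verified against NameBank
PAGES_TARGET = 60e6      # pages expected within 5 years

names_per_page = NAMES_FOUND / PAGES_NOW
verified_share = NAMES_VERIFIED / NAMES_FOUND
projected = project_name_occurrences(PAGES_NOW, NAMES_FOUND, PAGES_TARGET)

print(f"names per page: {names_per_page:.1f}")
print(f"verified share: {verified_share:.0%}")
print(f"projected name occurrences: {projected / 1e6:.0f} million")
```

At about five names per page, 60 million pages works out to roughly 300 million name occurrences.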

2. OCR

Commercial Optical Character Recognition (OCR) programs, such as ABBYY FineReader or PrimeOCR, work very well for texts printed after the advent of industrialized and standardized printing techniques (loosely since the late 1800s). Unfortunately, the OCR programs are considerably less accurate on texts that match the characteristics of much of what BHL is scanning, including texts printed with irregular typeface and typesetting, and texts printed in multiple languages, including Latin.

The impact here is that if the texts are not accurately recognized, the names contained within can't be identified. The accuracy of the OCRed text is therefore incredibly important, and unfortunately nearly impossible to improve through automated means as OCR technology has not really changed much since the mid-1980's. Alternatives such as offshore rekeying or volunteer text conversion through the Distributed Proofreaders or other crowdsourcing projects are either prohibitively expensive or would require enormous effort above and beyond what could be volunteered given BHL's estimated page count. BHL is not alone in facing this problem; every initiative that OCRs historic texts has encountered this unfortunate gap in accuracy. If you are aware of any new efforts to improve OCR, please use the comment form below.

3. False positives
As BHL was indexing botanical texts, repeated occurrences of "Ovarium" were being located. This was an unusual result, as Ovarium is both an echinoderm (marine invertebrate) and a term used in botany to describe the lower part of the pistil, or female organ, of the flower. After reviewing the page occurrences it became clear that the TaxonFinder algorithm was accurately identifying a word and making a match to an entry in NameBank, but in this case the context was off. In nearly every entry, the word "ovarium" was not used to describe the marine invertebrate, but rather to describe the form of a flower in a taxonomic description. Similar false positives exist, such as Capsula and Fructus.

Upon further review, the problem is most prevalent with names used at higher classification levels; results for "Genus species", such as Carcharodon carcharias (Great white shark), are much less likely to be false positives. Clearly more evaluation is needed to understand the true magnitude of the problem, hopefully resulting in refinement of the TaxonFinder algorithm.
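One way to act on this observation during evaluation is to flag single-word matches that are also common descriptive terms, while letting two-word "Genus species" matches through. The sketch below is a hypothetical review filter, not part of TaxonFinder; the term list contains only the three examples named in this post.

```python
# Descriptive botanical terms that also match NameBank entries
# (the three examples mentioned above; a real list would be longer).
AMBIGUOUS_TERMS = {"ovarium", "capsula", "fructus"}


def needs_review(name_string: str) -> bool:
    """Return True for single-word matches that are known ambiguous terms.

    Multi-word names such as "Carcharodon carcharias" are not flagged,
    since they are much less likely to be false positives.
    """
    parts = name_string.split()
    return len(parts) == 1 and parts[0].lower() in AMBIGUOUS_TERMS


for candidate in ["Ovarium", "Carcharodon carcharias", "Hymenoptera"]:
    print(candidate, needs_review(candidate))
```

A filter like this only marks candidates for human review; deciding whether a given occurrence is really a false positive still depends on the surrounding text.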

4. Usability
Gregory Crane of Tufts University asked, in an oft-cited paper, "What Do You Do With a Million Books?" The challenge facing BHL Developers (and users) is more along the lines of "What do you do with 19,000 pages containing Hymenoptera?"

Because the BHL names index is growing rapidly, viewing and filtering results in a meaningful way becomes challenging. It's clear that a user isn't going to manually sift through and review every one of those pages. We can facilitate downloading the results in standard forms for reference management software, such as Zotero or EndNote, but how does BHL introduce relevancy rankings or other metrics for refining results? What exactly defines relevancy for occurrences of a name throughout scientific literature?

5. Accuracy and completeness
And now for a reality check. BHL text will never be 100% accurate, and our names index will never be 100% complete. We're using automated software and services to process the millions of pages in the BHL collection because anything but an automated analysis simply won't scale. The names index and the services that support its creation and display are modular: should radically new character or word recognition software come along, the scanned images can be reprocessed and reindexed using TaxonFinder. And should a better taxonomic name finding algorithm emerge, it can replace TaxonFinder in our application. As technologies emerge to improve text transcription and indexing, BHL will evaluate them and deploy them with our app if they prove effective.

Future work
It's clear that we've identified enhancements needed in TaxonFinder to reduce the number of false positives. How best to implement those enhancements is yet to be determined, but at least we have data to guide us. We also plan to enhance the interface used for the discovered bibliographies, as the current implementation is not performant for large result sets. Further, we expect to facilitate downloading of the results in a standard format, such as BibTeX.

In closing, BHL is currently employing emerging technologies to transcribe and index a large collection of digitized scientific literature, and providing innovative interfaces into the data mined from it. These interfaces are rapidly evolving to meet user needs, based on user feedback, so if you have a suggestion for improvement please provide it via our Feedback form or in the comments below.


Sunday, March 2, 2008

A Leap for All Life: BHL & EOL

Originally uploaded by martin_kalfatovic
The Biodiversity Heritage Library and the Encyclopedia of Life shared a table at the Congressional Family Night held at the Smithsonian's National Museum of Natural History.

The event (March 1, 2008) showcased a wide range of scientific endeavors engaged in by Smithsonian staff and was attended by members of Congress, their staff, and families.

Here, Cristián Samper, Acting Secretary of the Smithsonian and EOL steering committee member, looks on as Gil Taylor (Smithsonian Institution Libraries) and Dawn Mason (EOL) demonstrate the recently launched EOL species pages.