Thursday, September 11, 2008

Export of titles & scientific names in BHL now available for download

A series of files is now available for download that will enable libraries and other data providers to identify digitized titles available within BHL.

This suite of files also includes metadata about each volume scanned, as well as information about the millions of scientific names that have been identified throughout the BHL corpus and the pages on which those names occur.

Download files:
NOTE: These files represent a first cut at how we want to make data providers and libraries aware of the content within BHL. Yes, we will build services, including an OpenURL resolver, but for now our partners have asked for a low-barrier export that they can manipulate for their own specific uses. The files above are automatically generated from the BHL database on a monthly basis. The datestamp on the files themselves indicates when they were last generated.

If you are interested only in the titles we have digitized, and the items ("books" or "volumes") for each title, you only need to download the (significantly smaller) files for the following tables:
The full .zip download is not for the faint of heart! It's a monster file because it includes the export of the 36 million occurrences of scientific names (updated 3/13/2009; previously 27 million) identified in the BHL corpus through indexing by TaxonFinder.
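
For anyone working only with the smaller title-level files, here is a minimal sketch of one way to load two of the exports and group items under their titles. It is an illustration, not a description of the actual file layout: it assumes tab-delimited text with a header row, and the column name `TitleID` and the helper names are hypothetical. Because the metadata comes from many catalogues, it also tries more than one character encoding before giving up.

```python
import csv
import io


def decode_with_fallback(raw: bytes) -> str:
    """Decode bytes as UTF-8 if possible, otherwise fall back to
    Windows-1252 and finally Latin-1 (which accepts any byte)."""
    for encoding in ("utf-8-sig", "cp1252"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1")


def read_table(raw: bytes, delimiter: str = "\t") -> list[dict[str, str]]:
    """Parse one export into a list of row dictionaries.
    Assumes (hypothetically) a delimited file with a header row."""
    text = decode_with_fallback(raw)
    return list(csv.DictReader(io.StringIO(text, newline=""), delimiter=delimiter))


def items_by_title(titles, items, key="TitleID"):
    """Group item rows under the title they belong to.
    `key` is an assumed shared column name, not a confirmed one."""
    grouped = {row[key]: [] for row in titles}
    for item in items:
        grouped.setdefault(item[key], []).append(item)
    return grouped
```

In use, you would read each downloaded file in binary mode (for example `open(path, "rb").read()`), pass the bytes to `read_table`, and adjust the delimiter and key column to match what the files actually contain.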

Finally, we are considering this version a "warts and all" export. Merging the contents of multiple library catalogues and streamlining the digitization process to avoid duplication are the biggest challenges we face in building BHL, and to be frank our metadata is far from pristine in these early stages of our project. We are building functionality that allows librarians at BHL institutions to curate these digital books in ways that make sense to both scientists and librarians and that accommodate the variety of ways in which historic works have been catalogued over time. It's a challenge we've just begun to tackle, and we look forward to any and all feedback you care to provide.

Chris Freeland
BHL Technical Director
chris dot freeland at mobot dot org

1 comment:

  1. For our virtual library of biology (vifabio), we have downloaded BHL's title data and included it in our federated search, so information on digitized taxonomic literature from BHL can be retrieved together with information on holdings of some German biology libraries and information from several bibliographic databases.

    In our first approach, we had to limit our efforts to title information from the Title and TitleIdentifier tables, neglecting data on specific volumes of serials or multi-volume titles. We used Library of Congress Subject Headings (extracted from call numbers) to enrich title data with coarse descriptive terms for many titles. There have been many difficulties with heterogeneous character encoding, making any diacritical character a problem. Nevertheless, the task was no doubt worthwhile, and we see the integration of BHL's title data as a great enhancement of our virtual catalogue.

    In the future, we intend to update our downloads from time to time, and we would greatly welcome any improvements in the data, especially regarding completeness of author information and character encoding. Of course, for an implementation like ours, a dynamic interface to query bibliographic data in BHL would be very useful. But we know that these are difficult tasks, and that the digitization process itself will be your focus for a long time.

    (vifabio's web pages are in German, sorry, but some of the most important pages are available in English as well, and others will be in the near future.)