Sunday, December 7, 2008

COinS integrated to support Zotero, other reference management software

We have pushed a change to the BHL Portal user interface to enhance the usability of our books. The new page is available at:
http://www.biodiversitylibrary.org/bibliography/4323

Our goal with the improvement was to make the page more visually informative and, at the same time, easier to understand. We also took this opportunity to add code that makes BHL more readily indexed by Zotero and other reference management applications. We've embedded COinS (ContextObjects in Spans) into the page above, as well as into the page-turning view, at:
http://www.biodiversitylibrary.org/item/23172

COinS are snippets of bibliographic metadata embedded in a page using a span tag (hence the name). Reference management applications, notably Zotero, use these snippets to automatically populate an entry, so BHL users can build reference lists within their citation managers while using our site. Here's what COinS look like:

<span class="Z3988" title="ctx_ver=Z39.88-2004 &amp;rft_id=info%3aoclcnum%2f1903126 &amp;rft_id=http%3a%2f%2fwww.biodiversitylibrary.org%2fitem%2f23172 &amp;rft_val_fmt=info%3aofi%2ffmt%3akev%3amtx%3abook
&amp;rft.genre=book &amp;rft.btitle=At+last%3a+a+Christmas+in+the+West+Indies.+ &amp;rft.place=London%2c &amp;rft.pub=Macmillan+and+co.%2c &amp;rft.aufirst=Charles &amp;rft.aulast=Kingsley &amp;rft.au=Kingsley%2c+Charles%2c
&amp;rft.pages=1-352 &amp;rft.tpages=352"></span>
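For sites that want to emit their own COinS, the encoding is straightforward. Here's a minimal sketch (in Python, not the portal's actual code) that builds a span like the one above from a dictionary of OpenURL fields, with values taken from the example:

from urllib.parse import urlencode
from html import escape

def coins_span(fields):
    # ctx_ver identifies the OpenURL ContextObject version; the key-value
    # pairs are URL-encoded and stored in the span's title attribute.
    pairs = [("ctx_ver", "Z39.88-2004")] + list(fields.items())
    return '<span class="Z3988" title="%s"></span>' % escape(urlencode(pairs))

print(coins_span({
    "rft_id": "http://www.biodiversitylibrary.org/item/23172",
    "rft_val_fmt": "info:ofi/fmt:kev:mtx:book",
    "rft.genre": "book",
    "rft.btitle": "At last: a Christmas in the West Indies.",
    "rft.aulast": "Kingsley",
    "rft.aufirst": "Charles",
}))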


Unfortunately, there is a known issue with Zotero's handling of COinS that describe journals themselves, so users get an error when trying to add a page like the following to Zotero:
http://www.biodiversitylibrary.org/bibliography/8188

The workaround would violate the COinS standard (which is based on OpenURL), so we decided to err on the side of good metadata and standards compliance and publish the COinS correctly. There are other ways of making Zotero work for journals, which we will implement in the next UI release.

Please comment below with suggestions for improvement or issues concerning these updates.

Chris Freeland

Monday, November 24, 2008

10,000,000 pages!

Sometime over the past weekend, the Biodiversity Heritage Library portal loaded its 10 millionth page!

Due to the way volumes are ingested from the various scanning centers, it's a bit tricky to pick which was the EXACT 10 millionth page, but for the sake of this blog post, I'm going to say that it was one of the pages of Coleopterorum catalogus by Junk and Schenkling. I'm picking this item because, as you taxonomic cognoscenti out there know, beetles (Coleoptera) are perhaps the most species-rich group of animals. Indeed, the noted biologist J.B.S. Haldane is reputed to have quipped that, if nothing else, nature reveals that God has "an inordinate fondness for beetles."

But I digress ...

Ten million is a big number, but still an order of magnitude short of the 60-100 million pages that the Biodiversity Heritage Library hopes to make available in the coming years. Those 10 million pages represent nearly 25,000 volumes (and about 8,700 titles) - a respectable, but not massive, library of taxonomic literature. We've learned a lot along the way (factoid: taxonomic literature averages about 417 pages per volume) and are working on ways to expand what we can digitize and how.

Thanks to all the staff in the contributing libraries and our scanning partner (the Internet Archive) for getting us to the 10 million mark.

Details of the "10 millionth page"

Coleopterorum catalogus.

By:
Junk, Wilhelm, 1866-1942
Schenkling, Sigmund, 1865-

Publication info:
Berlin : W. Junk, 1910-1940.

Call Number:
QL571 .C692

Subjects:
Beetles, Periodicals

Contributing Library:
Smithsonian Institution Libraries

Monday, October 20, 2008

An evaluation of taxonomic name finding

Starting this past June, BHL worked with Qin Wei, a Ph.D. student in Library and Information Science at the University of Illinois Urbana-Champaign, to evaluate the taxonomic name finding software and algorithms used to identify scientific names throughout the BHL corpus. This work led to some interesting findings, which were reported this week via poster and oral presentation at the Biodiversity Information Standards (TDWG) 2008 conference in Fremantle, Australia.

View Presentation


Methodology
  • Scholarly volunteers manually identified scientific names on a random sample of 392 pages in BHL (0.01% of the BHL corpus at the time of the study).
  • Compared those names against the OCR text, then against two name-finding algorithms (TaxonFinder & FAT)
Characteristics of the sample
  • Number of Pages: 392
  • Average Number of Words per Page: 446.8
  • Average Number of Names per Page: 7.7
  • Total Number of Names: 3003
  • Total Number of Unique Names: 2610
OCR Errors
  • Of the 3,003 names, 1,056 were incorrectly transcribed by OCR, for an error rate of 35.16%
  • Top OCR errors
    1 Insert Space
    2 Omit Space
    3 e->c
    4 u->I
    5 u->n
    6 i->l
    7 c->e
    8 n->v
    9 l->i
    10 r->i
    11 u->ii
    12 h->l
    13 h->ii
    14 e->o
Performance of the algorithms
  • TaxonFinder
    • Excluding names with OCR errors
      • Precision 40.32%
        Recall 36.62%
        F-score 38.47%
    • Including names with OCR errors
      • Precision 43.77%
        Recall 25.82%
        F-score 34.80%
  • FAT
    • Excluding names with OCR errors
      • Precision 28.20%
        Recall 23.34%
        F-score 25.77%
    • Including names with OCR errors
      • Precision 32.25%
        Recall 17.21%
        F-score 24.73%
Considerations
  • Improving OCR software is out of current scope for BHL
    • investigations into Tesseract may be worthwhile
  • Rekeying is too expensive and will not scale
Recommendations
  • Enhance “fuzzy” retrieval in algorithms
    • Exception rules to overcome OCR errors
  • More work needed in this space
    • More evaluations & experiments
    • Robust training sets
      • reCAPTCHA for names?
For additional information about the study, please post questions in the comments below rather than by e-mail, so the answers are visible to all readers.
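For readers unfamiliar with these measures, here is a quick sketch (illustrative Python, not the study's code) of how precision, recall, and F-score are computed, where "found" is the set of names an algorithm returned for a page and "truth" is the set the volunteers identified:

def scores(found, truth):
    tp = len(found & truth)  # names the algorithm found that the volunteers confirmed
    precision = tp / len(found) if found else 0.0
    recall = tp / len(truth) if truth else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

p, r, f = scores({"Carcharodon carcharias", "Ovarium"},
                 {"Carcharodon carcharias", "Poa annua"})
print("precision=%.2f recall=%.2f f-score=%.2f" % (p, r, f))
# precision=0.50 recall=0.50 f-score=0.50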

Thursday, September 11, 2008

Export of titles & scientific names in BHL now available for download

A series of files is now available for download that will enable libraries and other data providers to identify digitized titles available within BHL.

This suite of files also includes metadata about each volume scanned, as well as information about the millions of scientific names that have been identified throughout the BHL corpus and the pages on which those names occur.

Download files:
NOTE: These files represent a first cut at how we want to make data providers and libraries aware of the content within BHL. Yes, we will build services, including an OpenURL resolver, but for now our partners have asked for a low-barrier export that they can manipulate for their own specific uses. The files above are automatically generated from the BHL database on a monthly basis. The datestamp on the files themselves indicates when they were last generated.

If you are interested only in the titles we have digitized, and the items ("books" or "volumes") for each title, you only need to download the (significantly smaller) files for the following tables:
The full .zip download is not for the faint of heart! It's a monster file because it includes the export of the 36 million occurrences of scientific names (originally 27 million; count updated 3/13/2009) identified in the BHL corpus through indexing by TaxonFinder.

Finally, we are considering this version a "warts and all" export. Merging the contents of multiple library catalogues and streamlining the digitization process to avoid duplication are the biggest challenges we face in building BHL, and to be frank, our metadata is far from pristine in these early stages of our project. We are building functionality that allows librarians at BHL institutions to curate these digital books in ways that make sense to both scientists and librarians and that accommodate the variety of ways in which historic works have been catalogued over time. It's a challenge we've just begun to tackle, and we look forward to any and all feedback you care to provide.

Chris Freeland
BHL Technical Director
chris dot freeland at mobot dot org

Tuesday, August 19, 2008

But where are the articles??

Many researchers are used to searching or browsing for materials by article. Article level access to BHL content is a goal that we're striving for, and one that we haven't yet reached!

BHL is a mass scanning operation. Our member libraries are moving as quickly as possible through a range of materials - books, serials, etc. - in order to scan as much as possible during our relatively brief window of funding. Our goal is to scan & cache now, then add in advanced technology solutions for secondary post-processing as they are developed.

We've found that in scanning historic scientific monographs and journals, article identification is too labor intensive (and expensive) to do by hand. BHL staff, through connections formed by our scanning partner, Internet Archive, have been working with Penn State's College of Information Sciences and Technology (the developers behind CiteSeer) to provide a test bed algorithm to extract article metadata from historic literature.

There are a number of challenges in our digitized historic literature that cause even the most scalable, sophisticated algorithms to return inaccurate results. These include:
  • uncorrected source OCR (accuracy is problematic)
  • multiple foreign languages, including Latin
  • irregular printing processes and type setting in historic literature
  • change of printing process and issue frequency during course of a journal run
Still, progress is being made:
  • Penn State's algorithms have been demonstrated, as in this example
  • They need access to a wider testbed for improved machine learning, which is now available (7.4 million pages in BHL as of this writing)
But the work is far from finished. Next steps:
  • need to refine algorithms
  • need interfaces for human editing
    • too many possible inaccuracies upstream
    • possibly distribute this task via Mechanical Turk or other 'clickworking' network?
  • define workflow
    • first pass: algorithms; second pass: volunteers; editorial review?
But what about Google, you say? They're scanning books en masse. They're smart. Haven't they solved the problem?? The quick answer is "No, not really." Take a look at the "Contents" section for:
Zoologist: A Monthly Journal of Natural History, ser.4 v.12 1908

This is what Google can do...with all of their resources & grey matter. BHL is many things to many people, but we're certainly not Google!

And here's where we need your input:
So what can you, an enthusiastic supporter of BHL who wants access to articles in our collection, do to help? We're glad you asked!
  1. Regardless of the means by which we actually get the article metadata, we've got to store that alongside all of our other content in BHL. We have released an UPDATED (9/3/2008) data model supporting articles for review & comment, guided by our research into NLM, OpenURL, and OAI-ORE, and with help from the expert developers behind the Public Library of Science.
  2. We need to hear how you, enthusiastic BHL supporter, expect to access and use article-based content. We're looking for information about the sites you use and like with similar content, as well as your general expectations for delivery of articles in BHL. We know that's a wide open question; here's your chance to bend our ear(s).
Please use the "Comment" feature below to drop off your suggestions and ideas for both topics. Or, if you'd prefer to keep your opinions private, e-mail them to chris (dot) freeland (at) mobot (dot) org.

Looking forward to the feedback, and to providing this important method of access to the wealth of content in BHL.

Chris Freeland
Technical Director, BHL

Tuesday, August 5, 2008

Revised BHL Data Model

The latest revision of the BHL Data Model is now available for review at:
http://www.biodiversitylibrary.org/documents/BHLDataModel_20080805.pdf

Tuesday, July 22, 2008

Revised BHL Architecture

A revised diagram & description of the BHL hardware architecture is available at:

http://www.slideshare.net/chrisfreeland/bhl-architecture-july-2008/

Friday, June 13, 2008

Updated Harvesting Process from the Internet Archive

Note: This is a revision of our previous blog post that described our process for harvesting digitized books from the Internet Archive. Their query interface changed, and we've updated our process & documentation accordingly.

Disclaimer: BHL is not directly or indirectly involved with the development of this query interface. We scan books through Internet Archive and are consumers of their services & interfaces. We have provided this documentation to help inform others of our process. Questions or comments concerning the query interface, results returned, etc., should be directed to the Internet Archive.

-Chris Freeland, BHL Technical Director


Overview
The following steps are taken to download data from Internet Archive and host it on the Biodiversity Heritage Library. Diagrams of the process are available in PDF.
  1. Get item identifiers from Internet Archive for items in the "biodiversity" collection that have been recently added/updated.
  2. For each item identifier:
    • Get the list of files (XML and images) that are available for download.
    • Download the XML and image files
    • Download the scan data if it is not included with the other downloaded files
    • Extract the item metadata from the XML files and store it in the import database.
    • Extract the OCR text from the XML files and store it on the file system (one file per page).
  3. For each "approved" item, clean up and transform the metadata into an "importable" format and store the results in the import database.
  4. Read all data that is ready for import and insert/update the appropriate data in the production database.
Internet Archive Metadata Files
The following table lists the key XML files containing metadata for items hosted by Internet Archive. It is possible that one or more of these files may not exist for an item. However, most items that have been "approved" (i.e. marked as "complete" by Internet Archive) do include each of these files.

*iaidentifier*_files.xml - List of files that exist for the given identifier

*iaidentifier*_dc.xml - Dublin Core metadata. In many cases the data included here overlaps with the data in the _meta.xml file.

*iaidentifier*_meta.xml - Dublin Core metadata, as well as metadata specific to the item on IA (scan date, scanning equipment, creation date, update date, status of the item, etc.)

*iaidentifier*_metasource.xml - Identifies the source of the item; not much meaningful data here

*iaidentifier*_marc.xml - MARC data for the item.

*iaidentifier*_djvu.xml - The OCR for the item, formatted as XML.

*iaidentifier*_scandata.xml - Raw data about the scanned pages. In combination with the OCR text (_djvu.xml), the page numbers and page types can be inferred from this data. This file may not exist, though in most cases it does; for the most part, only materials added to IA prior to late summer 2007 are likely to be missing it.

scandata.xml - Raw data about the scanned pages. If there is no *iaidentifier*_scandata file for an item, we look in scandata.zip (via an IA API) for this file, which contains the same information.



Internet Archive Services
Search for Items
Internet Archive items belong to one or more collections. To search a particular Internet Archive collection for items that have been updated between two dates, use the following query:

http://www.archive.org/advancedsearch.php
?q={0}+AND+oai_updatedate:[{1}+TO+{2}]
&fl[]=identifier&fl[]=oai_updatedate&fmt=xml&xmlsearch=Search

where

{0} = name of the Internet Archive collection; in our case, "collection:biodiversity"
{1} = start date of range of items to retrieve (YYYY-MM-DD)
{2} = end date of range of items to retrieve (YYYY-MM-DD)

To limit the item search to a particular contributing institution, modify the query as follows:

http://www.archive.org/advancedsearch.php
?q={0}+AND+oai_updatedate:[{1}+TO+{2}]+AND+contributor:(MBLWHOI Library)
&fl[]=identifier&fl[]=oai_updatedate
&rows=100000&fmt=xml&xmlsearch=Search

To limit the results of the query to a particular number of items, adjust the rows parameter as follows:

http://www.archive.org/advancedsearch.php
?q={0}+AND+oai_updatedate:[{1}+TO+{2}]
&fl[]=identifier&fl[]=oai_updatedate
&rows=100000&fmt=xml&xmlsearch=Search

To search for one particular item, use:

http://www.archive.org/advancedsearch.php
?q={0}&fl[]=identifier&fl[]=oai_updatedate
&fmt=xml&xmlsearch=Search

where

{0} = an Internet Archive item identifier
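To make the query concrete, here is a minimal sketch (standard-library Python, not our production harvester) that fills in the template above and pulls item identifiers out of the response. The element names used when parsing the XML response are an assumption about its format:

from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

def find_updated_items(collection, start_date, end_date):
    params = [
        ("q", "%s AND oai_updatedate:[%s TO %s]" % (collection, start_date, end_date)),
        ("fl[]", "identifier"),
        ("fl[]", "oai_updatedate"),
        ("rows", "100000"),
        ("fmt", "xml"),
        ("xmlsearch", "Search"),
    ]
    url = "http://www.archive.org/advancedsearch.php?" + urlencode(params)
    tree = ET.parse(urlopen(url))
    # Assumes identifiers appear as <str name="identifier"> elements.
    return [el.text for el in tree.iter("str") if el.get("name") == "identifier"]

ids = find_updated_items("collection:biodiversity", "2008-06-01", "2008-06-13")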

Download Files
To download a particular file for an Internet Archive item, use the following query:

http://www.archive.org/download/{0}/{1}

where

{0} = an Internet Archive item identifier
{1} = the name of the file to be downloaded
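In code form, the download template above is a one-liner (a trivial sketch; the identifier and filename below are hypothetical):

from urllib.request import urlopen

def download_file(ia_identifier, filename):
    # {0} = item identifier, {1} = file name, per the template above
    url = "http://www.archive.org/download/%s/%s" % (ia_identifier, filename)
    return urlopen(url).read()

marc_xml = download_file("someitemid", "someitemid_marc.xml")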

Downloading Files Contained In ZIP Archives
In some cases, a file cannot be downloaded directly, and may instead need to be extracted from a ZIP archive located at Internet Archive. One example of this is the scandata.xml file, which in some cases must be extracted from the scandata.zip file. To do this, two queries must be made. First invoke this query to get the physical file locations (on IA servers) for the given item:

http://www.archive.org/services/find_file.php
?file={0}
&loconly=1


where

{0} = an Internet Archive item identifier

Then, invoke the second query to extract the scandata.xml file from the scandata.zip file (using the physical file locations returned by the previous query):

http://{0}/zipview.php
?zip={1}/scandata.zip
&file=scandata.xml


where

{0} = host address for the file
{1} = directory location for the file

Note that the second query can be generalized to extract the contents of other zip files hosted at Internet Archive. The format for the query is:

http://{0}/zipview.php
?zip={1}/{2}
&file={3}


where

{0} = host address for the file
{1} = directory location for the file
{2} = name of the zip archive from which to extract a file
{3} = the name of the file to extract from the zip archive
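Putting the two queries together, here is a minimal sketch in Python. Parsing of the find_file.php response is left to the caller, since its exact format isn't documented here:

from urllib.request import urlopen

def locate_item_files(ia_identifier):
    # Step 1: ask where the item's files physically live on IA's servers.
    url = ("http://www.archive.org/services/find_file.php"
           "?file=%s&loconly=1" % ia_identifier)
    return urlopen(url).read()  # caller parses out host and directory

def fetch_scandata(host, directory):
    # Step 2: extract scandata.xml from scandata.zip, using the host and
    # directory returned by the first query.
    url = ("http://%s/zipview.php?zip=%s/scandata.zip&file=scandata.xml"
           % (host, directory))
    return urlopen(url).read()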


Documentation written by Mike Lichtenberg.

Wednesday, June 4, 2008

WonderFetch(tm) & IA _meta.xml fields

Overview

WonderFetch is the term used for prepopulating the Internet Archive's metadata forms (so named because it is more wonderful than regular Z39.50 fetching). Using WonderFetch, partner libraries can populate fields with data that would not normally be populated as part of the standard IA process, and then store those values in the foobar_meta.xml file alongside each scanned item in the IA repository. Part of the impetus for implementing WonderFetch was not just to automate the inclusion of volume and issue information for serials – which was important – but also to capture due diligence, rights, and licensing information related to each item. (And yes, the TM is a little joke! No rights reserved.)


How does it work?

WonderFetch simply passes a series of parameters to the IA metaform software in a URL string. If you can do a Z39.50 query against your ILS and get your data into a format that lets you generate a URL (say, an HTML page output from a database, or a spreadsheet with your item data in it), you can WonderFetch!

To create WonderFetch links from your data, just append the relevant arguments and data (listed below) to one of 2 base URLs.

If your books are being “loaded” – that is, if the metadata fetch is occurring on a scribe2 machine and the scanner operator is using WonderFetch at the SCRIBE machine – use http://localhost.archive.org/biblio.php? as the base URL.

If your books are being pre-loaded or batch loaded on another computer before being scanned, or your SCRIBE is using the scribe1 software, use http://www.us.archive.org/biblio.php? as the base URL. Note that with this URL the scanner operator can't actually start shooting the book; the link only allows them to fetch metadata and create the foobar_meta.xml and marc.xml records.

For example, at SIL the scanner loads the books at the scribe station, and we Z-fetch on our barcode number (starts with 39088) so our URLs look like:

http://localhost.archive.org/biblio.php?f=c
&b_c1=biodiversity&b_l=Smithsonian%20Institution%20Libraries
&b_p=Smithsonian&z_e=Smithsonian%20Institution
&b_v=v.%209%201907&b_cn=no
&z_c=local&z_d=39088009080136&b_ib=39088009080136
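If your item data live in a spreadsheet or database, generating these links is a few lines of scripting. A sketch in Python using the SIL values above (urlencode encodes spaces as "+", which is equivalent to the %20 in the example):

from urllib.parse import urlencode

BASE = "http://www.us.archive.org/biblio.php?"  # pre-load/batch base URL

def wonderfetch_url(barcode, volume):
    params = [
        ("f", "c"),
        ("b_c1", "biodiversity"),
        ("b_l", "Smithsonian Institution Libraries"),
        ("b_p", "Smithsonian"),
        ("z_e", "Smithsonian Institution"),
        ("b_v", volume),
        ("b_cn", "no"),
        ("z_c", "local"),
        ("z_d", barcode),   # number used for the Z39.50 fetch
        ("b_ib", barcode),  # IDENTIFIER-BIB
    ]
    return BASE + urlencode(params)

print(wonderfetch_url("39088009080136", "v. 9 1907"))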



Below are the arguments each field takes, along with examples. The examples given are the values BHL partners will be using (e.g., for the rights statement).


(For additional reference, some definitions and usage of the standard IA fields can be found at http://www.us.archive.org/biblio?f=usage)





LIST OF FIELDS in META_XML that can be pre-populated using WonderFetch:


CALL_NUMBER

Description: Number for ZQuery – this is *not* necessarily the call number of the item. It *is* the number used to fetch the MARC record for the item via Z39.50 from your ILS. Whatever works for you – barcode, bib number, OCLC number, call number, etc.
Prepopulatable: Yes.
WonderFetch GET arg: "&z_d="
example: &z_d=39088009080136

IDENTIFIER-BIB
Description: Unique identifier for the item in Contributor library's catalog.
Prepopulatable: Yes
WonderFetch GET arg: "&b_ib="
example: &b_ib=39088009080136

TITLE-ID
Description: Unique identifier for title in Contributor library's catalog.
Prepopulatable: Yes
WonderFetch GET arg: "&tid="
example: &tid=b12345

VOLUME
Description: Volume of Book being scanned.
Prepopulatable: Yes.
WonderFetch GET arg: &b_v=
example: &b_v=v.%209%201907 (v. 9 1907)

YEAR
Description: Year assigned to Book being scanned
Prepopulatable: Yes.
WonderFetch GET arg: &year=
example: &year=1907


COLLECTION

Description: Collection(s) into which Book will be sorted.
Prepopulatable: Yes
WonderFetch GET arg: "&b_c1=" (c1 indicates primary collection, there is also c2 and c3)
example: &b_c1=biodiversity

CONTRIBUTOR

Description: Library contributing Book for scanning
Prepopulatable: Yes
WonderFetch GET arg: "&b_l="
example: &b_l=Smithsonian%20Institution%20Libraries

SPONSOR
Description: Organization responsible for funding scanning
Prepopulatable: Yes
WonderFetch GET arg: "&b_p="
example: "&b_p=Sloan"

SCANNINGCENTER
Description: Scanning Center where book was scanned.
Prepopulatable: yes
Get arg: "&b_n="
example: &b_n=Boston

DUE-DILIGENCE
Description: URL to Due Diligence statement for a Book scanned while still in Copyright.
Prepopulatable: Yes
WonderFetch GET arg: &dd=
example: &dd=dd-bhl (this signifies a due diligence statement exists at the following URL: http://www.biodiversitylibrary.org/permissions )

LICENSE-TYPE
Description: Creative Commons License assigned to Book scanned.
Prepopulatable: Yes
WonderFetch GET arg: "&lic="
&lic=by (http://creativecommons.org/licenses/by/3.0/)
&lic=by-nc (http://creativecommons.org/licenses/by-nc/3.0/)
&lic=by-nd (http://creativecommons.org/licenses/by-nd/3.0/)
&lic=by-sa (http://creativecommons.org/licenses/by-sa/3.0/)
&lic=by-nc-nd (http://creativecommons.org/licenses/by-nc-nd/3.0/)
&lic=by-nc-sa (http://creativecommons.org/licenses/by-nc-sa/3.0/)

NEGOTIATED-RIGHTS
Description: URL to Negotiated Rights for a Book scanned while still in Copyright.
Prepopulatable: Yes
WonderFetch GET arg: "&rights="
&rights=nr-bhl (this signifies that a statement of right to digitize exists at the following URL: http://www.biodiversitylibrary.org/permissions/)

POSSIBLE-COPYRIGHT-STATUS
Description: Indicates copyright status; defaults to Not in Copyright
Prepopulatable: Yes
WonderFetch GET arg: "&pcs="
To make this field blank – because the item *is* in copyright but you have permission to digitize – pass a URL-encoded space as the parameter, e.g. &pcs=%20

- Keri Thompson, Smithsonian Libraries (thompsonk@si.edu)

Friday, April 25, 2008

Better maps! More bibliographic detail!

New BHL Map interface

Following some excellent suggestions gathered at a recent Encyclopedia of Life meeting, we've made changes to our Google Maps browse interface. To recap, we take Library of Congress Subject Headings and geocode and map them using the Google Maps API (details here).

Now that we're managing nearly 10,000 volumes, the standard Google Maps interface was getting cluttered and clunky, so we've refined the interface to show smaller points, weight the results using color, and display links to the titles for a given subject heading within the map itself (as demonstrated for "Africa" above). To view the map in full, visit http://www.biodiversitylibrary.org/browse/map.

We made another change based on requests to view full bibliographic details for a scanned title. When we harvest scans from the Internet Archive, we copy the MARCXML for the title to our servers and siphon off just enough of the metadata to facilitate our browse & search capabilities - to pull in the contents of the entire MARCXML would unnecessarily bloat our database with info we don't expect to search across or expose via browse. But, it's important data to have in the display, so we've skinned the MARCXML using XSLTs provided by the Library of Congress. To view in action, click the "Brief|Detailed|MARC" links at http://www.biodiversitylibrary.org/bibliography/1583, or for any title in our collection.

Finally, we've enhanced the display for our Discovered Bibliographies to return results in a more performant way, providing more visual feedback to the user that processes are at work. To view the refined interface, visit the result for Pomatomus saltatrix at http://www.biodiversitylibrary.org/name/Pomatomus_saltatrix.

Tuesday, April 15, 2008

BHL Portal Updates!

The BHL portal (http://www.biodiversitylibrary.org) has been updated with the following changes:

  • A new option has been added to filter results by the language in which items are published. For example, Titles published in English, or Authors with works published in German. This option complements the pre-existing option to filter results by contributing institution.
  • An advanced search page (http://www.biodiversitylibrary.org/advancedsearch.aspx) has been added. This page allows a user to search on any combination of search categories (Titles, Authors, Names, or Subjects), instead of just one or all of the categories. It also allows search results to be limited by the publishing language.
  • OCLC numbers associated with each title have been cleaned up. This means that the “Find in a local library” link on each title’s bibliography page should now work correctly.
  • Sorting of individual items within a single title has been improved. See the right side of this page (http://www.biodiversitylibrary.org/bibliography/702) for an example. You can see that the volumes are listed in order, v1 to v92. Prior to the correction, the volumes sorted as a plain string sort: v1, v10, v11, v12, … v2, v20, v21, v22, … v3, v30, v31, v32 (see the sketch after this list for the "natural order" idea behind the fix).
  • Call numbers, when available, should now display correctly on bibliography pages.
  • Some minor updates have been made to the page that displays the discovered bibliography for a name. An example is http://www.biodiversitylibrary.org/name/Poa_annua. Changes have been made to retrieve data as needed, instead of all at once, which has improved performance greatly. However, there remains a lengthy delay in retrieving large data sets, so we know that more work is needed here. At a minimum, we know that we need to improve the feedback given to the user while large data sets are being retrieved.
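For the curious, the volume-sorting fix mentioned in the list above amounts to a "natural order" comparison. A minimal sketch of the idea (Python; not our actual implementation): numeric runs inside a label are compared as numbers, so v2 sorts before v10.

import re

def natural_key(label):
    # Split into digit and non-digit runs; compare the digit runs numerically.
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", label)]

volumes = ["v1", "v10", "v11", "v2", "v20", "v3"]
print(sorted(volumes, key=natural_key))
# ['v1', 'v2', 'v3', 'v10', 'v11', 'v20']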

Also of note:

Some inconsistencies with title information have been identified. We had believed that the MARC leader assigned to an item would be sufficient to uniquely identify a title. This has turned out not to be the case (affecting about one-half of one percent of the titles we’ve ingested from Internet Archive), so we’ve had to adjust how we identify which items belong to which titles. The cleanup of this data is ongoing.

- Mike Lichtenberg

Friday, March 14, 2008

Harvesting Process from Internet Archive

NOTE: Internet Archive has changed their query interface and these instructions are no longer valid.

New instructions are available at:
http://biodiversitylibrary.blogspot.com/2008/06/updated-harvesting-process-from.html

Overview
The following steps are taken to download data from Internet Archive and host it on the Biodiversity Heritage Library. Diagrams of the process are available in PDF.
  1. Get item identifiers from Internet Archive for items in the "biodiversity" collection that have been recently added/updated.
  2. For each item identifier:
    • Get the list of files (XML and images) that are available for download.
    • Download the XML and image files
    • Download the scan data if it is not included with the other downloaded files
    • Extract the item metadata from the XML files and store it in the import database.
    • Extract the OCR text from the XML files and store it on the file system (one file per page).
  3. For each "approved" item, clean up and transform the metadata into an "importable" format and store the results in the import database.
  4. Read all data that is ready for import and insert/update the appropriate data in the production database.
Internet Archive Metadata Files
The following table lists the key XML files containing metadata for items hosted by Internet Archive. It is possible that one or more of these files may not exist for an item. However, most items that have been "approved" (i.e. marked as "complete" by Internet Archive) do include each of these files.

_files.xml - List of files that exist for the given identifier

_dc.xml - Dublin Core metadata. In many cases the data included here overlaps with the data in the _meta.xml file.

_meta.xml - Dublin Core metadata, as well as metadata specific to the item on IA (scan date, scanning equipment, creation date, update date, status of the item, etc.)

_metasource.xml - Identifies the source of the item; not much meaningful data here

_marc.xml - MARC data for the item.

_djvu.xml - The OCR for the item, formatted as XML.

_scandata.xml - Raw data about the scanned pages. In combination with the OCR text (_djvu.xml), the page numbers and page types can be inferred from this data. This file may not exist, though in most cases it does; for the most part, only materials added to IA prior to late summer 2007 are likely to be missing it.

scandata.xml - Raw data about the scanned pages. If there is no _scandata file for an item, we look in scandata.zip (via an IA API) for this file, which contains the same information.


Internet Archive Services
Search for Items
Internet Archive items belong to one or more collections. To search a particular Internet Archive collection for items that have been updated between two dates, use the following query:

http://www.archive.org/services/search.php
?query={0}+AND+updatedate:[{1}+TO+{2}]
&submit=submit

where

{0} = name of the Internet Archive collection; in our case, "collection:biodiversity"
{1} = start date of range of items to retrieve
{2} = end date of range of items to retrieve

To limit the item search to a particular contributing institution, modify the query as follows:

http://www.archive.org/services/search.php
?query={0}+AND+updatedate:[{1}+TO+{2}]+AND+contributor:(MBLWHOI Library)
&submit=submit

To limit the results of the query to a particular number of items, modify the query as follows:

http://www.archive.org/services/search.php
?query={0}+AND+updatedate:[{1}+TO+{2}]
&limit=1000
&submit=submit

To search for one particular item, use:

http://www.archive.org/services/search.php
?query={0}
&submit=submit

where

{0} = an Internet Archive item identifier

Download Files
To download a particular file for an Internet Archive item, use the following query:

http://www.archive.org/download/{0}/{1}

where

{0} = an Internet Archive item identifier
{1} = the name of the file to be downloaded

Downloading Files Contained In ZIP Archives
In some cases, a file cannot be downloaded directly, and may instead need to be extracted from a ZIP archive located at Internet Archive. One example of this is the scandata.xml file, which in some cases must be extracted from the scandata.zip file. To do this, two queries must be made. First invoke this query to get the physical file locations (on IA servers) for the given item:

http://www.archive.org/services/find_file.php
?file={0}
&loconly=1


where

{0} = an Internet Archive item identifier

Then, invoke the second query to extract the scandata.xml file from the scandata.zip file (using the physical file locations returned by the previous query):

http://{0}/zipview.php
?zip={1}/scandata.zip
&file=scandata.xml


where

{0} = host address for the file
{1} = directory location for the file

Note that the second query can be generalized to extract the contents of other zip files hosted at Internet Archive. The format for the query is:

http://{0}/zipview.php
?zip={1}/{2}
&file={3}


where

{0} = host address for the file
{1} = directory location for the file
{2} = name of the zip archive from which to extract a file
{3} = the name of the file to extract from the zip archive


Documentation written by Mike Lichtenberg.

Tuesday, March 4, 2008

On Name Finding in the BHL

An important feature of the Biodiversity Heritage Library that sets it apart from other mass digitization projects is our incorporation of algorithms and services to mine taxonomically relevant data from the 2.9 million (as of the date of this posting) pages digitized through our partnership with the Internet Archive. These services, including TaxonFinder, developed by partners at uBio.org, allow BHL to identify words in digitized literature that match the characteristics of Latin-based scientific names, then verify that the word or words are a scientific name by comparing them to NameBank, uBio.org's repository of more than 10.7 million recorded scientific names and their variants. The resulting index of names found throughout these historic texts is an incredibly valuable dataset whose richness we have only begun to explore.

The massive index and interfaces to it are new (from development to production within 8 weeks), so the BHL Development Team has been gathering feedback from users, evaluating usage statistics, and working with both librarians and scientists to determine what is working with the interface and what needs refinement. The following issues have been identified:

1. Volume and scalability
BHL currently manages 2.9 million pages in its database, with each page equating to an image & its derivatives stored on a filesystem at the Internet Archive. Using uBio's services, we've located a total of 14.7 million name strings across texts, with 10.4 million of those verified to an entry in NameBank.

Scalability quickly becomes an issue, as BHL expects to digitize 60 million pages within 5 years. Faced with hundreds of millions of name occurrences, the challenge becomes how to efficiently store and query this dataset. BHL data are currently stored in SQL Server 2005, which can scale to the expected volumes and includes tools for load balancing and clustering. Ultimately, though, these issues of volume and scalability are resolvable, as the dataset is not excessively complicated in structure. With enterprise-level hardware, optimized code and data access layers, and intelligent caching (all of which are currently in use), BHL can efficiently store and provide access to the vast index of scientific names identified through algorithmic means.

2. OCR

Commercial Optical Character Recognition (OCR) programs, such as ABBYY FineReader or PrimeOCR, work very well for texts printed after the advent of industrialized and standardized printing techniques (loosely, since the late 1800s). Unfortunately, OCR programs are considerably less accurate on texts that match the characteristics of much of what BHL is scanning, including texts printed with irregular typeface and typesetting, and texts printed in multiple languages, including Latin.

The impact here is that if the texts are not accurately recognized, the names contained within can't be identified. The accuracy of the OCRed text is therefore incredibly important, and unfortunately nearly impossible to improve through automated means, as OCR technology has not changed much since the mid-1980s. Alternatives such as offshore rekeying or volunteer text conversion through Distributed Proofreaders or other crowdsourcing projects are either prohibitively expensive or would require effort far beyond what could be volunteered given BHL's estimated page count. BHL is not alone in facing this problem; every initiative that OCRs historic texts has encountered this unfortunate gap in accuracy. If you are aware of any new efforts to improve OCR, please use the comment form below.

3. False positives
As BHL was indexing botanical texts, repeated occurrences of "Ovarium" were being located; an unusual result, as Ovarium is both an echinoderm (marine invertebrate) and a term used in botany to describe the lower part of the pistil, the female organ of the flower. After reviewing the page occurrences it became clear that the TaxonFinder algorithm was accurately identifying a word and matching it to an entry in NameBank, but in this case the context was off. In nearly every entry, the word "ovarium" was used not to describe the marine invertebrate, but rather the form of a flower in a taxonomic description. Similar false positives exist, such as Capsula and Fructus.

Upon further review, the problem is most prevalent with names used at higher classification levels; results for "Genus species", such as Carcharodon carcharias (Great white shark), are much less likely to be false positives. Clearly more evaluation is needed to understand the true magnitude of the problem, hopefully resulting in refinement of the TaxonFinder algorithm.
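One illustrative mitigation (not part of TaxonFinder itself) that falls out of this observation: score uninomial matches lower than binomial "Genus species" matches, and flag known ambiguous terms for review. A hypothetical sketch:

# Hypothetical confidence scoring based on the observation above.
KNOWN_AMBIGUOUS = {"Ovarium", "Capsula", "Fructus"}  # botanical terms that collide with genus names

def confidence(name_string):
    if len(name_string.split()) >= 2:
        return "high"    # binomial, e.g. Carcharodon carcharias
    if name_string in KNOWN_AMBIGUOUS:
        return "low"     # likely a descriptive term, not a taxon
    return "medium"      # other uninomials need context to verify

for name in ["Carcharodon carcharias", "Ovarium", "Hymenoptera"]:
    print(name, "->", confidence(name))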

4. Usability
Gregory Crane of Tufts University asked, in an oft-cited paper, "What Do You Do With a Million Books?" The challenge facing BHL Developers (and users) is more along the lines of "What do you do with 19,000 pages containing Hymenoptera?"

Because the BHL names index is growing rapidly, viewing and filtering results in a meaningful way becomes challenging. It's clear that a user isn't going to manually sift through and review every one of those pages. We can facilitate downloading the results in standard formats for reference management software, such as Zotero or EndNote, but how does BHL introduce relevancy rankings or other metrics for refining results - what exactly defines relevancy for occurrences of a name throughout scientific literature?

5. Accuracy and completeness
And now for a reality check. BHL text will never be 100% accurate, and our names index will never be 100% complete. We're using automated software and services to process the millions of pages in the BHL collection because anything but an automated analysis simply won't scale. The names index and the services that support its creation and display are modular: should radically new character or word recognition software come along, the scanned images can be reprocessed and reindexed using TaxonFinder, and should a better taxonomic name finding algorithm emerge, it can replace TaxonFinder in our application. As technologies emerge to improve text transcription and indexing, BHL will evaluate them and deploy them within our application if they prove effective.

Future work
It's clear that we've identified enhancements needed in TaxonFinder to reduce the number of false positives. How best to implement those enhancements is yet to be determined, but at least we have data to guide us. We also plan to enhance the interface used for the discovered bibliographies, as the current implementation is not performant for large result sets. Further, we expect to facilitate downloading of the results in a standard format, such as BibTeX.

In closing, BHL is currently employing emerging technologies to transcribe and index a large collection of digitized scientific literature, and providing innovative interfaces into the data mined from it. These interfaces are rapidly evolving to meet user needs, based on user feedback, so if you have a suggestion for improvement please provide it via our Feedback form or on the comments below.


Sunday, March 2, 2008

A Leap for All Life: BHL & EOL


[Photo: 2008-03-01-dscn2644, originally uploaded by martin_kalfatovic]
The Biodiversity Heritage Library and the Encyclopedia of Life shared a table at the Congressional Family Night held at the Smithsonian's National Museum of Natural History.

The event (March 1, 2008) showcased a wide range of scientific endeavors engaged in by Smithsonian staff and was attended by members of Congress, their staff, and families.

Here, Cristián Samper, Acting Secretary of the Smithsonian and EOL steering committee member looks on as Gil Taylor (Smithsonian Institution Libraries) and Dawn Mason (EOL) demonstrate the recently launched EOL species pages.

Monday, February 25, 2008

Major updates to BHL Portal released

BHL developers have released several significant updates to the BHL portal today. These updates include:
  • Display of materials scanned by Internet Archive. BHL now manages more than 2.8 million pages from 7,500 digitized scientific texts. To stay updated on new titles, view our Recent Additions and subscribe to our feeds.
  • Filtering by Contributing Library. When users select "Browse By:" functions, they can filter results using the "For:" dropdown to view, for example, Authors from the New York Botanical Garden, or a Map of titles scanned by Smithsonian Institution Libraries, or Titles from All Contributors.
  • Feedback tracking. Users can submit feedback or comments on records using the Feedback link at the top of the portal.
For a complete list of bugs and enhancements included in this release, visit our issue tracking web site.

Tuesday, February 12, 2008

Happy Birthday Mr. Darwin!


[Portrait: Charles Darwin, originally uploaded by Smithsonian Libraries]
“The cultivation of natural science cannot be efficiently carried on without reference to an extensive library.” (1)
- Charles Darwin, et al (1847)


Today, February 12, 2008, we celebrate the 199th anniversary of the birth of Charles Darwin. Last year we honored the 300th anniversary of the birth of Carl Linné, and next year will bring the double celebrations of Darwin's bicentenary and the sesquicentennial (mark your calendars now for November 24th!) of the publication of On the Origin of Species. 2008 is thus a good year for those of us involved with the Biodiversity Heritage Library (BHL) to pause for a moment between these landmark anniversary years of 2007 and 2009.

Those working in systematics and taxonomy are heavily dependent on the historic literature – to a greater extent than perhaps most of the sciences. This importance of the literature, together with the ongoing importance of publication (and library deposit) in validating taxonomic concepts, contributes to the mission and continues to inform the day-to-day development of the BHL.

Darwin himself acknowledged the importance of library materials to the study of natural history in the passage quoted above (in a document signed by Darwin and over 30 other notables including Charles Lyell, W.J. Hooker, and Richard Owen) which was part of an appeal for support of natural history research at the British Museum.

- Martin Kalfatovic

Portrait of Charles Darwin by Ernest Edwards
From Scientific Identity: Portraits from the Dibner Library of the History of Science and Technology. Smithsonian Institution Libraries

(1) Darwin, C. R. et al. 1847. Copy of Memorial to the First Lord of the Treasury [Lord John Russell], respecting the Management of the British Museum. Parliamentary Papers, Accounts and Papers 1847, paper number (268), volume XXXIV.253 (13 April): 1-3. [Complete Works of Charles Darwin Online]

Monday, February 4, 2008

BHL part of the "Biological Moon Shot"

The Feb. 2, 2008 issue of Science News includes an article by Susan Milius ("Biological Moon Shot") on the Encyclopedia of Life and the Biodiversity Heritage Library. BHL staff members Tom Garnett and Martin Kalfatovic are quoted in the article:

Thomas Garnett of the Smithsonian's National Museum of Natural History heads a scanning and digitization group of encyclopedia workers. They are cooperating with the Biodiversity Heritage Library, a project through which 10 major libraries are scanning and placing on the Web pages from volumes that describe species. Some 80 million pages come from publications old enough to be in the public domain, and the scanners are starting with those.

In talking about the vital business of opening library resources to far-flung scientists, Garnett rolls his eyes at the mention of a specialized source for historians of science that has become one of the library's most popular downloads—the 1904 treatise Ants and Some Other Insects: An Inquiry Into the Psychic Powers of These Animals.

Wednesday, January 30, 2008

BHL presentation at the National Agriculture Library


[Photo: 2008-01-30-dscn2405, originally uploaded by martin_kalfatovic]
Smithsonian Institution Libraries staff members Martin Kalfatovic and Suzanne Pilsk gave a presentation on BHL to staff from the National Agriculture Library, the USDA Agriculture Research Service, the NASA Goddard Space Flight Center, and others.

  • The Biodiversity Heritage Library. Martin R. Kalfatovic and Suzanne C. Pilsk. National Agriculture Library: Issues and Answers Seminar. January 30, 2008. Beltsville, MD.

[Photo: 2008-01-30-dscn2409]

Thursday, January 24, 2008

"How's THAT for a tag cloud?!"

The NC State Insect Museum blog gave the Biodiversity Heritage Library a nice mention:
Finally got around to perusing my December, 2007 issue of Systematic Biology and saw this article by Godfray et al. about taxonomy and the Web. The authors provide nice summaries of emerging, alternative strategies for tackling the biodiversity and bioinformatics crises: CATE, uBio, DiGIR (to be replaced by TAPIR soon?), GBIF, Biodiversity Heritage Library initiative (how's THAT for a tag cloud?!), ZooBank, TDWG, iSpecies, and Wikispecies (my least favorite; at least I am not yet totally convinced that this is a good model for taxonomy). I find it curious that the Encyclopedia of Life was barely mentioned (and never by name) in that article, especially given its high profile and funding level. I'll have to remember to link some of these projects to our museum page, as we will undoubtedly be exploiting these resources and techniques to expose the data housed within our cabinets.
- Insect Museum blog

Friday, January 4, 2008

Senior Programmer needed to assist BHL development

The Missouri Botanical Garden (MOBOT), located in St. Louis, MO, is seeking to hire a Senior Programmer Analyst to work on several large biodiversity informatics projects, including the Biodiversity Heritage Library (BHL) online at www.biodiversitylibrary.org.

Primary responsibilities for this position include leading the development effort for MOBOT's LAMP-based applications, complementing the existing .Net team. Up first on the development schedule is the instantiation of Fedora (www.fedora-commons.org) at MOBOT as a repository layer in our multi-platform, SOA-based infrastructure, then refactoring applications and building new ones to utilize Fedora. Future projects include enhancement of the BHL GUI and development of tools for managing digital library content.

Qualifications include a BS in Computer Science or related field, 5 years experience developing enterprise-level applications, and 2 years experience leading a development team. Experience managing data and applications in an open source environment (LAMP and its variants) required. Experience managing biodiversity and/or library datasets preferred, but not required.

To apply online, please visit:
http://www.mobot.org/jobs/mbgjobs.asp#H005

Wednesday, January 2, 2008

Biodiversity Heritage Library - Europe

The BHL (http://www.biodiversitylibrary.org/About.aspx) currently consists of English language collections from the USA and UK (although we have huge amounts of material in over 40 other languages). I am working with European colleagues to develop a programme of activity in Europe to cover the other European languages. German and Netherlands colleagues are already working on bids and trial scanning. We are preparing a bid to the EU eContentplus programme for money to manage these activities across Europe (unfortunately, the EU will not fund scanning directly), and this will be led by the Museum für Naturkunde, Berlin.

We are currently looking for partners to join the eContentplus bid - in particular, we are looking for institutions with substantial collections of biodiversity literature, experts in scanning and digitisation, and researchers interested in OCR (optical character recognition) technologies. If you are interested in joining us, please contact me at g.higley@nhm.ac.uk.

Graham Higley
Wednesday, January 2nd, 2008