Tuesday, August 19, 2008

But where are the articles??

Many researchers are used to searching or browsing for materials by article. Article-level access to BHL content is a goal that we're striving for, and one that we haven't yet reached!

BHL is a mass scanning operation. Our member libraries are moving as quickly as possible through a range of materials - books, serials, etc. - in order to scan as much as possible during our relatively brief window of funding. Our goal is to scan & cache now, then add in advanced technology solutions for secondary post-processing as they are developed.

We've found that in scanning historic scientific monographs and journals, identifying articles by hand is too labor-intensive (and expensive). BHL staff, through connections formed by our scanning partner, Internet Archive, have been working with Penn State's College of Information Sciences and Technology (the developers behind CiteSeer) to provide a test bed for algorithms that extract article metadata from historic literature.

There are a number of challenges in our digitized historic literature that cause even the most scalable, sophisticated algorithms to return inaccurate results. These include:
  • uncorrected source OCR (accuracy is problematic)
  • multiple foreign languages, including Latin
  • irregular printing processes and typesetting in historic literature
  • changes in printing process and issue frequency over the course of a journal run
Still, progress is being made:
  • Penn State's algorithms have been demonstrated, as in this example
  • They need access to a wider testbed for improved machine learning, which is now available (7.4 million pages in BHL as of this writing)
But the work is far from finished. Next steps:
  • need to refine algorithms
  • need interfaces for human editing
    • too many possible inaccuracies upstream (e.g. uncorrected OCR) to rely on algorithms alone
    • possibly distribute this task via Mechanical Turk or other 'clickworking' network?
  • define workflow
    • first pass: algorithms; second pass: volunteers; editorial review?
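To make the proposed workflow a bit more concrete, here is a minimal Python sketch of a candidate article moving through the three passes listed above (algorithm, volunteer, editorial review). Everything in it is an assumption for illustration: the class and field names are hypothetical, the sample title is made up, and none of this reflects BHL's or Penn State's actual code.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Stage(Enum):
    ALGORITHM = 1  # first pass: automated metadata extraction
    VOLUNTEER = 2  # second pass: human correction
    EDITORIAL = 3  # third pass: editorial review
    ACCEPTED = 4   # all passes complete


@dataclass
class ArticleCandidate:
    """Hypothetical record for one article detected in a scanned volume."""
    title: str
    start_page: int
    end_page: int
    stage: Stage = Stage.ALGORITHM
    history: List[Tuple[str, str]] = field(default_factory=list)

    def advance(self, reviewer: str, corrected_title: Optional[str] = None) -> None:
        """Record who completed the current pass, apply any correction, move on."""
        if self.stage is Stage.ACCEPTED:
            raise ValueError("candidate has already passed every stage")
        self.history.append((self.stage.name, reviewer))
        if corrected_title is not None:
            self.title = corrected_title
        self.stage = Stage(self.stage.value + 1)


# Made-up example: the OCR-derived title contains a typo that a volunteer fixes.
candidate = ArticleCandidate("Notes on the Brids of Kent", 101, 110)
candidate.advance("extractor")
candidate.advance("volunteer-1", corrected_title="Notes on the Birds of Kent")
candidate.advance("editor")
```

Keeping a history of who handled each pass is one possible design choice; it would let an editor see whether a title came straight from the algorithm or was corrected by a person.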
But what about Google, you say? They're scanning books en masse. They're smart. Haven't they solved the problem?? The quick answer is "No, not really." Take a look at the "Contents" section for:
Zoologist: A Monthly Journal of Natural History, ser.4 v.12 1908

This is what Google can do...with all of their resources & grey matter. BHL is many things to many people, but we're certainly not Google!

And here's where we need your input:
So what can you, an enthusiastic supporter of BHL who wants access to articles in our collection, do to help? We're glad you asked!
  1. Regardless of the means by which we actually get the article metadata, we've got to store that alongside all of our other content in BHL. We have released an UPDATED (9/3/2008) data model supporting articles for review & comment, guided by our research into NLM, OpenURL, and OAI-ORE, and with help from the expert developers behind the Public Library of Science.
  2. We need to hear how you, enthusiastic BHL supporter, expect to access and use article-based content. We're looking for information about the sites you use and like with similar content, as well as your general expectations for delivery of articles in BHL. We know that's a wide open question; here's your chance to bend our ear(s).
Please use the "Comment" feature below to drop off your suggestions and ideas for both topics. Or, if you'd prefer to keep your opinions private, e-mail them to chris (dot) freeland (at) mobot (dot) org.
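As a conversation starter for the first topic, here is a rough Python sketch of what an article record might look like and how it could be expressed as OpenURL-style key/value pairs. This is not the BHL data model: the class and field names are hypothetical, and the article title and page numbers are placeholders. The journal, volume, and year come from the Zoologist example above, and the `rft.*` keys follow the OpenURL journal metadata format.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass
class ArticleRecord:
    """Hypothetical article-level record tied to a scanned journal volume."""
    article_title: str
    journal_title: str
    volume: str
    year: str
    start_page: str
    end_page: str

    def to_openurl_query(self) -> str:
        """Serialize the record as an OpenURL key/encoded-value query string."""
        return urlencode({
            "url_ver": "Z39.88-2004",
            "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
            "rft.genre": "article",
            "rft.atitle": self.article_title,
            "rft.jtitle": self.journal_title,
            "rft.volume": self.volume,
            "rft.date": self.year,
            "rft.spage": self.start_page,
            "rft.epage": self.end_page,
        })


# Placeholder article title and pages; journal details are from the example above.
record = ArticleRecord(
    article_title="(placeholder article title)",
    journal_title="Zoologist: A Monthly Journal of Natural History",
    volume="12",
    year="1908",
    start_page="1",
    end_page="10",
)
query = record.to_openurl_query()
```

If your own tools already consume OpenURL, knowing which of these fields you rely on most is exactly the kind of feedback that would help.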

Looking forward to the feedback, and to providing this important method of access to the wealth of content in BHL.

Chris Freeland
Technical Director, BHL

Tuesday, August 5, 2008

Revised BHL Data Model

The latest revision of the BHL Data Model is now available for review at:
http://www.biodiversitylibrary.org/documents/BHLDataModel_20080805.pdf