BHL is a mass scanning operation. Our member libraries are moving as quickly as possible through a range of materials - books, serials, etc. - in order to scan as much as possible during our relatively brief window of funding. Our goal is to scan & cache now, then add in advanced technology solutions for secondary post-processing as they are developed.
We've found that in scanning historic scientific monographs and journals, article identification is too labor-intensive (and expensive) to do by hand. BHL staff, through connections formed by our scanning partner, Internet Archive, have been working with Penn State's College of Information Sciences and Technology (the developers behind CiteSeer) to provide a testbed algorithm to extract article metadata from historic literature.
There are a number of challenges in our digitized historic literature that cause even the most scalable, sophisticated algorithms to return inaccurate results. These include:
- uncorrected source OCR (accuracy is problematic)
- multiple foreign languages, including Latin
- irregular printing processes and typesetting in historic literature
- changes in printing process and issue frequency over the course of a journal run
- Penn State's algorithms have been demonstrated, as in this example
- The algorithms need a wider testbed for improved machine learning; that testbed is now available (7.4 million pages in BHL as of this writing)
- need to refine algorithms
- need interfaces for human editing
- too many possible inaccuracies upstream
- possibly distribute this task via Mechanical Turk or other 'clickworking' network?
- define workflow
- first pass: algorithms; second pass: volunteers; editorial review?
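One way to picture the workflow floated above (algorithms first, volunteers second, editorial review last) is as a small state machine. This is only a rough sketch to make the idea concrete, not a description of anything BHL or Penn State has built. The class names, the `confidence` field, and the rule that a rejected candidate goes back to volunteers are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    ALGORITHM = "algorithm"   # first pass: automated extraction
    VOLUNTEER = "volunteer"   # second pass: human correction
    EDITORIAL = "editorial"   # possible final review
    ACCEPTED = "accepted"     # metadata is ready to store


# The order in which an approved candidate moves through the pipeline.
PIPELINE = [Stage.ALGORITHM, Stage.VOLUNTEER, Stage.EDITORIAL, Stage.ACCEPTED]


@dataclass
class ArticleCandidate:
    """Hypothetical record for one article found in a scanned volume."""
    title: str
    start_page: int
    end_page: int
    confidence: float         # assumed score reported by the algorithm
    stage: Stage = Stage.ALGORITHM
    history: list = field(default_factory=list)


def next_stage(candidate: ArticleCandidate, approved: bool = True) -> Stage:
    """Move a candidate one step along the pipeline and record the step."""
    if approved:
        idx = PIPELINE.index(candidate.stage)
        new_stage = PIPELINE[min(idx + 1, len(PIPELINE) - 1)]
    else:
        # Assumption: a rejected candidate returns to volunteers for correction.
        new_stage = Stage.VOLUNTEER
    candidate.history.append((candidate.stage, new_stage))
    candidate.stage = new_stage
    return new_stage


if __name__ == "__main__":
    # Placeholder values, not real article metadata.
    candidate = ArticleCandidate("Example article title", 12, 18, confidence=0.6)
    next_stage(candidate)                   # algorithm -> volunteer
    next_stage(candidate)                   # volunteer -> editorial
    next_stage(candidate, approved=False)   # editor sends it back
    print(candidate.stage.name, len(candidate.history))
```

Keeping a `history` of transitions is one possible way to see how often candidates bounce between volunteers and editors, which might help answer the open question of whether an editorial review step is needed at all.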
Zoologist: A Monthly Journal of Natural History, ser.4 v.12 1908
This is what Google can do...with all of their resources & grey matter. BHL is many things to many people, but we're certainly not Google!
So what can you, an enthusiastic supporter of BHL who wants access to articles in our collection, do to help? We're glad you asked! Here's where we need your input:
- Regardless of how we actually get the article metadata, we have to store it alongside all of our other content in BHL. We have released an UPDATED (9/3/2008) data model supporting articles for review & comment, guided by our research into NLM, OpenURL, and OAI-ORE, and with help from the expert developers behind the Public Library of Science.
- We need to hear how you, enthusiastic BHL supporter, expect to access and use article-based content. We're looking for information about the sites you use and like with similar content, as well as your general expectations for delivery of articles in BHL. We know that's a wide open question; here's your chance to bend our ear(s).
Looking forward to the feedback, and to providing this important method of access to the wealth of content in BHL.
Technical Director, BHL