Tuesday, August 19, 2008

But where are the articles??

Many researchers are used to searching or browsing for materials by article. Article-level access to BHL content is a goal that we're striving for, and one that we haven't yet reached!

BHL is a mass scanning operation. Our member libraries are moving as quickly as possible through a range of materials - books, serials, etc. - in order to scan as much as possible during our relatively brief window of funding. Our goal is to scan & cache now, then add in advanced technology solutions for secondary post-processing as they are developed.

We've found that in scanning historic scientific monographs and journals, article identification is too labor-intensive (and expensive) to do by hand. BHL staff, through connections formed by our scanning partner, Internet Archive, have been working with Penn State's College of Information Sciences and Technology (the developers behind CiteSeer) to provide a test bed algorithm to extract article metadata from historic literature.

There are a number of challenges in our digitized historic literature that cause even the most scalable, sophisticated algorithms to return inaccurate results. These include:
  • uncorrected source OCR (accuracy is problematic)
  • multiple foreign languages, including Latin
  • irregular printing processes and typesetting in historic literature
  • changes in printing process and issue frequency during the course of a journal run
Still, progress is being made:
  • Penn State's algorithms have been demonstrated, as in this example
  • They need access to a wider testbed for improved machine learning, which is now available (7.4 million pages in BHL as of this writing)
But the work is far from finished. Next steps:
  • need to refine algorithms
  • need interfaces for human editing
    • too many possible inaccuracies upstream
    • possibly distribute this task via Mechanical Turk or other 'clickworking' network?
  • define workflow
    • first pass: algorithms; second pass: volunteers; editorial review?
But what about Google, you say? They're scanning books en masse. They're smart. Haven't they solved the problem?? The quick answer is "No, not really." Take a look at the "Contents" section for:
Zoologist: A Monthly Journal of Natural History, ser.4 v.12 1908

This is what Google can do...with all of their resources & grey matter. BHL is many things to many people, but we're certainly not Google!

And here's where we need your input:
So what can you, an enthusiastic supporter of BHL who wants access to articles in our collection, do to help? We're glad you asked!
  1. Regardless of the means by which we actually get the article metadata, we've got to store that alongside all of our other content in BHL. We have released an UPDATED (9/3/2008) data model supporting articles for review & comment, guided by our research into NLM, OpenURL, and OAI-ORE, and with help from the expert developers behind the Public Library of Science.
  2. We need to hear how you, enthusiastic BHL supporter, expect to access and use article-based content. We're looking for information about the sites you use and like with similar content, as well as your general expectations for delivery of articles in BHL. We know that's a wide open question; here's your chance to bend our ear(s).
Please use the "Comment" feature below to drop off your suggestions and ideas for both topics. Or, if you'd prefer to keep your opinions private, e-mail them to chris (dot) freeland (at) mobot (dot) org.

Looking forward to the feedback, and to providing this important method of access to the wealth of content in BHL.

Chris Freeland
Technical Director, BHL

1 comment:

  1. Chris,

    At its simplest, the triple (journal, volume, starting page) is sufficient to uniquely identify most articles. Hence, my vote is to provide an OpenURL resolver. It would need a database of journal abbreviations to handle alternative spellings (one could also use approximate string matching, which I do in bioGUID, presently offline while I rebuild the server). OpenURL gets you into tools like Connotea and Firefox extensions that handle COinS, could make a BHL OpenURL resolver a drop-in replacement for projects like EDIT, and postpones the GUID issue. It also makes it easy to do bulk searches for references (which taxonomic database maintainers will want to do).
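
    A minimal sketch of the lookup described above, assuming an OpenURL 1.0 key/value query that carries rft.jtitle, rft.volume, and rft.spage. The alias table, the article index, and the identifiers in it are made up for illustration, and Python's difflib stands in for whatever approximate matching a real resolver would use:

```python
import difflib
from urllib.parse import parse_qs

# Hypothetical alias table: known spelling (lowercased) -> canonical title.
JOURNAL_ALIASES = {
    "zoologist": "The Zoologist",
    "the zoologist": "The Zoologist",
    "ann. mag. nat. hist.": "Annals and Magazine of Natural History",
    "annals and magazine of natural history": "Annals and Magazine of Natural History",
}

# Hypothetical index keyed on the triple (journal, volume, starting page).
# The volumes, pages, and identifiers are placeholders, not real records.
ARTICLES = {
    ("The Zoologist", "12", 41): "example-article-1",
    ("Annals and Magazine of Natural History", "3", 120): "example-article-2",
}


def normalize_journal(title, cutoff=0.8):
    """Map a journal title or abbreviation to its canonical form, or None."""
    key = " ".join(title.lower().split())
    if key in JOURNAL_ALIASES:
        return JOURNAL_ALIASES[key]
    # Fall back to approximate matching against the known spellings.
    close = difflib.get_close_matches(key, list(JOURNAL_ALIASES), n=1, cutoff=cutoff)
    return JOURNAL_ALIASES[close[0]] if close else None


def resolve(journal, volume, spage):
    """Return the article identifier for the triple, or None if unknown."""
    canonical = normalize_journal(journal)
    if canonical is None or not str(spage).isdigit():
        return None
    return ARTICLES.get((canonical, str(volume), int(spage)))


def resolve_openurl(query):
    """Resolve a query string such as 'rft.jtitle=...&rft.volume=...&rft.spage=...'."""
    params = parse_qs(query)
    try:
        journal = params["rft.jtitle"][0]
        volume = params["rft.volume"][0]
        spage = params["rft.spage"][0]
    except KeyError:
        return None
    return resolve(journal, volume, spage)
```

    For example, resolve_openurl("rft.jtitle=Zoologist&rft.volume=12&rft.spage=41") finds the first placeholder record through an exact alias, while resolve("Ann Mag Nat Hist", 3, 120) reaches the second only through the approximate match.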

    Dates are another piece of metadata (SICIs as used by JSTOR depend on them) but in my experience they can be incorrect, and taxonomists obsess about the actual publication date, hence given some bibliographic metadata it's not always clear what the actual date should be.

    You could then generalise OpenURL to handle pages within the (start page, end page) range, making it easier for nomenclators to locate article metadata given only a single page reference.
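
    The page-within-range lookup could be sketched as follows. The page ranges and identifiers are invented, and the sketch assumes the ranges within one volume do not overlap, which will not hold where one article ends and the next begins on the same page:

```python
import bisect

# Hypothetical (start_page, end_page, article_id) ranges for a single
# journal volume, sorted by start page. Placeholder data only.
RANGES = [
    (1, 14, "example-a1"),
    (15, 40, "example-a2"),
    (41, 44, "example-a3"),
    (50, 62, "example-a4"),
]
STARTS = [start for start, _, _ in RANGES]


def article_for_page(page):
    """Return the article whose (start, end) range contains the page, or None."""
    # Find the last range that starts at or before the requested page.
    i = bisect.bisect_right(STARTS, page) - 1
    if i < 0:
        return None
    start, end, article_id = RANGES[i]
    # Pages in a gap between articles (plates, indexes) match nothing.
    return article_id if start <= page <= end else None
```

    A nomenclator citing only page 20 would get "example-a2" back, while page 45 falls in a gap and returns nothing.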

    Returning to the GUID issue, Robert Cameron has an elegant treatment (Scholar-Friendly DOI Suffixes with JACC: Journal Article Citation Convention) which covers most things. Simple and open wins.

    Won't say much about the data model; these diagrams are always hideous and rarely enlightening IMHO. I think you'll want to tease apart first and last names of authors, and deal with multiple versions of the same name. I found "On identifying name equivalences in digital libraries" to be a good read on this topic.
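
    As a rough illustration of separating name parts, here is a sketch that stores given and family names in separate fields and groups variants by family name plus first initial. The class, the parsing rules, and the matching key are assumptions made for the example, not a method taken from the paper mentioned above, and a key this coarse will also merge different people who share a surname and initial:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthorName:
    given: str
    family: str

    def match_key(self):
        """Coarse key for grouping variants: (family name, first initial)."""
        return (self.family.lower(), self.given[:1].lower())


def parse_author(raw):
    """Split 'Family, Given' or 'Given Family' into separate fields."""
    raw = " ".join(raw.split())
    if "," in raw:
        family, given = [part.strip() for part in raw.split(",", 1)]
    else:
        # Treat the last word as the family name; this is naive for
        # multi-word surnames such as "von Humboldt".
        given, _, family = raw.rpartition(" ")
    return AuthorName(given=given, family=family)
```

    With this, parse_author("Darwin, C.") and parse_author("Charles Darwin") produce different records that share one match key, so they can be flagged as candidate versions of the same name for a human to confirm.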