But where are the articles??

Many researchers are used to searching or browsing for materials by article. Article-level access to BHL content is a goal that we’re striving for, and one that we haven’t yet reached!

BHL is a mass scanning operation. Our member libraries are moving as quickly as possible through a range of materials – books, serials, etc. – in order to scan as much as possible during our relatively brief window of funding. Our goal is to scan & cache now, then add in advanced technology solutions for secondary post-processing as they are developed.

We’ve found that in scanning historic scientific monographs and journals, article identification is too labor-intensive (and expensive) to do by hand. BHL staff, through connections formed by our scanning partner, Internet Archive, have been working with Penn State’s College of Information Sciences and Technology (the developers behind CiteSeer) to provide a test-bed algorithm to extract article metadata from historic literature.

There are a number of challenges in our digitized historic literature that cause even the most scalable, sophisticated algorithms to return inaccurate results. These include:

uncorrected source OCR (accuracy is problematic)
multiple foreign languages, including Latin
irregular printing processes and typesetting in historic literature
changes in printing process and issue frequency over the course of a journal run

Still, progress is being made:

Penn State’s algorithms have been demonstrated, as in this example
They need access to a wider testbed for improved machine learning, which is now available (7.4 million pages in BHL as of this writing)

But the work is far from finished. Next steps:

need to refine algorithms
need interfaces for human editing
- too many possible inaccuracies upstream
- possibly distribute this task via Mechanical Turk or other ‘clickworking’ network?
define workflow
- first pass: algorithms; second pass: volunteers; editorial review?

But what about Google, you say? They’re scanning books en masse. They’re smart. Haven’t they solved the problem?? The quick answer is “No, not really.” Take a look at the “Contents” section for:
Zoologist: A Monthly Journal of Natural History, ser.4 v.12 1908

This is what Google can do…with all of their resources & grey matter. BHL is many things to many people, but we’re certainly not Google!

And here’s where we need your input:
So what can you, an enthusiastic supporter of BHL who wants access to articles in our collection, do to help? We’re glad you asked!

Regardless of the means by which we actually get the article metadata, we’ve got to store that alongside all of our other content in BHL. We have released an UPDATED (9/3/2008) data model supporting articles for review & comment, guided by our research into NLM, OpenURL, and OAI-ORE, and with help from the expert developers behind the Public Library of Science.
We need to hear how you, enthusiastic BHL supporter, expect to access and use article-based content. We’re looking for information about the sites you use and like with similar content, as well as your general expectations for delivery of articles in BHL. We know that’s a wide open question; here’s your chance to bend our ear(s).

Please use the “Comment” feature below to drop off your suggestions and ideas for both topics. Or, if you’d prefer to keep your opinions private, e-mail them to chris (dot) freeland (at) mobot (dot) org.

Looking forward to the feedback, and to providing this important method of access to the wealth of content in BHL.

August 19, 2008

Written by Chris Freeland

Chris Freeland served as the BHL Technical Director from 2006-2012. He is currently the Director of the Open Libraries program at Internet Archive. In this capacity he works with libraries & publishers to digitize their collections, working towards the Archive’s mission of providing “universal access to all knowledge.”

1 Comment

Rod Page, August 20, 2008 at 8:20 am

Chris,

At its simplest, the triple (journal, volume, starting page) is sufficient to uniquely identify most articles. Hence, my vote is to provide an OpenURL resolver. It would need a database of journal abbreviations to handle alternative spellings (one could also use approximate string matching, which I do in bioGUID, presently offline while I rebuild the server). OpenURL gets you into tools like Connotea and Firefox extensions that handle COinS, could make a BHL OpenURL resolver a drop-in replacement for projects like EDIT, and postpones the GUID issue. It also makes it easy to do bulk searches for references (which taxonomic database maintainers will want to do).
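A minimal Python sketch of the resolver idea: match a possibly misspelled journal title against a list of known titles, then build an OpenURL 1.0 key/encoded-value query from the (journal, volume, starting page) triple. The title list, the resolver address, and the function names are hypothetical, and the standard library's difflib stands in for whatever matching bioGUID actually uses. Abbreviations would still need the database of journal abbreviations mentioned above; this only illustrates close spelling variants.

```python
from difflib import get_close_matches
from urllib.parse import urlencode

# Hypothetical list of canonical journal titles; a real resolver would
# draw on a much larger database of titles and abbreviations.
KNOWN_JOURNALS = [
    "Zoologist",
    "Annals and Magazine of Natural History",
    "Proceedings of the Zoological Society of London",
]

# Hypothetical resolver address, for illustration only.
RESOLVER_BASE = "https://resolver.example.org/openurl"


def match_journal(title, cutoff=0.6):
    """Return the closest known journal title, or None if nothing is close."""
    lowered = {j.lower(): j for j in KNOWN_JOURNALS}
    hits = get_close_matches(title.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


def openurl_for(journal, volume, start_page):
    """Build an OpenURL 1.0 (KEV) query for a (journal, volume, start page) triple."""
    # Fall back to the title as given when no known title is close enough.
    canonical = match_journal(journal) or journal
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.genre": "article",
        "rft.jtitle": canonical,
        "rft.volume": str(volume),
        "rft.spage": str(start_page),
    }
    return RESOLVER_BASE + "?" + urlencode(params)
```

Because the query carries only the triple, the same function can be called in a loop over a reference list, which is the bulk-search case taxonomic database maintainers would care about.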

Dates are another piece of metadata (SICIs as used by JSTOR depend on them) but in my experience they can be incorrect, and taxonomists obsess about the actual publication date, hence given some bibliographic metadata it’s not always clear what the actual date should be.

You could then generalise OpenURL to handle pages within the (start page, end page) range, making it easier for nomenclators to locate article metadata given only a single page reference.
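The page-range lookup can be sketched as follows, assuming each article in a volume is stored with its start and end page. The records and the function name are hypothetical. A binary search over the start pages finds the candidate article, and the end page confirms that the cited page really falls inside it.

```python
from bisect import bisect_right

# Hypothetical article records for one volume, sorted by start page:
# (start_page, end_page, title)
ARTICLES = [
    (1, 16, "Article A"),
    (17, 40, "Article B"),
    (41, 58, "Article C"),
]


def article_for_page(page, articles=ARTICLES):
    """Return the article whose page range contains `page`, or None."""
    starts = [a[0] for a in articles]
    # Index of the last article that starts on or before the cited page.
    i = bisect_right(starts, page) - 1
    if i >= 0 and articles[i][0] <= page <= articles[i][1]:
        return articles[i]
    return None
```

Historic journals often end one short note and begin the next on the same page. This sketch returns only the article that starts latest, so a real resolver would probably need to return every article whose range covers the page.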

Returning to the GUID issue, Robert Cameron has an elegant treatment (Scholar-Friendly DOI Suffixes with JACC: Journal Article Citation Convention) which covers most things. Simple and open wins.

Won’t say much about the data model; these diagrams are always hideous and rarely enlightening IMHO. I think you’ll want to tease apart first and last names of authors, and deal with multiple versions of the same name. I found On identifying name equivalences in digital libraries to be a good read on this topic.

About BHL

The Biodiversity Heritage Library (BHL) is the world’s largest open access digital library for biodiversity literature and archives. Headquartered at the Smithsonian Libraries and Archives in Washington, D.C., BHL operates as a worldwide consortium of natural history, botanical, research, and national libraries working together to digitize the natural history literature held in their collections and make it freely available for open access as part of a global “biodiversity community.”