Tuesday, June 7, 2011

BHL and Our Users: Rod Page and BioStor

This week, we feature one of our users who has been extraordinarily active not only in using BHL content, but also in creating applications that significantly enhance the information and knowledge that can be gleaned from our resources. Meet Dr. Roderic Page, the creator of BioStor and a huge player in the realm of biodiversity informatics!

In the beginning..."meh"

I first became aware of the Biodiversity Heritage Library around 2007. To be honest, initially I was underwhelmed. BHL didn't seem to have much literature, what it did have was mostly about plants (I'm a zoologist by background), the interface was a bit clunky, and most of the content was pre-1923, which to me simply echoed the impression that taxonomy is a science that is something of a backwater, obsessed with ancient documents and arcane terminology.

So at the start I wasn't much of a fan. But as BHL grew it started to add more recent content, particularly for museum journals, as well as vital content such as the Bulletin of Zoological Nomenclature, and I realised that it was going to be much more useful than I'd previously thought. So I started playing with ways to visualise content from BHL, such as timelines to plot search results over time, and sparklines to show how the relative frequency of different names for the same organism would change over time (similar to the nice visualisations Ryan Schenk has done recently).

But where are the articles?

These experiments were fun, but I kept coming up against what for me was the show stopper: BHL had no concept of a scientific article. Because it was a library project the basic unit in BHL was a scanned item, which could correspond to anything from a book to one or more volumes of a journal or a single article. Whereas librarians deal with volumes on shelves, for most scientists the unit that matters is the article, and there was no easy way to find articles in BHL. To be fair, BHL was well aware of this mismatch between library practice and the expectations of scientists (see Chris Freeland's post But where are the articles??).

I'd spent a lot of time developing a tool called bioGUID, which was designed to find articles online using just the journal name, volume, and starting page. It uses a range of web services to find the article, such as talking to CrossRef to see if the article has a DOI (the ubiquitous identifier for modern articles), as well as searching other sources, such as JSTOR. I wanted something like this for BHL, where you could simply take those three things - journal, volume, starting page - and go straight to the article. A common way to provide this service is through a protocol called OpenURL, which takes the journal, volume, and starting page for an article and looks for it online.
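To make the idea concrete, here is a minimal sketch of how those three things might be packed into an OpenURL-style query. It is an illustration only, not bioGUID's or BioStor's actual code: the resolver address, the function name build_openurl, and the journal, volume and page values are all made up, and the parameter names (genre, title, volume, spage) follow the older key/value flavour of OpenURL.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical resolver address, used purely for illustration.
RESOLVER = "http://example.org/openurl"


def build_openurl(journal, volume, start_page, resolver=RESOLVER):
    """Build an OpenURL-style query from journal, volume and starting page."""
    params = {
        "genre": "article",        # we are looking for a journal article
        "title": journal,          # journal title
        "volume": str(volume),
        "spage": str(start_page),  # starting page
    }
    return resolver + "?" + urlencode(params)


# Placeholder values, not a real citation.
url = build_openurl("Example Journal of Zoology", 12, 345)
print(url)

# Decoding the query string gives back the fields we put in.
fields = parse_qs(urlsplit(url).query)
print(fields["title"][0], fields["volume"][0], fields["spage"][0])
```

A resolver receiving such a request still has to do the hard part, which is matching those values against whatever its own catalogue happens to record.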

However, finding articles in BHL is a challenging task, not least because there is little standardisation in how library catalogues record bibliographic information. To give just one example, for the journal Proceedings of the Zoological Society of London here are some of the ways information about a volume is recorded:
  • Part 1- Part 4 (1833-38)
  • 1856
  • 1901, v. 1 (Jan.-Apr.)
  • Jan-Apr 1906
  • 1912 v. 2
  • 1923, pt. 1-2 (pp. 1-481)

So any tool to find articles has to deal with these issues. But after a few experiments I decided it would be possible to find lots of articles in BHL, especially if I had access to all the BHL data on my own computers. So, I grabbed a copy of the data and created BioStor.
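As a rough sketch of what dealing with these issues can involve, the snippet below pulls a year range and a volume number out of the six labels listed above using regular expressions. The patterns and the function name parse_volume_label are assumptions made for this example; it is not the matching logic BioStor actually uses, and real catalogue data would need many more cases.

```python
import re

# A four-digit year, optionally followed by an abbreviated or full end year.
YEAR = re.compile(r"\b(1[6-9]\d\d|20\d\d)(?:-(\d{2,4}))?\b")
# A volume written as "v. 1" or "v.1".
VOLUME = re.compile(r"\bv\.\s*(\d+)")


def parse_volume_label(label):
    """Return (first_year, last_year, volume); missing parts are None."""
    first = last = volume = None
    m = YEAR.search(label)
    if m:
        first = last = int(m.group(1))
        tail = m.group(2)
        if tail:
            # "1833-38" abbreviates the end year: borrow the leading digits.
            last = int(m.group(1)[: 4 - len(tail)] + tail)
    v = VOLUME.search(label)
    if v:
        volume = int(v.group(1))
    return first, last, volume


labels = [
    "Part 1- Part 4 (1833-38)",
    "1856",
    "1901, v. 1 (Jan.-Apr.)",
    "Jan-Apr 1906",
    "1912 v. 2",
    "1923, pt. 1-2 (pp. 1-481)",
]
for label in labels:
    print(label, "->", parse_volume_label(label))
```

Even this small sample shows the problem: parts ("pt. 1-2"), months and page ranges are silently ignored here, and each would need its own rule.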

BioStor

Below is a screen shot of BioStor, which at the moment has over 31,000 articles from BHL.

There are two main ways to use BioStor. The first is as a website where you can browse or search for articles. You can search for articles about taxa by adding the taxon name to http://biostor.org/name/, for example http://biostor.org/name/Zonosaurus. In addition to displaying the article, BioStor displays the names found in the article as a tag cloud and a classification, and in some cases also shows a map with localities that have been automatically extracted from the text and displayed on the map, such as this example from A revision of the dwarf Zonosaurus Boulenger (Reptilia: Squamata: Cordylidae) from Madagascar, including descriptions of three new species:

The other way you can use BioStor is as an OpenURL resolver. Bibliographic software and websites such as EndNote, Zotero, and Mendeley all support OpenURL, so you can be looking at an article in one of those databases and automatically look for it in BioStor.

BioStor needs bibliographies

One thing I've glossed over is how BioStor has managed to find thousands of articles. Some have been added manually, but this rapidly gets tedious. For the majority of articles what I've done is take an existing bibliography for a journal, or a taxonomic group, and write a small computer programme (or "script") to get BioStor to find the articles automatically. For example, I quickly added most of the articles in the journal Tijdschrift voor Entomologie because I had an EndNote file containing those references.
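A script of this kind might look something like the sketch below, which assumes the bibliography has been exported in the RIS tagged format (one of the formats reference managers such as EndNote can produce) and reduces each record to journal, volume and starting page. The sample record, the function names and the choice of RIS are assumptions for illustration, not the scripts actually used to populate BioStor.

```python
import re

# An RIS line is a two-character tag, two spaces, a hyphen, then the value.
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def parse_ris(text):
    """Parse RIS text into a list of dicts keyed by RIS tag."""
    records, current = [], {}
    for line in text.splitlines():
        m = RIS_LINE.match(line.rstrip())
        if not m:
            continue  # skip blank or malformed lines
        tag, value = m.groups()
        if tag == "ER":  # end of record
            if current:
                records.append(current)
            current = {}
        else:
            current.setdefault(tag, value)  # keep the first value per tag
    return records


def to_lookup(record):
    """Reduce a record to (journal, volume, starting page)."""
    journal = record.get("JO") or record.get("JF") or record.get("T2")
    return journal, record.get("VL"), record.get("SP")


# A made-up record, not a real reference.
sample = """TY  - JOUR
TI  - An example article title
JO  - Example Journal
VL  - 12
SP  - 34
EP  - 56
PY  - 1905
ER  - 
"""

for rec in parse_ris(sample):
    print(to_lookup(rec))
```

Each resulting triple could then be handed to a resolver, one request per reference.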

I spend a lot of time searching for bibliographies, downloading them or scraping them from websites, converting them into a readable format, then using scripts to ask BioStor to locate the article in BHL. I'm somewhat taken aback by how hard it is to get these bibliographies. If taxonomists and/or journal editors made these available, we could add many more articles to BioStor. While one approach is to beg, borrow, or steal bibliographies, I'm hoping that the rise of online bibliography databases and associated social networks, especially Mendeley, will generate the bibliographies I need to efficiently find articles in BHL.

What's next?

BioStor has some obvious limitations, notably the assumption that older literature works the same way as modern articles. Whereas today figures, tables, and text are all contained within the page range of an article, it's not uncommon in older (pre-20th century) literature for figures and plates to be physically separate from the text. BioStor can't really handle this, so one day I plan to add the ability to have discontinuous page ranges that will include these figures and plates.
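One possible way to represent a discontinuous page range is as a list of inclusive (first, last) pairs rather than a single start and end. The sketch below is only a guess at what such a structure could look like, not a description of how BioStor stores articles; the page positions are invented and stand for positions within a scanned item.

```python
def expand(ranges):
    """Flatten inclusive (first, last) ranges into an ordered list of pages."""
    pages = []
    for first, last in ranges:
        pages.extend(range(first, last + 1))
    return pages


# Hypothetical article: text on pages 10-14, plates bound at pages 200-201.
text_pages = (10, 14)
plates = (200, 201)

print(expand([text_pages, plates]))
# prints [10, 11, 12, 13, 14, 200, 201]
```

A modern article is then just the special case with a single pair in the list.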

Despite the fact that I've spent a lot of time creating and populating BioStor, in reality it is a side project running on a Mac Mini on my desk. At some point it would be nice to feed BioStor's data back into BHL itself, so users of BHL could more easily find articles without leaving that web site. BHL also has more resources for ensuring the long term survival of data than I do.

What do I think of BHL now?

Despite my initial lack of enthusiasm, I now see BHL as one of the great resources of biodiversity informatics. There's some extraordinary stuff in BHL, and it keeps growing. It's also been great working with Chris Freeland, Phil Cryer, and Mike Lichtenberg, who have all been very helpful, even when I've written blog posts venting my frustration with BHL's limitations. I think it's definitely one of those cases where you only complain about the things you actually care about.
