Tuesday, April 21, 2009

File formats for citation storage and distribution

We’ve been investigating options for storage and distribution of citation data in the Biodiversity Heritage Library. In particular, we are searching for an appropriate "core" format. The thought is that with an appropriately verbose, open, standard core format for our citations, we can transform that format into whatever other format we might want to support. By “verbose”, we mean a format that can support all of the information that we need to preserve. By “open”, we’re looking for a format that’s not tied exclusively to one system or vendor. And by “standard”, we’re hoping to identify a format that is widely recognized by the library community.
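To make the "core format" idea concrete, here is a minimal sketch in Python. It assumes a hypothetical core record held as a plain dictionary (the field names are ours, not those of any candidate format) and shows two small functions that render it as simplified RIS and BibTeX. The sample values are abbreviated from an article cited elsewhere on this page; real exporters would need escaping and many more fields.

```python
# Hypothetical "core" citation record. Field names are illustrative only
# and do not correspond to any specific candidate format.
core = {
    "authors": ["Thompson, John V."],
    "title": "Contributions towards the Natural History of the Dodo",
    "journal": "Magazine of natural history",
    "volume": "2",
    "year": "1829",
    "start_page": "442",
    "end_page": "448",
}


def to_ris(record):
    """Render the core record as a simplified RIS journal-article entry."""
    lines = ["TY  - JOUR"]
    lines += [f"AU  - {name}" for name in record["authors"]]
    lines.append(f"TI  - {record['title']}")
    lines.append(f"JO  - {record['journal']}")
    lines.append(f"VL  - {record['volume']}")
    lines.append(f"PY  - {record['year']}")
    lines.append(f"SP  - {record['start_page']}")
    lines.append(f"EP  - {record['end_page']}")
    lines.append("ER  - ")
    return "\n".join(lines)


def to_bibtex(record, key):
    """Render the core record as a simplified BibTeX @article entry."""
    fields = {
        "author": " and ".join(record["authors"]),
        "title": record["title"],
        "journal": record["journal"],
        "volume": record["volume"],
        "year": record["year"],
        "pages": f"{record['start_page']}--{record['end_page']}",
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@article{{{key},\n{body}\n}}"


print(to_ris(core))
print(to_bibtex(core, "thompson1829"))
```

The point of the sketch is only that each output format becomes one small function over the same record, which is why the core format needs to be verbose enough to hold every field any target format requires.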

Some of the information found in this Wikipedia article has guided the research; in particular, its details about which formats each of the various applications supports have been useful.

Following is a brief description of the format candidates we’ve investigated, as well as our preliminary conclusions.

If you have experience with one or more of these formats and would like to help us make our decision, please post your comments below.

Mike Lichtenberg
Missouri Botanical Garden

Wednesday, April 15, 2009

PDF Article Metadata Analysis

In previous posts we have discussed the issues surrounding the identification of articles contained within BHL scanned books and the new interface we've developed that lets users build their own PDFs for download. In that interface (demo for The Journal of agricultural science, v.7), we ask users who are building a PDF of an article to contribute the article title, author(s), and subjects/tags. We store that information alongside the generated PDF and make it available for other users to search and download.

As this was our first attempt at crowdsourcing, we didn't know what kind of data quality to expect. We have been monitoring the data submitted since releasing the functionality in January, and have formed the criteria for a more formal analysis. A review of the metadata for a sample of 50 PDFs, out of a total of 802 generated between January 15, 2009 and the end of March, revealed the following trends:

  • 88% of articles were assigned article-level titles by users, indicating that they are comfortable entering metadata without a great deal of prompting. So far, the only guidance in the interface is "Are you generating a PDF containing the text of a single journal article or book chapter? If so, please help us out by providing the following information!"
  • 22% of the PDFs generated could not be considered true articles. They were determined to be arbitrary selections of pages.
  • 24% of the PDFs generated were not articles in the bibliographic sense but were species descriptive/relevant excerpts from larger works.
  • 50% of the PDFs generated could be considered true articles in the bibliographic sense, complete with identifiable titles, authors, and subjects.

Metadata Accuracy Stats
Accuracy was measured on a scale of low to high for title, author, and subject.

  • 55% of article title metadata was found to be highly accurate
  • 14% was considered medium accuracy, meaning the title was interpreted or modified
  • 27% was considered low accuracy, meaning the title was extrapolated from a non-obvious source, a poorly descriptive title was attributed even though the article title was available, or no article-level title was provided at all
  • 67% of author metadata was highly accurate; however, formatting issues will need to be addressed to reconcile differences in Firstname Lastname entries. Anything from Bianca Lipscomb to Lipscomb, B. to B. Lipscomb was found; users did not necessarily follow the formatting presented in the original text
  • 29% of author metadata was at medium accuracy, meaning that author names were either significantly abbreviated or interpreted from the source text
  • only 2% was at low accuracy, meaning that no author was attributed to the article even though one should have been
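
One possible way to reconcile the author-name variants noted above (Bianca Lipscomb, Lipscomb, B., B. Lipscomb) is to reduce each entry to a comparison key. The Python sketch below is an assumption on our part, not something BHL has implemented; the function name `author_key` is hypothetical, and it deliberately ignores harder cases such as surname particles ("van der"), suffixes, and corporate authors.

```python
def author_key(name):
    """Reduce a personal name to (surname, first initial) for grouping.

    Handles "Given Surname", "G. Surname" and "Surname, G." only.
    """
    name = name.strip()
    if not name:
        return ("", "")
    if "," in name:
        # "Surname, Given" form: everything before the comma is the surname.
        surname, _, given = name.partition(",")
    else:
        # "Given Surname" form: assume the last word is the surname.
        parts = name.split()
        surname, given = parts[-1], " ".join(parts[:-1])
    given = given.strip()
    initial = given[0].lower() if given else ""
    return (surname.strip().lower(), initial)


variants = ["Bianca Lipscomb", "Lipscomb, B.", "B. Lipscomb"]
keys = {author_key(v) for v in variants}
print(keys)
```

All three variants collapse to the same key, so they could be grouped for review. A key this coarse will also merge different people who share a surname and initial, so it is better suited to flagging candidates than to merging records automatically.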
Subjects were more difficult to analyze than titles and authors. Overall, I was satisfied with only 6 instances of subject attribution, i.e. those with appropriate subject and geographic keywords. Many users simply neglected subjects or used the original title for the subject. Either way, I think it is important that subjects be required metadata so that associations between articles in the repository can be traced. This is, of course, coming from a librarian's perspective.

Please use the comment form below for questions.

Bianca Lipscomb
Collections Manager, Biodiversity Heritage Library
lipscombb (at) si (dot) edu

Friday, April 10, 2009

Improved handling of diacritics in BHL searches

I wanted to let everyone know about a change that has been made to the search function of the BHL portal.

Until now, letters that include diacritics (for example, ó, ö, è, é, û) were treated differently than letters without diacritics.

This meant that in order to find titles, authors, or subjects that included diacritics, you had to search for an exact match on the diacritic. For example, to find all titles about "invertebrate zoology", you had to search twice: once for "invertebrate zoology" and once for "invertebrate zoölogy". (Or you had to search for something like "invertebrate zo" and hope you didn't get too much extra stuff in the search results.) Obviously, there are all sorts of problems with this limitation.

Starting immediately, searches in the BHL portal are accent-insensitive, so no distinction is made between letters with and without diacritics. This means that a search for "invertebrate zoology" will now find all nine titles that contain either "invertebrate zoology" or "invertebrate zoölogy". See the search results here: Another good example is a search for "Linne", which now returns instances of both "Linne" and "Linné".
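
For readers curious what accent-insensitive matching looks like in code, here is a minimal Python sketch. It is an illustration only and makes no claim about how the portal itself does this (a database might instead use an accent-insensitive collation). The function names `fold_accents` and `matches` are hypothetical.

```python
import unicodedata


def fold_accents(text):
    """Strip combining marks and fold case so 'zoölogy' compares equal to 'zoology'."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def matches(query, title):
    """Return True if the query occurs in the title, ignoring accents and case."""
    return fold_accents(query) in fold_accents(title)


print(matches("invertebrate zoology", "Invertebrate zoölogy"))
print(matches("Linne", "Caroli Linné"))
```

This approach only covers letters that decompose into a base letter plus a combining mark. Characters such as "ø" or "ß" have no such decomposition and would need an explicit mapping.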

While there is still more work to do to improve the search features, this is a good first step toward improving the quality of our search results.

Mike Lichtenberg
Missouri Botanical Garden

Wednesday, April 1, 2009

Return of the Dodo?! BHL joins the larger scientific community in expressing shock, excitement!

What appears to be a male Dodo bird (Raphus cucullatus), approximately two and a half feet tall, has been discovered today, April 1, nesting among the bushes in the butterfly garden of the Smithsonian's National Museum of Natural History. How the creature, which hasn’t been seen since the 17th century, has come to reside in Washington, DC remains a mystery. Smithsonian Security was notified of the bird's presence by an alert tourist who heard the loud syncopated calls of the once-thought-extinct bird. Nicknamed “Lazarus,” the Dodo exhibits a friendly curiosity about humans, clearly enjoying the excitement and attention its presence is generating.

Ornithologists and evolutionary biologists from around the world are currently pouring into Washington to investigate.

The Biodiversity Heritage Library contains a number of references to the Dodo's current taxonomic name (Raphus cucullatus), as well as over two hundred references to the earlier scientific synonym (Didus ineptus). You can also find references to the taxonomic family (Raphidae) and more than 400 references to Didus, the synonym for the genus.

The description, by Linnaeus, can be found in the 1766 edition of Systema naturae.

A few illustrations of the Dodo from the BHL can be found here:

Of particular interest is this short article by John V. Thompson: "Art. IV. Contributions towards the Natural History of the Dodo (Didus ineptus Lin.), (Fig. 107.) a Bird which appears to have become extinct towards the End of the Seventeenth or Beginning of the Eighteenth Century" in Magazine of natural history and journal of zoology, botany, mineralogy, geology and meteorology (volume 2, 1829), pp. 442-48. At the time the article was written, there was still debate about extinction and whether species known to contemporary humans could cease to exist. Thompson's article also includes a summary of the then-known history of the Dodo.

Learn more about life at the Biodiversity Heritage Library!

Reported by Erin Thomas, Smithsonian Institution Libraries