Wednesday, April 15, 2009

PDF Article Metadata Analysis

In previous posts we have discussed the issues surrounding the identification of articles contained within BHL scanned books and the new interface we've developed that lets users build their own PDFs for download. In that interface (demo for The Journal of agricultural science, v.7), we ask users who are building a PDF of an article to contribute the article title, author(s), and subjects/tags. We store that information alongside the generated PDF and make it available for other users to search and download.

As this was our first attempt at crowdsourcing, we didn't know what kind of data quality to expect. We have been monitoring the data submitted since releasing the functionality in January, and have formed the criteria for a more formal analysis. A review of the metadata for a sample of 50 PDFs, out of a total of 802 generated between January 15, 2009 and the end of March, revealed the following trends:

  • 88% of articles were assigned article-level titles by users, indicating that they are comfortable entering metadata without a great deal of prompting. So far, the only guidance in the interface is "Are you generating a PDF containing the text of a single journal article or book chapter? If so, please help us out by providing the following information!"
  • 22% of the PDFs generated could not be considered true articles. They were determined to be arbitrary selections of pages.
  • 24% of the PDFs generated were not articles in the bibliographic sense but were species-descriptive or species-relevant excerpts from larger works.
  • 50% of the PDFs generated could be considered true articles in the bibliographic sense, complete with identifiable titles, authors, and subjects.
Metadata Accuracy Stats
Accuracy was measured on a scale of low to high for title, author, and subject.


  • 55% of article title metadata was found to be highly accurate
  • 14% was considered medium accuracy, meaning the title was interpreted or modified
  • 27% was considered low accuracy, meaning the title was extrapolated from a non-obvious source, OR an article title was available but a poorly descriptive one was attributed, OR no article-level title was provided at all
  • 67% of author metadata was highly accurate; however, formatting issues will need to be addressed to reconcile differences in Firstname Lastname entries. Anything from Bianca Lipscomb to Lipscomb, B. to B. Lipscomb was found; users did not necessarily follow the formatting presented in the original text
  • 29% of author metadata was at medium accuracy, meaning that author names were either significantly abbreviated or interpreted from the source text
  • only 2% of author metadata was at low accuracy, meaning that no author was attributed to the article even though it should have been
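
As an illustration of the author formatting issue noted above, here is a minimal sketch of how the three name layouts we found could be collapsed to a single comparison key. It is only an assumption about one possible approach, not something BHL has implemented: the function name normalize_author and the "Lastname, F." target form are hypothetical, and the sketch does not handle multi-word surnames, suffixes, or multiple authors in one field.

```python
import re


def normalize_author(name: str) -> str:
    """Reduce a personal name to a 'Lastname, F.' key (hypothetical helper).

    Handles only three layouts: 'Firstname Lastname', 'Lastname, F.'
    and 'F. Lastname'. Multi-word surnames are NOT handled.
    """
    name = " ".join(name.split())  # collapse runs of whitespace
    if "," in name:
        # 'Lastname, Given' layout: surname comes before the comma
        last, _, given = name.partition(",")
    else:
        # 'Given Lastname' layout: assume the final word is the surname
        parts = name.split(" ")
        last, given = parts[-1], " ".join(parts[:-1])
    # Keep only the first letter of each given name or initial
    tokens = [t for t in re.split(r"[ .]+", given.strip()) if t]
    initials = " ".join(f"{t[0].upper()}." for t in tokens)
    last = last.strip()
    return f"{last}, {initials}" if initials else last


for variant in ("Bianca Lipscomb", "Lipscomb, B.", "B. Lipscomb"):
    print(variant, "->", normalize_author(variant))
```

A key like this would be used only for matching and searching; the name as the user entered it could still be stored unchanged.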
Subjects were more difficult to analyze than titles and authors. I was satisfied with only 6 instances of subject attribution, i.e. those with appropriate subject and geographic keywords. Many users simply neglected subjects or used the original title for the subject. Either way, I think it is important that subjects be required metadata in order to trace associations between articles in the repository. This is, of course, coming from a librarian's perspective.

Please use the comment form below for questions.

Bianca Lipscomb
Collections Manager, Biodiversity Heritage Library
lipscombb (at) si (dot) edu