PDF Article Metadata Analysis

In previous posts we have discussed the issues surrounding the identification of articles contained within BHL scanned books and the new interface we’ve developed that lets users build their own PDFs for download. In that interface (demo for The Journal of agricultural science, v.7), we ask users who are building a PDF of an article to contribute the article title, author(s), and subjects/tags. We’ll store that information alongside the generated PDF and make it available for other users to search and download.

As this was our first attempt at crowdsourcing, we didn’t know what kind of data quality to expect. We have been monitoring the data submitted since releasing the functionality in January, and have formed the criteria for a more formal analysis. A review of the metadata for a sample of 50 PDFs, out of a total of 802 generated between January 15, 2009 and the end of March, revealed the following trends:

Overview

88% of articles were assigned article-level titles by users, indicating that they are comfortable entering metadata without a great deal of prompting. So far, the only guidance in the interface is “Are you generating a PDF containing the text of a single journal article or book chapter? If so, please help us out by providing the following information!”
22% of the PDFs generated could not be considered true articles. They were determined to be arbitrary selections of pages.
24% of the PDFs generated were not articles in the bibliographic sense but were species-descriptive or species-relevant excerpts from larger works.
50% of the PDFs generated could be considered true articles in the bibliographic sense, complete with identifiable titles, authors, and subjects.
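
The overview percentages can be converted back into PDF counts as a quick sanity check. The sketch below is only illustrative: it assumes the percentages were computed over the full sample of 50, and the variable names are made up for this example.

```python
SAMPLE_SIZE = 50  # PDFs reviewed, as stated in the post

# Overview percentages as reported above (category labels are shorthand).
overview_percent = {
    "arbitrary page selections": 22,
    "species-related excerpts": 24,
    "true articles": 50,
}

# Assumption: each percentage is a share of the full 50-PDF sample.
counts = {
    label: round(pct * SAMPLE_SIZE / 100)
    for label, pct in overview_percent.items()
}

titled = round(88 * SAMPLE_SIZE / 100)  # PDFs given an article-level title
classified = sum(counts.values())
unclassified = SAMPLE_SIZE - classified

for label, n in counts.items():
    print(f"{label}: {n}")
print(f"with article-level titles: {titled}")
print(f"not placed in any category: {unclassified}")
```

Under that assumption the three categories account for 48 of the 50 PDFs, which is consistent with the note in the comments on this post that 2 PDFs in the sample were not analyzed.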

Metadata Accuracy Stats
Accuracy was rated low, medium, or high for title, author, and subject.

Titles

55% of article title metadata was found to be highly accurate
14% was considered medium accuracy: the title was interpreted or modified
27% was considered low accuracy: the title was extrapolated from a non-obvious source, a poorly descriptive title was supplied even though the article title was available, or no article-level title was provided at all

Authors

67% at high accuracy; however, formatting issues will need to be addressed to reconcile the different ways names were entered. Anything from Bianca Lipscomb to Lipscomb, B. to B. Lipscomb was found; users did not necessarily follow the formatting presented in the original text
29% at medium accuracy, meaning that author names were either significantly abbreviated or interpreted from the source text
Only 2% at low accuracy, meaning that no author was attributed to the article even though one should have been
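
One way the name-formatting differences might be reconciled is to reduce each entry to a single canonical form. The sketch below is a minimal illustration, not something BHL has implemented: the function name `normalize_author` and the choice of "Lastname, F." as the target form are assumptions, and it only handles a single author with a one-word surname.

```python
import re


def normalize_author(name: str) -> str:
    """Return 'Lastname, F.' for a few common single-author spellings.

    Hypothetical helper: handles 'First Last', 'Last, F.' and 'F. Last'.
    It does not handle multi-word surnames or multiple authors.
    """
    name = " ".join(name.split())  # collapse stray whitespace
    if "," in name:
        # 'Lipscomb, B.' style: surname comes before the comma
        last, given = (part.strip() for part in name.split(",", 1))
    else:
        # 'Bianca Lipscomb' or 'B. Lipscomb' style: surname is the last word
        parts = name.split(" ")
        last, given = parts[-1], " ".join(parts[:-1])
    # Reduce each given name or initial to a single capital letter plus a period
    initials = " ".join(
        f"{token[0].upper()}." for token in re.split(r"[ .]+", given) if token
    )
    return f"{last}, {initials}" if initials else last


for entry in ["Bianca Lipscomb", "Lipscomb, B.", "B. Lipscomb"]:
    print(entry, "->", normalize_author(entry))
```

All three variants mentioned above collapse to the same string, so they could be matched as one author. Reducing given names to initials loses information, so a real cleanup would probably keep the original entry alongside the normalized one.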

Subjects
Subjects were more difficult to analyze than titles and authors. I was satisfied with only 6 instances of subject attribution, i.e., those with appropriate subject and geographic keywords. Many users simply neglected subjects or reused the original title as the subject. Either way, I think it is important that subjects be required metadata so that associations between articles in the repository can be traced. This is, of course, coming from a librarian’s perspective.

Please use the comment form below for questions.

analysis, metadata, PDFs

April 15, 2009

Written by Bianca Crowley

Bianca Crowley is the Digital Collections Manager for the Biodiversity Heritage Library, headquartered at Smithsonian Libraries and Archives. She has spent her career in this role helping consortium Partners grow and curate BHL's collection. Her main responsibilities revolve around program administration and collection management, but you can also find her tackling technical development, documentation, copyright, and cataloging issues as time allows. She received her MSLIS from The Catholic University of America.

4 Comments

Mobile Surveillance February 10, 2010 at 6:28 am

This is great, actually I don't know about it, It is really very helpful for me. I used simple pdf creation techniques till now. but its interesting & I have to start it soon.

dinesh June 9, 2009 at 10:45 am

I recently used the pdf creation feature, but was not aware that title etc metadata would be used in this way, and so I might have unwittingly contribute to its inaccuracy. Why not make the filling in of metadata a compulsory requirement, and I'm sure people will be happy to do so.

And another thing, why is pdf creation limited to 50 pages?

FabulousLadyB April 22, 2009 at 10:37 am

The accuracy results were based on the total sample. 2 articles out of the sample were not analyzed because one was much too large to download, over 160+ MBs and the other was a repeat. ~4% of sample not analyzed.

Martin April 15, 2009 at 7:05 pm

Were the metadata accuracy results based on the total sample or on the 88% that were deemed to have some level of article level metadata?

About BHL

The Biodiversity Heritage Library (BHL) is the world’s largest open access digital library for biodiversity literature and archives. Headquartered at the Smithsonian Libraries and Archives in Washington, D.C., BHL operates as a worldwide consortium of natural history, botanical, research, and national libraries working together to digitize the natural history literature held in their collections and make it freely available for open access as part of a global “biodiversity community.”
