The Biodiversity Heritage Library is committed to providing free and open access to over 500 years of natural history literature from across the globe. Towards that goal, the Library currently contains over 45 million pages of biodiversity content, representing over 155,000 volumes and 90,000 titles.
However, hosting the content online is just part of our vision to “inspire discovery through free access to biodiversity knowledge.” Users must be able to identify content relevant to their work and interests from amongst this vast corpus. For this, metadata is all-important.
Metadata is “data that describes other data.” BHL’s metadata describes the digital resources in our collection, providing not only author, title, volume, publication year and place, and article information, but also keywords describing the topics discussed within each book. Eventually, full-text searching will also allow users to search across the actual text within a book to discover items relevant to their search parameters.
|Dr. Jane Bromley, with her home-grown banana plant.|
Dr. Jane Bromley knows the importance of metadata all too well. It is a critical component of her daily work.
Bromley is a research fellow at The Open University, where she has worked since 2012 as part of a subgroup of the Natural Language Processing group. Under the EU FP7 funded agINFRA project, which aims to promote data sharing in agricultural sciences, Bromley’s subgroup studies information extraction from legacy biodiversity literature. Specifically, they are seeking to enhance an existing specialist agricultural resource, AGRIS.
AGRIS is a collaborative network of more than 150 institutions providing free access to agricultural information in the form of more than 7 million bibliographic references on agricultural research and technology. This multilingual bibliographic database also contains links to related data resources on the Web.
BHL contains a vast amount of agricultural information. Searching the subject “agriculture” alone produces over 2,100 books and journals in BHL. Recognizing the potential of these resources, Bromley and her colleagues, including Dr. David King and Dr. David Morse, developed a process to enhance AGRIS with BHL content. They relied on BHL’s metadata to do this.
“BHL is a unique resource for agricultural researchers,” explains Bromley. “Its long-term view can prove invaluable in locating wild relatives of crops and understanding their relationship to local habitats and ecosystems. It is the only way to access this breadth of biodiversity literature electronically.”
To filter BHL for relevant agricultural content, Bromley and her team downloaded the “Title Table” file from BHL, which is a list of all titles available in BHL with the associated URL and Call Number. After also downloading the bibliographic information for each item in MODS format, Bromley used the Call Number field to identify agricultural material according to the Library of Congress Classification (LCC) scheme. Call Numbers are alphanumeric codes that identify the shelf location of an item in a library, but the specific combination of letters and numbers also allows the library to arrange items by subject, so a Call Number serves as an indication of the subject of the book itself.
“I selected items with code starting ‘S, SB, SD, SF, SH, SK’ meaning class Agriculture or one of its subclasses,” says Bromley. “I then selected only those items whose bibliographic data said they were of genre: book, thesis, article or bibliography, as AGRIS only accepts: books, book chapters, thesis, journal articles, conference papers, and bibliography. BHL contains complete journals rather than journal articles, so these were not included. That meant that items such as Canadian Journal of Agricultural Science or Journal of Agricultural Research were omitted. For items that passed both of these matches I also scraped the item’s URL to recover the location of the PDF image of the item and the OCR text.”
Finally, Bromley wrote a new MODS file for each item matching her filtering criteria, resulting in 12,645 MODS files with the URLs to the associated PDF and OCR files appended. The AGRIS team converted each item’s MODS data and included it in AGRIS, where the records are available here.
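As a rough illustration of the two filtering steps described above, here is a minimal Python sketch. It is not Bromley’s actual code, and the article does not say what language she used. The record layout and the field names `call_number` and `genre` are assumptions made for this example; they are not the real column names of the BHL Title Table or the MODS element names.

```python
import re

# LCC class S (Agriculture) and the subclasses named in the quote above.
AGRICULTURE_CLASSES = {"S", "SB", "SD", "SF", "SH", "SK"}
# Genres the quote says were kept for AGRIS.
ACCEPTED_GENRES = {"book", "thesis", "article", "bibliography"}


def lcc_class(call_number):
    """Return the leading letters of a call number, upper-cased ("" if none)."""
    match = re.match(r"\s*([A-Za-z]+)", call_number or "")
    return match.group(1).upper() if match else ""


def is_agricultural(record):
    """True if a record passes both the call-number and the genre filter.

    `record` is a hypothetical dict with "call_number" and "genre" keys.
    """
    genre = (record.get("genre") or "").lower()
    return (lcc_class(record.get("call_number")) in AGRICULTURE_CLASSES
            and genre in ACCEPTED_GENRES)


# Invented records, for illustration only.
records = [
    {"call_number": "SB123 .A5", "genre": "book"},     # agriculture, kept
    {"call_number": "QK495 .B7", "genre": "book"},     # botany class, dropped
    {"call_number": "S1 .C2", "genre": "journal"},     # whole journal, dropped
]
selected = [r for r in records if is_agricultural(r)]
```

Matching on the leading letters, rather than on a plain string prefix, keeps an unrelated code that merely begins with “S” from being swept in.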
Bromley’s experiments with filtering BHL are written up as a conference paper that she presented at the 8th Metadata and Semantics Research Conference in Karlsruhe as part of a special track Metadata and Semantics for Open Repositories, Research Information Systems and Data Infrastructures, jointly chaired by Imma Subirats (FAO) and by Nikos Houssos (Greek National Documentation Centre). You can download a copy of the paper here: http://oro.open.ac.uk/41117/
Bromley’s filtering strategy resulted in high precision but lower recall. Nearly all of the selected material was about agriculture, but many items that contain agricultural information were omitted because they are classified under other call numbers. To address this omission, Bromley would eventually like to download the OCR text for every item and mine it for agriculture-related terms.
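For readers unfamiliar with the terms, precision is the share of selected items that are actually relevant, and recall is the share of all relevant items that were selected. The short Python sketch below shows the arithmetic. The item identifiers are invented for illustration and are not figures from Bromley’s study.

```python
def precision_recall(selected, relevant):
    """Return (precision, recall) for a set of selected items.

    precision = relevant items selected / all items selected
    recall    = relevant items selected / all relevant items
    """
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: the filter picks three items, all relevant,
# but misses three other relevant items filed under other call numbers.
selected = {"item1", "item2", "item3"}
relevant = {"item1", "item2", "item3", "item4", "item5", "item6"}
precision, recall = precision_recall(selected, relevant)
# precision is 1.0 and recall is 0.5
```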
David Livingstone’s Missionary Travels and Researches in South Africa is a fantastic example of the material that is missed when a strategy relies strictly on Call Numbers.
“This was the document that proved to me that in order to find all the relevant agricultural material in BHL we need to mine the whole texts,” emphasizes Bromley. “It’s a seminal book in the British consciousness, which I never thought I’d get to read as part of my research. And, interestingly, it turns out to be a key document in my research. Despite containing information about: domestic animals, The Boers as Farmers, Discovery of grape-bearing vines, The sugar-cane, Coffee Estate, and Coffee Plantations amongst others (all listed in the table of contents), which are all relevant to agricultural research, there is no way to tell from the title or the Subjects that it contains these nuggets.”
|“Bakalahari women filling their egg-shells and water-skins at a pool in the desert.” Livingstone, David. Missionary Travels and Researches in South Africa (1858). http://biodiversitylibrary.org/page/39590153.|
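The kind of full-text mining Bromley has in mind could start with something as simple as counting known agricultural terms in an item’s OCR text. The Python sketch below is only an illustration of that idea under stated assumptions: the term list is a hypothetical handful of words, the sample sentence is invented (loosely echoing the chapter headings she mentions), and a real pipeline would need a proper vocabulary and handling of OCR errors.

```python
import re
from collections import Counter

# Hypothetical term list; a real project would use a curated vocabulary.
AGRICULTURAL_TERMS = {
    "farmers", "crops", "coffee", "sugar-cane", "vines", "plantations",
}


def count_terms(text, terms=AGRICULTURAL_TERMS):
    """Count occurrences of each term in `text`, ignoring case.

    Tokens are runs of letters, optionally joined by hyphens,
    so "sugar-cane" is kept as one token.
    """
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return Counter(token for token in tokens if token in terms)


# Invented sample text standing in for an item's OCR output.
sample = "The Boers as farmers. Coffee plantations and the sugar-cane."
hits = count_terms(sample)
```

An item with many hits could then be flagged for review even when its Call Number falls outside class S.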
BHL staff are constantly working to improve our Library’s metadata. A team of librarians throughout our consortium tackles ongoing work to enhance pagination information, merge duplicate title and author entries, associate related titles through bibliographic hyperlinks, and generally correct any errors that exist. Furthermore, current projects such as Mining Biodiversity and Purposeful Gaming are supporting OCR correction that will eventually enable full-text searching. The NEH Art of Life project has allowed us to systematically identify and manually classify illustrations throughout the BHL corpus, and in the next few weeks we will be calling for volunteers to help us tag a set of images in Flickr with keywords that describe the illustrations’ content. These tags will eventually be ingested into BHL to further enhance our metadata.
You can help us improve our metadata! If you notice a problem or area for improvement, send us feedback! Real BHL librarians will not only answer your feedback but use it to help direct our metadata improvement efforts.
We are honored to present Dr. Bromley’s work using BHL and its metadata to enhance biodiversity databases, expanding the reach of our content to new audiences and supporting a wide range of research initiatives. Do you have an example of how you’ve used BHL to support your research? Tell us about it by sending an email to firstname.lastname@example.org.