Thursday, August 30, 2012

Interested in improving access to millions of digital images?

The Biodiversity Heritage Library (BHL) has made significant contributions to the research community over the past five years.  One of the largest has been to successfully digitize a significant mass of biodiversity literature (nearly 40 million pages) and make that literature available for open access and responsible use as a part of a global “biodiversity commons.”

Yet despite this success, BHL continues to face several challenges with access to and distribution of its digitized content. One of these is that users cannot easily find the millions of natural history illustrations hidden within the pages of the BHL corpus. Only a small percentage of pages have been tagged as having illustrations because this is currently a labor-intensive manual task (a small selection of the diversity of BHL images can be viewed in its Flickr stream). Even once pages are tagged, users still cannot search on an illustration’s content using criteria such as species names, dates, and creators because the images have not been described at that level of detail.

The NEH-funded Art of Life project has set out to solve this problem both by developing an algorithm to automatically identify which pages contain illustrations and by creating a schema to further classify and guide the description of the illustrations so as to increase their accessibility to users. Once the algorithm tags pages containing illustrations, the illustrations will be pushed out to image-sharing platforms such as Flickr and Wikimedia Commons for crowdsourcing of the descriptions. The schema will provide guidance on the recording of fields and their values. (See the example of a BHL illustration marked up with the Art of Life schema below.)

Example of BHL illustration marked up with proposed Art of Life schema


Here’s how you can help

A draft of the schema has been developed; we are looking for feedback on how well it will serve the needs of five primary audiences that we believe would benefit from access to these illustrations: 1) Artists, 2) Biologists, 3) Humanities Scholars, 4) Librarians, and 5) Educators. We particularly want to know if the schema incorporates the access points by which these user groups want to find images, or whether they might want to search for images based on fields not incorporated in the schema.

Whether you anticipate being a user of the illustrations from the BHL or you are a subject specialist or cataloger interested in helping us describe their content, we are interested in hearing from you as to how this schema may be improved to support the description of and access to these images.

We have provided a brief survey for feedback here:


Feedback can also be posted to this blog, added directly on the schema draft (with Google Docs comments) or emailed to me ( )

Trish Rose-Sandler, Data Analyst, Missouri Botanical Garden


  1. Fantastic project. I will watch with interest.

    For botanical illustrators, one thing I'd suggest is to add a descriptor if it is monochrome or colour. I'd also like to search on images larger than a certain size, so the dimensions of the image would be good.

    It would also be useful to be able to query on the higher order names and geography. An example would be to find all the colour illustrations for plants in the family Rutaceae that occur in Australia.

  2. Good points Peter. We're hoping the algorithm can do the detection and tagging of b/w vs. color so it doesn't have to be done manually. We expect that info would just be thrown into subject (do you think it needs its own element?)

    Re: recording of family names - I wonder if this could be accomplished by linking the query into a taxonomic name service like UBio's namebank rather than having to store the family name within the illustration's record?

    Re: recording of geographic locations - that could be useful too although difficult to determine from looking at the illustration or its caption. But the book from which it came often has geographic subjects affiliated with it and those terms could be extracted along with the bibliographic citation and attached to the image before it gets pushed to other image portals.

    1. Hopefully the algorithm will be able to distinguish monochrome as well. I'd probably vote for its own field, but there is a lot to be said for keeping the schema as simple as possible (but not too simple). As long as there is a way to filter on this characteristic.

      As for the geography and higher order names, there is no need to replicate these here if they can be linked from elsewhere.
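
The query discussed in this thread (colour illustrations of Rutaceae that occur in Australia) can be sketched as below. This is only a minimal illustration under stated assumptions, not the Art of Life schema or any real BHL or uBio API: the `Illustration` record, its field names, the `is_colour` flag, and the in-memory `GENUS_TO_FAMILY` table (a stand-in for an external taxonomic name service) are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Illustration:
    """Hypothetical illustration record; field names are assumptions."""
    title: str
    species: str        # scientific name recorded for the illustration
    is_colour: bool     # hypothetical flag (could instead live in a subject field)
    book_places: set[str] = field(default_factory=set)  # geographic subjects taken from the source book


# Stand-in for an external taxonomic name service: genus -> family.
# The family is looked up at query time rather than stored on each record.
GENUS_TO_FAMILY = {
    "Boronia": "Rutaceae",
    "Correa": "Rutaceae",
    "Citrus": "Rutaceae",
    "Eucalyptus": "Myrtaceae",
}


def family_of(species: str) -> Optional[str]:
    """Resolve a family from the genus part of a scientific name."""
    parts = species.split()
    if not parts:
        return None
    return GENUS_TO_FAMILY.get(parts[0])


def find_illustrations(records, family, place, colour_only=True):
    """Filter records by resolved family, book-level geography, and colour."""
    return [
        r for r in records
        if (r.is_colour or not colour_only)
        and family_of(r.species) == family
        and place in r.book_places
    ]


records = [
    Illustration("Boronia plate", "Boronia serrulata", True, {"Australia"}),
    Illustration("Correa sketch", "Correa alba", False, {"Australia"}),
    Illustration("Citrus engraving", "Citrus medica", True, {"Italy"}),
    Illustration("Eucalyptus plate", "Eucalyptus globulus", True, {"Australia"}),
]

colour_titles = [r.title for r in find_illustrations(records, "Rutaceae", "Australia")]
all_titles = [
    r.title
    for r in find_illustrations(records, "Rutaceae", "Australia", colour_only=False)
]
```

Here the record stores only the species name and the places inherited from the book, while the family comes from the lookup, which mirrors the suggestion that higher order names need not be replicated in the illustration's record if they can be linked from elsewhere.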

  3. The Wikimedia Commons file has links out to the original BHL page and to the image posting on Flickr. Is the intent of Wikimedia Commons to offer more of the metadata and linkages not available on these other two? And is it planned to be more two-way? At present, going to the BHL page shows the original book but none of the illustration information, and gives no hint that illustration and page info is documented elsewhere.

  4. CN,

    Yes, the advantage of Wikimedia Commons vs. Flickr is that the full schema can be implemented in it via templates, whereas Flickr will only be able to express a portion of the schema.
    And yes, the envisioned workflow is that any metadata added to the illustrations in either Flickr or Wikimedia Commons will be brought back into the BHL portal for searching and viewing.