First Meeting of the Mining Biodiversity project

Meet our international partners to extract data from BHL

Mining Biodiversity (MiBio project) is one of the projects that won during the third round of the transatlantic Digging Into Data Challenge, a competition aiming to promote the development of innovative computational techniques that can be applied to big data in the humanities and social sciences. The project is an international collaboration between the National Centre for Text Mining (UK), Missouri Botanical Garden (US) and Dalhousie University’s Big Data Analytics Institute (Canada) and Social Media Lab (Canada), along with colleagues from the Encyclopedia of Life and the Smithsonian Institution.

We will integrate novel text mining methods, visualization, crowdsourcing and social media into the BHL to provide a semantic search system that allows users to explore search results according to multiple information dimensions or facets.  The goal is to transform BHL into a next-generation social digital library resource that facilitates the study and discussion (via social media integration) of legacy science documents on biodiversity by a worldwide community.

The project has five major components, covered in 9 Work Packages (WP):

  1. Automatic correction of errors in OCR using Google n-grams by our colleagues at the Big Data Analytics Institute (WP2).
  2. Crowdsourcing the annotation of semantic metadata (concepts and events) in legacy texts (WP5).
  3. Extract metadata (terms, concepts and significant events) automatically and track their change over time (WP3 & WP4) to facilitate semantic search (WP6) implemented with NaCTeM.
  4. Use interactive visualization techniques to manage the search results, in collaboration with Dalhousie (WP7).
  5. Design a social media layer as an environment for interaction and collaboration on science, education, awareness and outreach, led by our colleagues at the Social Media Lab (WP8).

On February 17th, the first face-to-face meeting in Manchester, UK, marked the start of this new project.  The Principal Investigators of the project, Dr. Anatoliy Gruzd from Dalhousie University (Canada) and William Ulate from Missouri Botanical Garden (USA), met with Dr. Sophia Ananiadou at the University of Manchester’s National Centre for Text Mining (NaCTeM), where her colleagues involved in the project showcased the tools and services they have developed and will be adapting for our project.

The National Centre for Text Mining (NaCTeM)

NaCTeM has developed text mining services based on a number of generic natural language processing tools. One of them is Argo, a Web-based workflow construction platform for text mining, implemented on top of the OASIS Unstructured Information Management Architecture (UIMA) standard for interoperability among information processing components.

Several of the NaCTeM tools have been developed as modules that can be adapted and used as components in workflows: each module receives input from the previous one, performs a task and passes the results to the next.  Named-entity recognizers, for example, receive text pre-processed into smaller units (sentences, tokens) and extract features automatically according to the statistical models used by different entity taggers (specialized in gene, chemical, anatomical, habitat or species information, for example).
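To make the idea of chained modules more concrete, here is a minimal sketch in Python. It is not how Argo or the NaCTeM taggers are implemented: the dict-based document, the function names and the tiny lookup lexicon are all assumptions made for illustration, and the lexicon only stands in for a trained statistical model.

```python
import re
from typing import Callable, Dict, List

# A document travels between modules as a plain dict. This is an assumption
# of the sketch; real UIMA-based workflows exchange richer data structures.
Document = Dict[str, object]
Module = Callable[[Document], Document]


def split_sentences(doc: Document) -> Document:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    text = str(doc["text"])
    doc["sentences"] = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    return doc


def tokenize(doc: Document) -> Document:
    # Words and punctuation marks become separate tokens, sentence by sentence.
    doc["tokens"] = [re.findall(r"\w+|[^\w\s]", s) for s in doc["sentences"]]
    return doc


# Hypothetical toy lexicon standing in for a statistical entity tagger.
TOY_LEXICON = {"quercus": "GENUS", "missouri": "PLACE"}


def tag_entities(doc: Document) -> Document:
    # Label every token that appears in the toy lexicon.
    doc["entities"] = [
        (token, TOY_LEXICON[token.lower()])
        for sentence in doc["tokens"]
        for token in sentence
        if token.lower() in TOY_LEXICON
    ]
    return doc


def run_workflow(text: str, modules: List[Module]) -> Document:
    # Each module receives the output of the previous one.
    doc: Document = {"text": text}
    for module in modules:
        doc = module(doc)
    return doc


result = run_workflow(
    "Quercus alba grows in Missouri. It is an oak.",
    [split_sentences, tokenize, tag_entities],
)
print(result["entities"])
```

Because every module has the same signature, a tagger can be swapped or a new step appended without touching the rest of the chain, which is the property the workflow approach relies on.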

Besides named-entity recognizers, workflows can include linkers, which automatically link names or concepts found in text to entries in external vocabularies via unique identifiers, using a string similarity method.
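As a rough illustration of linking by string similarity, the sketch below matches a mention against a small vocabulary and returns the identifier of the closest entry. The vocabulary, its "EX:" identifiers, the function names and the 0.85 threshold are all invented for this example, and the similarity measure (Python's difflib ratio) is simply a convenient choice, not necessarily the method NaCTeM uses.

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple

# Hypothetical vocabulary: the identifiers are invented for illustration.
VOCABULARY = {
    "EX:0001": "Quercus alba",
    "EX:0002": "Quercus rubra",
    "EX:0003": "Acer rubrum",
}


def similarity(a: str, b: str) -> float:
    # Case-insensitive similarity in [0, 1]; 1.0 means identical strings.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link(mention: str, threshold: float = 0.85) -> Optional[Tuple[str, float]]:
    # Return (identifier, score) for the best match, or None when no
    # vocabulary entry is similar enough to the mention.
    best_id, best_score = None, 0.0
    for entry_id, label in VOCABULARY.items():
        score = similarity(mention, label)
        if score > best_score:
            best_id, best_score = entry_id, score
    if best_id is None or best_score < threshold:
        return None
    return best_id, best_score


print(link("Quercvs alba"))   # a mention with an OCR-style error
print(link("Homo sapiens"))   # a name absent from the toy vocabulary
```

A threshold keeps the linker from attaching an identifier to a name that is simply missing from the vocabulary; where to set it is a trade-off between missed links and wrong ones.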

Argo’s functionality allows workflows to be deployed as Web services so they can be invoked by external applications, just as BHL currently invokes name-finding Web services to find the taxa within the text.

To develop these named-entity recognizers and linkers for the biodiversity domain, it is first necessary to identify which entity types are of interest (in our case, these could be names of persons, places and species, among others) and which vocabularies to link to for each type.  To assist in this process, NaCTeM has also developed term extraction tools like TerMine.  TerMine detects terms and acronyms in input text and can be used to build a term inventory for biodiversity.  This is the initial task for our colleagues at Missouri Botanical Garden and the Smithsonian: finding the authoritative sources of terms (vocabularies, ontologies, thesauri, gazetteers, etc.) that will help build the term inventory and then train the entity taggers to be used in our workflows.

NaCTeM has also done substantial work on event extraction, i.e., the extraction of associations or interactions between concepts or entities.  This experience will help us identify and extract the types of events that scientists, historians and other scholars have long wanted to extract from the BHL corpus (like behavior, habitat, trophic relations, geographic range and others) for our own named entities: species, people and places throughout time.  Finally, NaCTeM’s extensive experience developing customized semantic search engines like KLEIO, ISHER and Europe PubMed Central EvidenceFinder will help us provide enhanced semantic search over the BHL corpus text, allowing users to explore results according to multiple information dimensions or facets.
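Exploring results by facets can be pictured with a small Python sketch. The search hits, field names and values below are invented, and a real semantic search engine would compute facets from extracted metadata inside its index rather than over a list in memory.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical search hits; titles, fields and values are invented.
HITS: List[Dict[str, str]] = [
    {"title": "Notes on oaks", "species": "Quercus alba",
     "place": "Missouri", "decade": "1890s"},
    {"title": "Flora of the Ozarks", "species": "Quercus alba",
     "place": "Arkansas", "decade": "1910s"},
    {"title": "Maples of the Midwest", "species": "Acer rubrum",
     "place": "Missouri", "decade": "1910s"},
]


def facet_counts(hits: List[Dict[str, str]], facet: str) -> Counter:
    # How many hits fall under each value of the given facet.
    return Counter(hit[facet] for hit in hits)


def refine(hits: List[Dict[str, str]], **selected: str) -> List[Dict[str, str]]:
    # Keep only the hits that match every selected facet value.
    return [
        hit for hit in hits
        if all(hit.get(facet) == value for facet, value in selected.items())
    ]


print(facet_counts(HITS, "species"))
missouri_hits = refine(HITS, place="Missouri")
print([hit["title"] for hit in missouri_hits])
```

The counts are what a user would see next to each facet value, and choosing a value narrows the result list while the remaining facets stay available for further refinement.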

Additional information on the tools and services can be found at:

For an explanation of what unstructured information is and the terminology of the process around it, see the introduction to the UIMA 1.0 Standard.

Or read more details about these or some of the other service systems and tools that NaCTeM has developed.

The Social Media Lab

On Monday, February 17th, 2014, our group was invited to attend a talk by our project colleague Dr. Anatoliy Gruzd, given as part of the third Social Media Workshop, which covered the outreach and impact aspects of the International Centre for Social Media Research at the University of Manchester. He presented the research done at the Dalhousie University Social Media Lab: how to make sense of the huge quantity of data, and the new methods for collecting information when studying online social networks through analysis and visualization.

For our project, the staff at Dalhousie have started investigating which users and communities are currently accessing, commenting on and sharing records from BHL across various social media platforms, such as Twitter and Flickr, and in what context.  For this work they will employ tools developed by the Social Media Lab as well as tools developed by third parties.  Their goal is to add a social layer that integrates content from different biodiversity fora and social media sites with BHL via a user-friendly interface, fostering a community of users that could use BHL as an environment for sharing digital objects.

For more information, take a look at:

And some of the other publications by Dr. Gruzd and the Social Media Lab staff.

I hope this gives an idea of the work ahead and a better sense of what the project attempts to do and how it aims to do it.  We will keep you informed as results become available, but in the meantime, let us know how you envision yourself using BHL as a social digital library.  What information would you like to track, and how would you like to access it?  Tell us which vocabularies you would need to see included and what types of named entities and associations you would want tagged in the BHL corpus.

This project is made possible in part by a grant from the Institute of Museum and Library Services [Grant number LG-00-14-0032-14].

Written by

William Ulate served as the BHL Technical Director from 2012 to 2014. Prior to this, he served as BHL's Global Coordinator.