Meet our international partners who will extract data from BHL
We will integrate novel text mining methods, visualization, crowdsourcing and social media into the BHL to provide a semantic search system that allows users to explore search results according to multiple information dimensions or facets. The goal is to transform BHL into a next-generation social digital library resource that facilitates the study and discussion (via social media integration) of legacy science documents on biodiversity by a worldwide community.
Relations between the Work Packages of the project
The project has five major components, covered by nine Work Packages (WP):
- Automatically correct OCR errors using Google n-grams, led by our colleagues at the Big Data Analytics Institute (WP2); a toy sketch of this idea follows the list.
- Crowdsource the annotation of semantic metadata (concepts and events) in legacy texts (WP5).
- Automatically extract metadata (terms, concepts and significant events) and track their change over time (WP3 & WP4) to facilitate semantic search (WP6), implemented with NaCTeM.
- Use interactive visualization techniques to manage the search results, in collaboration with Dalhousie (WP7).
- Design a social media layer as an environment for interaction and collaboration on science, education, awareness and outreach, led by our colleagues at the Social Media Lab (WP8).
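To give a flavour of the n-gram idea behind WP2, here is a minimal, purely illustrative sketch: it chooses among candidate spellings of an OCR'd token by checking which one forms the most frequent bigram with the preceding word. The toy counts, the confusion table and the function names are assumptions for illustration; the actual WP2 approach using Google n-grams will be far more elaborate.

```python
from collections import Counter

# Toy bigram counts standing in for Google n-gram frequencies.
BIGRAM_COUNTS = Counter({
    ("the", "specimen"): 950,
    ("was", "collected"): 1200,
})

# Character substitutions that OCR commonly gets wrong (illustrative only).
OCR_CONFUSIONS = {"c": "e", "l": "i", "0": "o", "1": "l"}

def candidate_spellings(token):
    """The token itself plus every single-character confusion substitution."""
    yield token
    for i, ch in enumerate(token):
        if ch in OCR_CONFUSIONS:
            yield token[:i] + OCR_CONFUSIONS[ch] + token[i + 1:]

def correct(prev_word, token):
    """Keep the candidate spelling that forms the most frequent bigram."""
    return max(candidate_spellings(token),
               key=lambda cand: BIGRAM_COUNTS[(prev_word, cand)])

print(correct("the", "spccimen"))  # -> "specimen"
```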
Manchester Town Hall at Albert Square, UK
On February 17th, the first face-to-face meeting, held in Manchester, UK, marked the start of this new project. The Principal Investigators of the project, Dr. Anatoliy Gruzd from Dalhousie University (Canada) and William Ulate from the Missouri Botanical Garden (USA), met with Dr. Sophia Ananiadou at the University of Manchester’s National Centre for Text Mining (NaCTeM), where her colleagues involved in the project showcased the tools and services they have developed and will be adapting for our project.
The National Centre for Text Mining (NaCTeM)
Example of an Argo workflow that automatically extracts species and anatomical features, using entity tagging components that NaCTeM has developed for this purpose.
Several of the NaCTeM tools have been developed as modules that can be adapted and used as components in workflows: each receives input from the previous module, performs its processing task and passes the results on to the next. Named-entity recognizers, for example, receive text that has been pre-processed into smaller units (sentences, tokens) and automatically extract features that are fed to statistical models; different entity taggers are specialized in gene, chemical, anatomical, habitat or species information.
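As a rough illustration of this modular design (not Argo's actual API), the sketch below chains a sentence splitter, a tokenizer and a toy "species tagger", where each component consumes the previous one's output; the gazetteer lookup merely stands in for the statistical models a real tagger would use.

```python
import re

# Each component consumes the previous component's output and passes its
# result downstream. Names and the gazetteer are illustrative, not Argo's API.

def split_sentences(text):
    """Split text into sentences on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

# A stand-in for a trained statistical species tagger: a simple gazetteer lookup.
SPECIES_GAZETTEER = {"Quercus alba", "Puma concolor"}

def tag_species(tokens):
    """Label two-token spans that match the gazetteer as SPECIES mentions."""
    mentions = []
    for i in range(len(tokens) - 1):
        span = f"{tokens[i]} {tokens[i + 1]}"
        if span in SPECIES_GAZETTEER:
            mentions.append({"text": span, "type": "SPECIES", "start_token": i})
    return mentions

text = "Quercus alba was collected near the river. Puma concolor was observed nearby."
for sentence in split_sentences(text):
    print(tag_species(tokenize(sentence)))
```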
Another type of workflow component, in addition to the named-entity recognizers, is the linker, which automatically links names or concepts found in the text to entries in external vocabularies via unique identifiers, using a string similarity method.
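The sketch below illustrates the linking idea using Python's standard difflib string matching; the vocabulary entries and identifiers are invented for the example, and a real linker would draw on authoritative taxonomic or geographic resources.

```python
import difflib

# Toy vocabulary mapping canonical names to made-up unique identifiers.
# A real linker would use authoritative taxonomic or geographic resources.
VOCABULARY = {
    "Quercus alba": "urn:lsid:example.org:taxon:12345",
    "Quercus rubra": "urn:lsid:example.org:taxon:12346",
    "Puma concolor": "urn:lsid:example.org:taxon:67890",
}

def link(mention, cutoff=0.8):
    """Return (canonical name, identifier) for the most similar vocabulary entry."""
    matches = difflib.get_close_matches(mention, list(VOCABULARY), n=1, cutoff=cutoff)
    if not matches:
        return None
    best = matches[0]
    return best, VOCABULARY[best]

# A slightly garbled OCR mention still links to the right entry.
print(link("Quercus albu"))  # -> ('Quercus alba', 'urn:lsid:example.org:taxon:12345')
```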
Argo’s functionality allows workflows to be deployed as Web services so they can be invoked by external applications, just as BHL currently invokes name-finding web services to locate the taxa within its texts.
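As a sketch of what such an invocation might look like from an external application, the snippet below posts text to a workflow endpoint; the URL and the JSON request/response shapes are hypothetical placeholders, not Argo's or BHL's actual interfaces.

```python
import json
import urllib.request

# Hypothetical endpoint for a deployed workflow; the URL and the JSON
# request/response shapes are placeholders, not Argo's or BHL's real interfaces.
ENDPOINT = "https://example.org/workflows/species-tagger"

def call_workflow(text):
    """POST text to the (hypothetical) workflow service and return its JSON reply."""
    payload = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        ENDPOINT, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

# Would only succeed with such a service actually running:
# print(call_workflow("Quercus alba was collected near the river."))
```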
In order to develop these named-entity recognizers and linkers for the biodiversity domain, it is first necessary to identify which entity types are of interest (in our case, names of persons, places and species, among others) and the vocabularies to link each type to. To assist in this process, NaCTeM has also developed term extraction tools such as TerMine, which detects terms and acronyms in input text and can be used to build a term inventory for biodiversity. This is the initial task for our colleagues at the Missouri Botanical Garden and the Smithsonian: finding authoritative sources of terms (vocabularies, ontologies, thesauri, gazetteers, etc.) to help build the term inventory and then train the entity taggers used in our workflows.
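As a toy illustration of building such a term inventory, the sketch below counts frequent multi-word candidates while filtering out phrases that start or end with a stopword; TerMine's actual ranking, based on the C-value method, is considerably more sophisticated.

```python
import re
from collections import Counter

# Toy candidate-term extraction: count multi-word phrases that do not start
# or end with a stopword. TerMine's C-value ranking is far more refined.
STOPWORDS = {"the", "a", "an", "of", "in", "was", "were", "and", "is", "by", "between"}

def candidate_terms(text, max_len=3):
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            counts[" ".join(gram)] += 1
    return counts

text = ("The trophic relations of Quercus alba were described. "
        "Trophic relations between species vary by habitat.")
print(candidate_terms(text).most_common(3))
```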
NaCTeM has also done substantial work on event extraction, i.e., the extraction of associations or interactions between concepts or entities. This experience will help us identify and extract the types of events that scientists, historians and other scholars have long wanted to extract from the BHL corpus (such as behavior, habitat, trophic relations, geographic range and others) for our own named entities: species, people and places throughout time. Finally, NaCTeM’s vast experience in developing customized semantic search engines such as KLEIO, ISHER and the Europe PubMed Central EvidenceFinder will help provide enhanced semantic search functionality over the BHL corpus, allowing users to explore results according to multiple information dimensions or facets.
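To make the notion of an "event" concrete, the sketch below shows one way an extracted association between entities might be represented as a structured record; the field names and the example values are illustrative assumptions, not the project's actual data model.

```python
from dataclasses import dataclass

# Illustrative record for an extracted event (an association between entities);
# field names and values are assumptions, not the project's actual data model.
@dataclass
class Event:
    event_type: str        # e.g. "TROPHIC_RELATION", "HABITAT", "GEOGRAPHIC_RANGE"
    trigger: str           # word or phrase in the text that signals the event
    participants: dict     # role -> named entity mention (species, place, person)
    source: str            # where in the corpus the event was found

event = Event(
    event_type="TROPHIC_RELATION",
    trigger="preys on",
    participants={"predator": "Puma concolor", "prey": "Odocoileus virginianus"},
    source="hypothetical BHL page reference",
)
print(event)
```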
Additional information on the tools and services can be found at:
- Rak, R., Rowley, A., Black, W.J. and Ananiadou, S. (2012). Argo: an integrative, interactive, text mining-based workbench supporting curation. Database: The Journal of Biological Databases and Curation.
- Kolluru, B., Nakjang, S., Hirt, R. P., Wipat, A. and Ananiadou, S. (2011). Automatic extraction of microorganisms and their habitats from free text using text mining workflows. Journal of Integrative Bioinformatics, 8(2), 184.
For an interesting explanation of what unstructured information is, and of the terminology used in processing it, see this introduction to the UIMA 1.0 standard.
The Social Media Lab
On Monday, February 17th, 2014, as part of the third Social Media Workshop, which covered the outreach and impact work of the International Centre for Social Media Research at the University of Manchester, our group was invited to attend a talk by our project colleague, Dr. Anatoliy Gruzd. He presented the research done at the Dalhousie University Social Media Lab: how to make sense of the huge quantity of data, and the new methods for collecting information when studying online social networks through analysis and visualization.
Social Media Lab at Dalhousie University, Canada © All Rights Reserved
For our project, the staff at Dalhousie have started investigating which users and communities are currently accessing, commenting on and sharing BHL records across various social media platforms, such as Twitter and Flickr, as well as the context in which they do so. For this work they will employ tools developed by the Social Media Lab itself (such as Netlytic.org) as well as tools developed by third parties. Their goal is to add a social layer that integrates content from different biodiversity fora and social media sites with BHL via a user-friendly interface, fostering a community of users that can exploit BHL as an environment for sharing digital objects.
For more information, take a look at:
- Gruzd, A. & Haythornthwaite, C. (2013). Enabling Community through Social Media. Journal of Medical Internet Research 15(10):e248. [DOI:10.2196/jmir.2796]
And other publications by Dr. Gruzd and the Social Media Lab staff.
I hope this gives you an idea of the work ahead and a better sense of what the project attempts to do and how it aims to do it. We will keep you informed as results become available, but in the meantime, let us know: How do you envision yourself using BHL as a social digital library? What information would you like to track, and how would you like to access it? Which vocabularies would you need to see included, and what types of named entities and associations would you want tagged in the BHL corpus?
This project is made possible in part by a grant from the Institute for Museum and Library Services [Grant number LG-00-14-0032-14].