Thursday, March 27, 2014

BHL and EOL team up for NESCent Research Sprint

Research teams at the NESCent-EOL-BHL Research Sprint.
Photograph by Cyndy Parr.

In early February, the National Evolutionary Synthesis Center (NESCent) hosted the EOL-BHL Research Sprint. NESCent, based in Durham, NC, is a non-profit science center supporting research in the evolutionary sciences. NESCent emphasizes an interdisciplinary approach to research, so the idea behind the Sprint was to put together teams of programmers and life scientists and expose each group to questions and ways of thinking that they might not consider in their normal work. Informaticians could bring programming and data skills to bear on questions that scientists may not have had the programming expertise to tackle effectively, using the now considerable amount of freely available data from BHL and EOL. Scientists, in turn, could pose questions about the data that programmers might not have considered. The meeting was also useful for finding out how well researchers could identify and retrieve the data they needed from the BHL text corpus. To this end, William Ulate, BHL Technical Director, and John Mignault, a member of the BHL Technical Advisory Group, attended the meeting.

The teams covered a wide variety of interesting topics, from studying the color of butterflies by extracting color information from images to studying changes in ontologies over time based on an analysis of the text in the BHL corpus (see http://bit.ly/1dnnhG0). Over the course of the sprint, the teams began mining EOL and BHL for their data sets and started preliminary analyses. At the end of each day, the groups met to share experiences and progress. By the end of the sprint, each of the teams was sharing plans for further collaboration and for completing its analyses. Plans for publications and grant proposals based on sprint ideas were also discussed. In an open, collaborative spirit, members shared their materials freely via Google Drive.

We learned some interesting things about the way people approach the BHL data set. On the first day, many of the teams wanted to use the BHL application programming interface (API) for bulk data retrieval. Several team members asked us how they could download "all of the text." When we told them that this was impractical and would result in a great deal of unwanted data, they asked how they could retrieve data based on, for example, taxa: "I want to harvest all pages with names from this taxon (Chordata) or this common name (Vertebrate)." Others wanted data restricted by location. We tried to assist them based on their specific needs rather than their initial request for the whole data set (see http://bit.ly/1rvbut3). This raised useful questions about how we can provide researchers with the data they need in the ways they need it. Should we offer ways to request bulk data downloads based on a specific set of criteria? Should we alter the API (http://www.biodiversitylibrary.org/api2/docs/docs.html) to make it possible to retrieve more closely focused data sets? As BHL becomes better known as a source of "Big Data" for the biodiversity community, we will need to evolve our access to that data to better meet the needs of our users.
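One way to picture a criteria-based request is as a filter over page records that already carry the names and places found on them. The sketch below is illustrative only: the `Page` record, its fields and `select_pages` are hypothetical, and none of them is part of the BHL API.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical page record; not an actual BHL data structure."""
    page_id: int
    names: set = field(default_factory=set)      # scientific names found on the page
    locations: set = field(default_factory=set)  # place names found on the page


def select_pages(pages, names=None, locations=None):
    """Return pages matching every criterion that was supplied.

    A criterion matches when the page shares at least one value with it.
    Criteria left as None are ignored.
    """
    selected = []
    for page in pages:
        if names is not None and not (page.names & set(names)):
            continue
        if locations is not None and not (page.locations & set(locations)):
            continue
        selected.append(page)
    return selected


# Made-up sample data for illustration.
corpus = [
    Page(1, {"Danaus plexippus"}, {"Mexico"}),
    Page(2, {"Poa annua"}, {"Kenya"}),
    Page(3, {"Danaus plexippus", "Poa annua"}, {"Kenya"}),
]

subset = select_pages(corpus, names=["Danaus plexippus"], locations=["Kenya"])
print([page.page_id for page in subset])
```

Harvesting by a higher taxon such as Chordata would additionally require expanding that taxon into its member names before filtering, which this sketch does not attempt.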

We were also surprised to discover the popularity of the R statistical programming language among scientists. Many team members used R in their work, to such an extent that a short R group discussion was scheduled for one morning during the meeting. Scott Chamberlain of Simon Fraser University has created an R interface to the BHL API, available at http://bit.ly/1oAFKjI. It is always good to see BHL and its data used in new and interesting ways. Follow further results from this Sprint at: http://blog.eol.org.

The Sprint was a valuable meeting for BHL: it exposed our data to more scientists and informaticians, and it gave BHL staff useful feedback on the uses of the BHL data corpus and its value to researchers. We would like to thank EOL, NESCent and the Richard Lounsbery Foundation for the opportunity and for their collaboration in making this event a success.
     

Thursday, March 20, 2014

First Meeting of the Mining Biodiversity project

Meet our international partners to extract data from BHL 


Mining Biodiversity (MiBio project) is one of the projects that won during the third round of the transatlantic Digging Into Data Challenge, a competition aiming to promote the development of innovative computational techniques that can be applied to big data in the humanities and social sciences. The project is an international collaboration between the National Centre for Text Mining (UK), Missouri Botanical Garden (US) and Dalhousie University’s Big Data Analytics Institute (Canada) and Social Media Lab (Canada), along with colleagues from the Encyclopedia of Life and the Smithsonian Institution.

We will integrate novel text mining methods, visualization, crowdsourcing and social media into the BHL to provide a semantic search system that allows users to explore search results according to multiple information dimensions or facets.  The goal is to transform BHL into a next-generation social digital library resource that facilitates the study and discussion (via social media integration) of legacy science documents on biodiversity by a worldwide community.
Relations between the Work Packages of the project

The project has five major components, covered in 9 Work Packages (WP):

  1. Automatic correction of errors in OCR using Google n-grams by our colleagues of the Big Data Analytics Institute (WP2).
  2. Crowdsourcing the annotation of semantic metadata (concepts and events) in legacy texts (WP5).
  3. Automatic extraction of metadata (terms, concepts and significant events) and tracking of their change over time (WP3 & WP4) to facilitate semantic search (WP6), implemented with NaCTeM.
  4. Interactive visualization techniques to manage the search results, in collaboration with Dalhousie (WP7).
  5. A social media layer designed as an environment for interaction and collaboration on science, education, awareness and outreach, led by our colleagues of the Social Media Lab (WP8).
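The idea behind the first component can be illustrated with a toy example. The sketch below is not the project's method: a tiny hand-made table of word counts stands in for the Google n-gram data, only single-letter substitutions are considered, and all names and counts are made up for illustration.

```python
import string

# Made-up word counts standing in for a real n-gram frequency table.
WORD_COUNTS = {"species": 900, "genus": 500, "the": 5000, "tho": 2}


def substitutions(word):
    """Yield every string that differs from `word` by one letter substitution."""
    for i in range(len(word)):
        for letter in string.ascii_lowercase:
            if letter != word[i]:
                yield word[:i] + letter + word[i + 1:]


def correct(word, counts=WORD_COUNTS):
    """Pick the most frequent form among the word and its one-letter variants.

    Unknown words with no better-attested variant are returned unchanged.
    """
    best, best_count = word, counts.get(word, 0)
    for candidate in substitutions(word):
        count = counts.get(candidate, 0)
        if count > best_count:
            best, best_count = candidate, count
    return best


ocr_text = "tho spectes of this genus"
print(" ".join(correct(token) for token in ocr_text.split()))
```

A realistic corrector would also weigh the surrounding words (bigrams and longer n-grams) and handle inserted or dropped characters; the sketch only shows the frequency-based choice among candidates.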

Manchester Town Hall
at Albert Square, UK
On February 17th, the first face-to-face meeting in Manchester, UK, marked the start of this new project. The Principal Investigators of the project, Dr. Anatoliy Gruzd from Dalhousie University (Canada) and William Ulate from Missouri Botanical Garden (USA), met with Dr. Sophia Ananiadou at the University of Manchester's National Centre for Text Mining (NaCTeM), where her colleagues involved in the project showcased the tools and services they have developed and will be adapting for our project.

The National Centre for Text Mining (NaCTeM)

NaCTeM has developed text mining services based on a number of generic natural language processing tools, such as Argo, their Web-based workflow construction platform for text mining, which is implemented on top of the OASIS Unstructured Information Management Architecture (UIMA) standard for interoperability among information processing components.

Example of an Argo workflow that automatically extracts
species and anatomical features using entity tagging
components that NaCTeM has developed for this purpose.
Several of the NaCTeM tools have been developed as modules that can be adapted and used as components in workflows, receiving input from the previous module, processing or performing a task and passing the results to another module.  In the case of named-entity recognizers, these receive text pre-processed into smaller units (sentences, tokens) and extract features automatically according to statistical models used by different entity taggers (specialized in gene, chemical, anatomical, habitat or species information, for example).
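The module chaining described above can be sketched as a list of functions, each taking the previous module's output as its input. This is a simplified illustration, not Argo or UIMA code: the function names are hypothetical, and a small dictionary lookup stands in for the statistical models that real entity taggers use.

```python
import re

# Tiny lookup table standing in for a trained statistical tagger.
ENTITY_LEXICON = {"quercus": "species", "leaf": "anatomy", "forest": "habitat"}


def split_sentences(text):
    """Split raw text into sentences after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentences):
    """Turn each sentence into a list of alphabetic word tokens."""
    return [re.findall(r"[A-Za-z]+", sentence) for sentence in sentences]


def tag_entities(tokenized):
    """Return (token, entity_type) pairs for tokens found in the lexicon."""
    return [
        (token, ENTITY_LEXICON[token.lower()])
        for tokens in tokenized
        for token in tokens
        if token.lower() in ENTITY_LEXICON
    ]


def run_workflow(data, modules):
    """Pass the data through each module in order."""
    for module in modules:
        data = module(data)
    return data


workflow = [split_sentences, tokenize, tag_entities]
text = "Quercus grows in the forest. Each leaf is lobed."
print(run_workflow(text, workflow))
```

Because each module only depends on the shape of its input, a different tagger could be swapped into the list without touching the sentence splitter or tokenizer.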

Besides named entity recognizers, the workflows can include another type of component, the linkers, which automatically link names or concepts found in text to entries in external vocabularies via unique identifiers, using a string similarity method.
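As a rough illustration of linking by string similarity, the sketch below matches a name found in text against a small vocabulary and returns the identifier of the closest entry. It is an assumption-laden example rather than NaCTeM's linker: the identifiers, the vocabulary and the 0.85 threshold are made up, and Python's standard `difflib.SequenceMatcher` stands in for whatever similarity measure the real tools use.

```python
from difflib import SequenceMatcher

# Hypothetical vocabulary: identifier -> preferred name.
VOCABULARY = {
    "TAXON:001": "Danaus plexippus",
    "TAXON:002": "Poa annua",
    "TAXON:003": "Quercus alba",
}


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link(mention, vocabulary=VOCABULARY, threshold=0.85):
    """Return the identifier of the closest vocabulary entry, or None.

    A mention is linked only if its best match reaches the threshold.
    """
    best_id, best_score = None, 0.0
    for identifier, name in vocabulary.items():
        score = similarity(mention, name)
        if score > best_score:
            best_id, best_score = identifier, score
    return best_id if best_score >= threshold else None


print(link("Danaus plexipus"))  # OCR-style misspelling of a known name
print(link("Homo sapiens"))     # name absent from the vocabulary
```

The threshold trades recall for precision: lowering it links more misspelled names but also risks attaching a mention to the wrong entry.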

Argo's functionality allows workflows to be deployed as a Web service so they can be invoked by external applications, just as BHL currently invokes name-finding web services to find the taxa within the text.

At NaCTeM commenting on the tools for the project.
L to R:  Mr. John McNaught, Dr. Anatoliy Gruzd,
Dr. Sophia Ananiadou, Ms. Riza Batista-Navarro,
Mr. Georgios  Kontonatsios and Mr. Paul Thompson.
Missing from the photo: Dr. Rafal Rak,
Claudiu Mihăilă, and Dr. Ioannis Korkontzelos.
In order to develop these named entity recognizers and linkers for the biodiversity domain, it is first necessary to identify which entity types are of interest (in our case, these could be names of persons, places and species, among others) and the vocabularies to link to for each type. To assist in this process, NaCTeM has also developed term extraction tools like TerMine. TerMine detects terms and acronyms in input text and can be used to build a term inventory for biodiversity. This will be the initial task for our colleagues at Missouri Botanical Garden and the Smithsonian: finding authoritative sources of terms (vocabularies, ontologies, thesauri, gazetteers, etc.) to help build the term inventory and then train the entity taggers to be used in our workflows.

NaCTeM has also done substantial work on event extraction, i.e., the extraction of associations or interactions between concepts or entities. This experience will help us identify and extract the types of events that scientists, historians and other scholars have long wanted to extract from the BHL corpus (such as behavior, habitat, trophic relations, geographic range and others) for our own named entities: species, people and places throughout time. Finally, NaCTeM's vast experience developing customized semantic search engines like KLEIO, ISHER and Europe PubMed Central EvidenceFinder will help us provide enhanced semantic search functionality over the BHL corpus text, allowing users to explore results according to multiple information dimensions or facets.

Additional information on the tools and services can be found at:
For some interesting explanation of what Unstructured Information is and the terminology of the process around it, look at this nice introduction of the UIMA 1.0 Standard.

Or read more details about these or some of the other service systems and tools that NaCTeM has developed.

The Social Media Lab

On Monday, February 17th, 2014, our group was invited to attend a talk by our colleague in the project, Dr. Anatoliy Gruzd. The talk was part of the third Social Media Workshop, which covered the outreach and impact aspects of the International Centre for Social Media Research at Manchester University. Dr. Gruzd presented the research done at the Dalhousie University Social Media Lab, including how to make sense of huge quantities of data and new methods for collecting information when studying online social networks through analysis and visualization.

Social Media Lab at
Dalhousie University, Canada
For our project, the staff at Dalhousie have started investigating which users and communities are currently accessing, commenting on and sharing records from BHL across various social media platforms, such as Twitter and Flickr, and the context in which they do so. For this work they will employ tools developed by the Social Media Lab (such as Netlytic.org) as well as tools developed by third parties. Their goal is to add a social layer that integrates content from different biodiversity fora and social media sites with BHL via a user-friendly interface, fostering a community of users that could exploit BHL as an environment for sharing digital objects.

For more information, take a look at:
And some of the other publications by Dr. Gruzd and the Social Media Lab staff.

I hope this gives an idea of the work ahead and a better sense of what the project attempts to do and how it aims to do it. We will keep you informed as results become available, but in the meantime, let us know: how do you envision yourself using BHL as a social digital library? What information would you like to track, and how would you like to access it? Which vocabularies would you need to see included, and what types of named entities and associations would you want to be tagged in the BHL corpus?

William Ulate
BHL Technical Director
Missouri Botanical Garden

This project is made possible in part by a grant from the Institute of Museum and Library Services [Grant number LG-00-14-0032-14].

Tuesday, March 18, 2014

2014 Annual BHL meeting held in New York City, March 10-11, 2014

BHL members and affiliates met in New York City for the 2014 Annual Meeting (10-11 March 2014). The annual meeting is a chance for the leaders of BHL members and affiliates to learn what is happening around BHL and to give updates from their own institutions.

This year, the meeting was held jointly by the New York Botanical Garden and the American Museum of Natural History. The first day of meetings was hosted by Susan Fraser, Director of the LuEsther T. Mertz Library of the New York Botanical Garden. The morning session of the meeting included the 2014 BHL Program Director's Report by Martin R. Kalfatovic; an update on user engagement from Carolyn Sheffield (BHL Program Manager); an overview of BHL technical activities from William Ulate (BHL Technical Director); and a report on the recent Global BHL meetings and membership committee report by Vice-Chair Connie Rinaldo. Bob Corrigan, Director of Operations for the Encyclopedia of Life (EOL), also joined the meeting to give an update on EOL activities. Gregory Long, President and CEO of the New York Botanical Garden, welcomed the BHL members.

Attending the meeting were representatives from fifteen of the sixteen BHL members, including the three most recent members: Washington University in St. Louis; the National Library Board, Singapore (BHL Singapore); and the University of Illinois, Urbana-Champaign. Our newest affiliate, the Natural History Museum, Los Angeles County, also attended.

The business portion of the meeting took place the following day at the American Museum of Natural History, hosted by Tom Baione, the Harold Boeschenstein Director of the AMNH Research Library. Tom also gave the group a tour of the Natural Histories: Exploring Rare Books and Scientific Illustration exhibition, based on his book of the same title.
Pictured above are the meeting attendees:
Front Row, left to right: Susan Fraser (NYBG), Chris Mills (Kew), Christine Giannoni (Field Museum), Tomoko Steen (Library of Congress), Eric Chin (BHL Singapore), Nancy Gwinn (Smithsonian Libraries), Cathy Buckwalter (ANSP), Judy Warnement (Harvard Botany Libraries).
Second Row, left to right: Tom Baione (AMNH), Marty Schlabach (Cornell), Connie Rinaldo (Harvard/Museum of Comparative Zoology), Carolyn Sheffield (BHL Program Manager), Diane Reilinger (MBL/WHOI), Richard Hulser (NHMLAC).
Third Row, left to right: Kelli Trei (UIUC), Doug Holland (Missouri Botanical Garden), Chris Freeland (Washington University).
Back Row, left to right: William Ulate (BHL Technical Director), Martin R. Kalfatovic (BHL Program Director). 
NOT PICTURED: Jane Smith (Natural History Museum, London). Photograph taken at the Enid A. Haupt Conservatory, New York Botanical Garden.

Thursday, March 6, 2014

5th Global BHL Meeting, Lorne, Australia


Representatives from BHL-Global nodes at the
5th Global BHL Meeting 
The 5th Global Biodiversity Heritage Library Meeting was held in Lorne, Australia, February 1-2, 2014.   Representatives from each of BHL’s global nodes, with the exception of BHL Egypt, convened to discuss the status of current goals, the formation of new goals, and to work together in forming the overall direction of BHL Global.  The meeting consisted of reports from the global nodes, the election of officers, and discussion of bylaws, technical issues and goals.

The first day of the meeting consisted of presentations delivered by representatives from BHL Central and the Global Nodes.

BHL Central
Kicking off the presentations, Martin Kalfatovic, BHL Program Director, reported on BHL Central’s continued growth. BHL Central now comprises 15 dues-paying member institutions, with a collection of over 42 million pages and usage statistics that include over 3 million visitors since BHL launched in 2007. In other news, the latest version of the Macaw software developed at the Smithsonian Libraries is now being tested at Harvard, New York Botanical Garden and the California Academy of Sciences, with the University of Pretoria also due to begin testing soon. With this release, users can now upload to a cloud server, after which the files go to the Internet Archive and then the BHL portal.

BHL Africa
Anne-Lise Fourie, Principal Librarian at the South African National Biodiversity Institute (SANBI), shared the good news that two more institutions in Kenya have joined BHL. In South Africa, institutions are sending digitized content to the University of Pretoria for quality assurance. To date, the Steering Committee has met twice, with the possibility of more frequent meetings of the regional representatives to help build and maintain momentum.

Grants and Social Media 
Connie Rinaldo, Vice Chair of the BHL Executive Committee and Librarian of the Ernst Mayr Library, Harvard Museum of Comparative Zoology, reported on BHL’s grant-funded projects and on the status of BHL Central’s social media efforts. BHL currently has four active grants, two of which are about to wrap up and two of which have just kicked off. Connecting Content, an IMLS grant led by the California Academy of Sciences Library, is linking field notes, specimens, and published literature. Connie demonstrated MCZ-Harvard’s contributions with the William Brewster collection. Concurrently, the Art of Life is exploring automated ways of locating illustrations in natural history literature and providing metadata for them. Led by the Missouri Botanical Garden, this NEH grant will broaden and engage the BHL audience by integrating tagging applications so users can edit descriptive metadata, and by integrating that user-generated metadata to enhance access to illustrations. The two new grants (Purposeful Gaming and the BHL, and Digging Into Data) are both funded by IMLS and led by the Missouri Botanical Garden. Purposeful Gaming and the BHL will develop a game to crowdsource OCR corrections for seed catalogs and transcriptions of field notes. Digging Into Data will explore new methods for the integration of text mining, visualization, crowdsourcing and social media to enhance use of BHL content.

Social media has been a strong component of the BHL outreach strategy and in 2013 over 36,000 visits to the BHL website came from social media platforms (out of a total of 1.4 million visits).  With recent staff departures, BHL's social media presence is shifting to maintenance mode and we've seen a corresponding decrease in traffic.  A discussion ensued about how best to tailor outreach efforts for maximum impact with existing resources, including recent efforts in the education domain such as BHL Europe’s Historian app for teachers and BHL Africa’s push to teach younger students about the environment.

Encyclopedia of Life 
Nancy Gwinn, Chair of the BHL Executive Committee and Director, Smithsonian Libraries, presented on the recent EOL meetings in Canberra.  The meetings included demonstrations for new tool suites including the recently released Traitbank, which provides the capability to assemble similar traits from across species for comparisons.

Biodiversity Library Exhibitions 
Jiři Frank, Vice-Chair of BHL-Global, reported on the current status of the BHL-E exhibition software and the group discussed the idea of having a "Treasures of the Global BHL" online exhibition.  We’re very pleased that Connie Rinaldo and Jiři have both already graciously volunteered their time for coordination and training, respectively.

Following the presentations and going into the second day, attendees moved to setting the direction for Global BHL for the coming year. The Global BHL Officers (Ely Wallis, Jiři Frank and Nancy Gwinn) were each re-elected to two-year terms in the offices of Chair, Vice-Chair and Secretary, respectively. One of the first tasks that the re-elected executive committee will take on is the review of the bylaws.

Action items for the global nodes were also identified and will help guide collection and technical development, outreach efforts, and overall growth for the BHL global nodes in 2014.  BHL Central will work with the global nodes on creating new collections of content for inclusion in the BHL Portal and on continued development of Macaw.  Based on their extensive experience and thanks to their existing resources, BHL Europe will develop a marketing plan for others to use as a model.  BHL Australia will coordinate the collection of input from API users to help inform new features and improvements.  Finally, all nodes have agreed to work together on recruiting new nodes to ensure representation of all continents.

All told, it was a very successful meeting with inspiring updates from all and some exciting new directions for BHL-Global.  We're looking forward to working with our colleagues across the globe on completing the tasks we have set out to accomplish and working towards an ever-growing and adaptive BHL!