Research teams at the NESCent-EOL-BHL Research Sprint.
Photograph by Cyndy Parr.
In early February, the National Evolutionary Synthesis Center (NESCent) hosted the EOL-BHL Research Sprint. NESCent, based in Durham, NC, is a non-profit science center supporting research in the evolutionary sciences. NESCent emphasizes an interdisciplinary approach to research, so the idea behind the Sprint was to bring together teams of programmers and life scientists and expose each group to questions and ways of thinking that they might not consider in their normal work. Informaticians could bring programming and data skills to bear on questions that scientists may not have had the programming expertise to pursue effectively, using the now considerable amount of freely available data from BHL and EOL. Scientists, in turn, could pose questions about the data that programmers might not have considered. The meeting was also useful for gauging how well researchers could identify and retrieve the data they needed from the BHL text corpus. To this end, William Ulate, BHL Technical Director, and John Mignault, a member of the BHL Technical Advisory Group, attended the meeting.
The teams covered a wide variety of interesting topics, from studying the color of butterflies by extracting color information from images to studying changes in ontologies over time through an analysis of the text in the BHL corpus (see http://bit.ly/1dnnhG0). Over the course of the sprint, the teams began mining EOL and BHL for their data sets and started preliminary analyses. At the end of each day, the groups met to share experiences and progress. By the end of the sprint, each team was sharing plans for further collaboration and for completing its analyses. Plans for publications and grant proposals based on sprint ideas were also discussed. In an open, collaborative spirit, members shared their materials freely via Google Drive.
We learned some interesting things about the way people approach the BHL data set. On the first day, many of the teams wanted to use the BHL application programming interface (API) for bulk data retrieval. Several team members asked us how they could download “all of the text.” When we told them that this was impractical and would result in a great deal of unwanted data, they asked how they could retrieve data based on, for example, taxa: “I want to harvest all pages with names from this taxon (Chordata) or this common name (Vertebrate).” Others wanted data restricted by location. We tried to assist them with their specific needs rather than with their initial request for the whole data set (see http://bit.ly/1rvbut3). This raised useful questions about how we can provide the data researchers need in the ways they need it. Should we offer ways to request bulk data downloads based on a specific set of criteria? Should we alter the API (http://www.biodiversitylibrary.org/api2/docs/docs.html) to make it possible to retrieve more closely focused data sets? As BHL becomes better known as a source of “Big Data” for the biodiversity community, we will need to evolve our access to that data to better meet the needs of our users.
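To make the idea of a taxon-focused request concrete, here is a minimal Python sketch that only builds a query URL and does not contact the server. It rests on assumptions that should be checked against the API documentation linked above: the endpoint path `api2/httpquery.ashx`, the operation name `NameSearch`, and the `name`, `format`, and `apikey` parameters are assumed here, and `build_name_query` is a hypothetical helper, not part of any BHL library.

```python
from urllib.parse import urlencode

# Assumed endpoint for the version 2 API; verify against the API docs.
BHL_API_ENDPOINT = "http://www.biodiversitylibrary.org/api2/httpquery.ashx"


def build_name_query(name, api_key, op="NameSearch", fmt="json"):
    """Build a URL asking the API about one scientific name.

    Hypothetical helper: the operation and parameter names are
    assumptions for illustration, not a confirmed API contract.
    """
    params = {
        "op": op,           # assumed operation name
        "name": name,       # taxon name to look up, e.g. "Chordata"
        "format": fmt,      # assumed response format switch
        "apikey": api_key,  # placeholder for a personal API key
    }
    # urlencode percent-escapes values, so names with spaces are safe.
    return BHL_API_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_name_query("Chordata", "YOUR_API_KEY"))
```

A request like this would return matches for a single name; harvesting every page under a higher taxon such as Chordata would still require expanding the taxon into its member names first, which is the gap the questions above point to.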
We were also surprised to discover the popularity of the R statistical programming language among scientists. Many team members used R in their work, to such an extent that a short R group discussion was scheduled for one morning during the meeting. Scott Chamberlain of Simon Fraser University has created an R interface to the BHL API, available at http://bit.ly/1oAFKjI. It is always good to see BHL and its data used in new and interesting ways. Follow further results from this Sprint at http://blog.eol.org.
The Sprint was a valuable meeting for BHL: it exposed our data to more scientists and informaticians, and it gave BHL staff useful feedback on the uses of the BHL data corpus and its value to researchers. We would like to thank EOL, NESCent, and the Richard Lounsbery Foundation for the opportunity and for their collaboration in making this event a success.