BHL Participates in the Global Names Workshop


Participants at the Global Names Project workshop discuss progress in a morning “stand up” briefing. Photo by Deborah Paul (iDigBio).

The Global Names Project held a workshop on 17-19 June 2019 on the campus of the University of Illinois at Urbana-Champaign. The workshop was titled “Scientific names indexing and data mobilization of Biodiversity Heritage Library using tools from Global Names project” and was hosted by the Species File Group at the Illinois Natural History Survey. Eighteen people attended, representing a variety of organizations interested in BHL content: Global Names Architecture, iDigBio, TaxonWorks, UIUC Species File Group, the Illinois Library, Encyclopedia of Life, the DINA Project, the Catalogue of Life, GBIF, Species File Group Argentina, the HathiTrust Research Center, and Global Biotic Interactions.

The workshop was organized as an unconference/hackathon, in which the agenda is planned by all participants at the meeting itself. We each initially proposed topics we were individually interested in exploring; these were our “selfish goals”. In an exercise at the workshop, those goals were grouped into similar or related topics. The most popular topics (see the sticky notes on the wall in the background of the photo) became the focus of “pitches”, i.e. challenges that we could address at the workshop. We self-organized into working groups, each under the banner of a pitch, and got to work.

Note that at a hackathon, the goal is that you are always either “doing or learning.” For example, some of us learned how to mine BHL content using the Developer and Data Tools. And if you’d like to try it, you too can install and use the gnfinder and gnparser tools. The gnparser tool breaks scientific name-strings into their semantic elements, while gnfinder searches text (such as OCR output) for names.
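To give a feel for what “breaking a name-string into its semantic elements” means, here is a minimal Python sketch. It is a toy and does not use gnparser or reproduce its grammar: the pattern, the `parse_name` function, and the element names (`genus`, `species`, `authorship`, `year`) are assumptions made for illustration, and it only copes with the simplest name-strings.

```python
import re

# Toy pattern (an assumption for illustration, NOT gnparser's grammar):
# a capitalized genus, then an optional lowercase species epithet,
# an optional authorship, and an optional four-digit year.
NAME_PATTERN = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)"
    r"(?:\s+(?P<species>[a-z]+))?"
    r"(?:\s+(?P<authorship>[A-Z][A-Za-z.\s&]*?))?"
    r"(?:,?\s*(?P<year>\d{4}))?$"
)


def parse_name(name_string):
    """Split a simple name-string into labeled elements, or return None."""
    match = NAME_PATTERN.match(name_string.strip())
    if match is None:
        return None
    # Keep only the elements that were actually present in the string.
    return {key: value for key, value in match.groupdict().items() if value is not None}


if __name__ == "__main__":
    print(parse_name("Homo sapiens Linnaeus, 1758"))
    print(parse_name("Homo sapiens"))
```

Real name-strings are far messier (hybrids, infraspecific ranks, abbreviated authors, OCR errors), which is why a dedicated parser is worth using rather than a single pattern like this one.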

Overall, the activities of the workshop centered on improving the information that we can extract from the OCR (optical character recognition) content generated from the page images in BHL, including improving that OCR content itself.

One group focused on attempting to find species identification keys in BHL. Using a versioned, citable, and verifiable snapshot of the BHL OCR text corpus [1], the group discovered that the variety of ways in which a species identification key is labeled in the text, combined with the natural inaccuracies of OCR, makes the task of identifying a heading for a key challenging [2].
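The difficulty can be illustrated with a small Python sketch that compares OCR lines against a few heading variants using approximate string matching. This is not the group's method: the heading list, the `looks_like_key_heading` function, and the 0.85 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical heading variants; the labels the group actually
# encountered are not listed in this post.
KEY_HEADINGS = [
    "key to the species",
    "key to species",
    "key to the genera",
]


def looks_like_key_heading(line, threshold=0.85):
    """Return True if an OCR line is approximately one of the known headings."""
    # Normalize case and whitespace before comparing.
    candidate = " ".join(line.lower().split())
    return any(
        SequenceMatcher(None, candidate, heading).ratio() >= threshold
        for heading in KEY_HEADINGS
    )


if __name__ == "__main__":
    for line in ["KEY TO THE SPECIES", "Kev to the Specics", "The species was collected in June"]:
        print(line, "->", looks_like_key_heading(line))
```

A fixed list with a similarity threshold tolerates a few misread characters, but it cannot anticipate every way a key is labeled, and lowering the threshold to catch more variants also lets in more ordinary sentences.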

Another group worked on connecting the APIs of TaxonWorks, Global Names, and BHL. Their goal was to integrate information and resources from all three in a single interface that highlighted the BHL pages that species were originally described on. This group managed to wrap all three APIs in a single place (a “Task” in TaxonWorks), but problems with matching citation data across platforms prevented them from truly “closing the loop”.

Finally, the largest group focused on extracting different entities from the OCR content of the BHL, for example geographic names, personal names, and organizations. This group experimented with a variety of natural language techniques and tools, including the Edinburgh Geoparser, IBM Watson, Microsoft Azure Cognitive Services, and LingPipe, and identified some additional challenges to extracting such entities from BHL. Not surprisingly, there is some overlap between place names and taxon names. For example, “St. Lucia” can be conflated with the genus “Lucia” (a type of butterfly), which certainly adds a hurdle for accurate entity identification.
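The “St. Lucia” versus “Lucia” overlap can be sketched in a few lines of Python. The example below is only an illustration of one possible heuristic, not something the group built: the two tiny lexicons and the function names are hypothetical, and it simply ignores a genus match that sits inside a longer place-name match.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
PLACE_NAMES = ["St. Lucia"]
GENUS_NAMES = ["Lucia"]


def find_spans(text, terms):
    """Return (start, end, term) for every literal occurrence of each term."""
    spans = []
    for term in terms:
        for match in re.finditer(re.escape(term), text):
            spans.append((match.start(), match.end(), term))
    return spans


def genus_hits_outside_places(text):
    """Return (start, term) for genus matches not contained in a place-name match."""
    place_spans = find_spans(text, PLACE_NAMES)
    hits = []
    for start, end, term in find_spans(text, GENUS_NAMES):
        inside_place = any(
            place_start <= start and end <= place_end
            for place_start, place_end, _ in place_spans
        )
        if not inside_place:
            hits.append((start, term))
    return hits


if __name__ == "__main__":
    sample = "Collected on St. Lucia. The genus Lucia was also recorded."
    print(genus_hits_outside_places(sample))
```

Preferring the longer match handles this one case, but it depends entirely on the lexicons being complete and on the OCR preserving “St.” intact, so it should be read as a demonstration of the problem rather than a solution.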

The results of the workshop are being integrated into a wiki that contains our initial goals and invites other stakeholders to get involved. One direct outcome of the workshop is that BHL will move to provide quarterly exports of the OCR, available for anyone to mine and experiment with. Previously, this content was not easily downloadable. The workshop discussions and hacking drove home the point that this corpus is a key element for future developments. Many broader topics were also raised throughout the meeting. In particular, we explored the idea of opening a worldwide biodiversity informatics channel, possibly on Slack, to facilitate communication and the sharing of ideas among interested parties in real time.

Many thanks to the Global Names Project and the Illinois Natural History Survey for hosting, and especially to Dima Mozzherin for all of his work on the Global Names name-finding algorithm, which has opened the door to moving BHL’s content into the next decade.

References

[1] Poelen, Jorrit H. (2019). A biodiversity dataset graph: Biodiversity Heritage Library (BHL) (Version 0.0.1) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3251134

[2] Poelen, Jorrit H., Schulz, Katja, Trei, Kelli J., & Rees, Jonathan A. (2019, July 10). Finding Identification of Keys in the Biodiversity Heritage Library (Version 1.1). Zenodo. http://doi.org/10.5281/zenodo.3311815

Joel Richard is the head of the Web and IT department for the Smithsonian Libraries and Archives, and the Technical Coordinator for the Biodiversity Heritage Library. Joel is also the creator and developer of the Macaw software used by BHL partners to add content to BHL.

Matt Yoder is a Biological Informatician at the Illinois Natural History Survey at the University of Illinois at Urbana-Champaign.

Debbie Paul is the Digitization and Workforce Development Manager at iDigBio.

Jorrit Poelen, an independent software engineer, lives and works in Oakland, where he uses frugal methods to link and preserve biodiversity data.

Mike Lichtenberg is the Lead Developer for the Biodiversity Heritage Library.