BHL Improves the Speed and Accuracy of its Taxonomic Name Finding Services with gnfinder

BHL has deployed a new taxonomic name finding tool to improve the speed and accuracy of identifying names throughout its 58+ million pages.

BHL is now using Global Names Architecture’s (GNA) gnfinder tool to locate taxonomic names in the BHL corpus. Prior to this deployment, BHL’s name finding services were based on an index of scientific names created by GNA developers six years ago by parsing every page in BHL one by one. This took 45 days to accomplish, and the cost of repeating this process made updating or improving the index infeasible.

The gnfinder tool is built with fast, scalable programming languages to significantly reduce computational time. Using open-source applications written in Go and Scala, the tool detects candidate scientific names and verifies them against millions of scientific name-strings aggregated by GNA. The new process decreases the time needed for name detection from 35 days to 5 hours and the time needed for name verification from 7 days to 12 hours. As a result, the entire BHL corpus can now be indexed in less than a day, compared to the 45 days needed for the previous index. This reduction in computational time also makes iterative improvements to the index achievable.
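The detect-then-verify flow described above can be pictured with a toy sketch. This is not gnfinder's code or its actual detection method: the regular expression, the `KNOWN_NAMES` set, and the function names are all hypothetical stand-ins chosen for illustration, and the real tool checks candidates against millions of name-strings rather than three.

```python
import re

# Hypothetical stand-in for the name-strings aggregated by GNA.
KNOWN_NAMES = {"Homo sapiens", "Quercus alba", "Panthera leo"}

# Naive candidate pattern (an assumption for this sketch): a capitalized
# word followed by a lowercase word of at least three letters.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")


def detect_candidates(text):
    """Return candidate two-word names in order of appearance."""
    return [f"{genus} {epithet}" for genus, epithet in BINOMIAL.findall(text)]


def verify(candidates, known=KNOWN_NAMES):
    """Split candidates into those found in the known set and the rest."""
    verified = [c for c in candidates if c in known]
    unverified = [c for c in candidates if c not in known]
    return verified, unverified


page = "The white oak, Quercus alba, shelters many birds. Panthera leo is absent here."
candidates = detect_candidates(page)
verified, unverified = verify(candidates)

print(candidates)  # ['The white', 'Quercus alba', 'Panthera leo']
print(verified)    # ['Quercus alba', 'Panthera leo']
print(unverified)  # ['The white']
```

In this sketch the naive pattern picks up "The white" as a candidate, and the verification step is what filters it out. That illustrates, in miniature, why checking candidates against a list of known names helps reduce false positives.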

The accuracy of the names identified has also been improved with this deployment. By eliminating questionable results and false positives from the previous index, gnfinder produces a more accurate index of names in BHL. As of 21 July 2020, more than 34 million unique names, representing more than 239 million total instances of taxonomic name-strings, had been identified across the BHL corpus. Of these, approximately 11.7 million are “Verified Names”, meaning they are unique names that have been resolved against a name authority (NameBank, Catalogue of Life, etc.).

The gnfinder tool was developed by Dmitry Mozzherin and Alexander Myltsev as part of GNA project work at the University of Illinois at Urbana-Champaign. Mozzherin shared more about the process of developing this tool at the Biodiversity Next conference in Leiden, the Netherlands, in 2019. Learn more in the presentation slides.

You can learn more about how the BHL implementation of the gnfinder tool works in our FAQ.

We would like to thank our colleagues at Global Names Architecture — especially Dmitry Mozzherin, Alexander Myltsev, and David Patterson — for their work to develop these tools. Thanks also to Joel Richard (BHL Technical Coordinator and Head of Web Services and IT at Smithsonian Libraries) and Mike Lichtenberg (BHL Lead Developer) for their work to deploy gnfinder on the BHL website.

If you have questions about gnfinder or would like to provide feedback or suggestions, please contact Global Names Architecture via the Global Names BHL project on GitHub.

Global Names development on BHL indexing is supported by National Science Foundation grants #1356347 and #1645959 as well as the Species File Group at the University of Illinois.

Written by

Grace Costantino served as the Outreach and Communication Manager for the Biodiversity Heritage Library from 2014 to 2021. In this capacity, she developed and managed BHL's communication strategy, oversaw social media initiatives, and engaged with the public to excite audiences about the wealth of biodiversity heritage available in BHL. Prior to her role as Outreach and Communication Manager, Grace served as the Digital Collections Librarian for Smithsonian Libraries and as the Program Manager for BHL.