BHL Technical Development: Year in Review

In 2023, BHL’s Technical Team dedicated significant efforts to improve our data ecosystem, now comprising 61+ million pages of biodiversity literature. Last year’s Technical Priorities underscored BHL’s steadfast commitment to data quality by focusing on both upstream and downstream data flows. Notable milestones include delivering refined taxonomic data to researchers, implementing interface improvements based on user feedback, and forging data pipelines for existing and new downstream data consumers.

Collage of logos from downstream data consumers, including GBIF, figshare, WoRMS, gna, Tropicos, Wikimedia, Crossref, OCLC, BioStor, Catalogue of Life, unpaywall, and DPLA

A selection of BHL’s major downstream data consumers

Global Names Upgrades

A standout achievement in 2023 was the upgrade of BHL’s Global Names taxonomic intelligence tools, marking a significant leap forward in ensuring the accuracy and comprehensiveness of scientific name detection and verification. These upgrades replace the now-deprecated GNRD tool and GNresolver services.

BHLIndex is a remarkable improvement over past performance of the Global Names taxonomic intelligence suite. A decade ago, finding and verifying all names in BHL took 45 days of computational processing time—today, the task takes less than 6 hours. Moreover, there is a notable increase in the overall number of name instances in the BHL corpus, underscoring significant improvements made in both speed and comprehensiveness. The Global Names Team observed the following performance improvements:

  • name-finding in 275,000 volumes, 60+ million pages: 2.5 hours;
  • name-verification of 23 million unique name-strings: 3 hours; and
  • preparing a CSV file with 250 million name occurrences/verification records: 40 minutes.
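
As a rough back-of-the-envelope check on the figures above, the sketch below derives the implied throughput and overall speedup. It assumes the "60+ million pages" figure is exactly 60 million and that the 45-day baseline ran around the clock; both are simplifications for illustration, not published numbers.

```python
# Figures quoted in the list above. The page count is rounded down to
# 60 million, which is an assumption made for this illustration.
PAGES = 60_000_000
UNIQUE_NAME_STRINGS = 23_000_000
FIND_HOURS = 2.5
VERIFY_HOURS = 3.0
OLD_DAYS = 45  # assumed to be continuous, 24-hour processing

SECONDS_PER_HOUR = 3600

# Throughput of each stage.
pages_per_second = PAGES / (FIND_HOURS * SECONDS_PER_HOUR)
names_per_second = UNIQUE_NAME_STRINGS / (VERIFY_HOURS * SECONDS_PER_HOUR)

# Finding plus verifying, compared with the 45-day baseline.
new_total_hours = FIND_HOURS + VERIFY_HOURS
speedup = (OLD_DAYS * 24) / new_total_hours

print(f"name-finding: about {pages_per_second:,.0f} pages per second")
print(f"name-verification: about {names_per_second:,.0f} name-strings per second")
print(f"finding + verifying: {new_total_hours} hours, roughly {speedup:.0f}x faster")
```

Under these assumptions, finding and verifying together take 5.5 hours, which is consistent with the "less than 6 hours" figure and works out to a speedup on the order of 200 times.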

Additionally, Mike Lichtenberg, BHL Lead Developer, found that the BHLIndex upgrade yielded an additional 17,790,590 access points to scientific species names in the BHL corpus!

Graph indicating the number of scientific names found on BHL pages before (245,237,984) and after (263,028,574) implementing the new BHLIndex.

The BHL corpus now contains 17,790,590 additional scientific names.
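
The figure of 17,790,590 follows directly from the before and after counts shown in the graph. A minimal sketch of the arithmetic, which also expresses the gain as a percentage (the percentage is derived here and is not a figure reported by the team):

```python
# Name counts reported before and after the BHLIndex upgrade.
names_before = 245_237_984
names_after = 263_028_574

# Absolute and relative increase in name instances.
added = names_after - names_before
percent_increase = added / names_before * 100

print(f"{added:,} additional name instances ({percent_increase:.2f}% increase)")
```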

In tandem with the BHLIndex taxonomic intelligence upgrades, the BHL Technical Team continues its ongoing efforts to enhance the underlying OCR data that powers the Global Names tools. Stay tuned for an update on the OCR Reprocessing Project later this year.

BHL is incredibly grateful to Dr. Dima Mozzherin, Dr. Geoff Ower, and all past and present contributors to the Global Names Architecture applications for their collaborative efforts with the BHL Technical Team to make these new upgrades and an additional 17+ million taxonomic access points available to BHL’s global user base. The continuous improvement of taxonomic data in BHL remains essential, driving key functionalities for our scientific researchers, including taxonomic search, species bibliographies, and interlinkages with other major taxonomic databases on the web.

BHL User Interface Enhancements

BHL has 588 contributors and counting. The addition of a Contributor Facet on the BHL search results page allows users to filter, facet, and home in on publications from specific BHL contributor collections for the first time. This feature is particularly important to BHL Members and Affiliates who may want to refine their search to see only digitized content from their home collections to answer local research questions.

Example of BHL search results with new Contributor facet in left navigation panel

The contributor facet is a new way to home in on a specific institutional collection within search results.

In addition to the new contributor facet, BHL has enhanced Supplementary links for Titles by allowing for additional links and incorporating a controlled vocabulary to precisely describe the external resource being linked to.

Example of a BHL title page listing External Resources (a Harvard collection guide for the Walter Deane papers) associated with the title in BHL

Supplementary links allow BHL Partners to curate their collections with relevant links from their home institution like a finding aid, digital exhibition, or collection guide.

Partner-curated supplementary links allow BHL to link out to other relevant resources such as a collection guide, digital exhibition, or an archival finding aid. Kudos to the BHL Collection Committee for keeping a finger on the pulse of BHL user needs and for requesting and scoping the requirements for these new interface features and enhancements.

Forging BHL Data Pipelines

Downstream consumers of BHL data have been a big focus for the BHL Technical Team this past year as well. The Global Biodiversity Information Facility (GBIF) and Crossref play integral roles in advancing biodiversity knowledge by facilitating seamless access to valuable scientific data sourced from BHL. Having accurate BHL data in these repositories furthers the Technical Team’s charge “to develop tools and services to facilitate greater access, interoperability, and reuse of BHL content and data.”


GBIF functions as a global repository for crucial biodiversity data, fostering a more comprehensive understanding of Earth’s biological diversity. Exposing species occurrence data from BHL in GBIF offers a tangible avenue for BHL partners to actively contribute to climate change and biodiversity conservation initiatives.

A significant accomplishment for the BHL Technical Team was the deposit of data at GBIF this year. This milestone marked the culmination of 18 months of intense investigation, which was condensed into a 10-minute presentation given at TDWG 2023 entitled “Unearthing the Past for a Sustainable Future: Extracting and transforming data in the Biodiversity Heritage Library for climate action.”

Cover slide for TDWG 2023 presentation titled Unearthing the past for a sustainable future: extracting and transforming data in the Biodiversity Heritage Library for climate action

The BHL Technical Team presented at TDWG 2023 this year.

The presentation underscores the BHL Technical Team’s dedication to aligning BHL’s data management priorities with international climate-related initiatives and sustainability goals and, most importantly, to addressing global environmental challenges. To learn more, check out the conference recording and the related blog post Illuminating BHL’s Dark Data: Citizen Scientists and AI Unlock Key Biodiversity Data in GBIF.

Much of the BHL Technical Team’s investigative work was conducted in collaboration with the BHL Transcription Upload Tool Working Group (TUTWG), whose members performed crucial tasks such as crowdsourcing and analyzing a Handwritten Text Recognition sample set and providing already-transcribed materials and data outputs to process and deposit with GBIF. In addition to their contributions to BHL’s data extraction efforts, the working group members also:

  • Provided user documentation and training on uploading transcription text files to BHL;
  • Updated archival material metadata in BHL for improved findability;
  • Defined an allowable mark-up policy for BHL Partner transcriptions;
  • Scoped requirements for a new OCR text indicator in the BHL interface; and
  • Contributed a valuable GLAM resource to the BHL knowledge base in the form of the Transcription Platform Comparison chart to aid institutions in the overwhelming task of selecting a transcription platform suited to their particular needs.

Transcription platform comparison chart listing four platforms (DigiVol, FromThePage, Wikisource, and Zooniverse) and their key features.

Explore the Transcription Platform Comparison Chart for more details.

A huge shout out goes to all BHL TUTWG members and collaborators for their invaluable contributions.


Complementing the new pilot GBIF data pipeline is the improvement of an existing one. Crossref ensures the comprehensive and accurate association of BHL’s bibliographic metadata with digital object identifiers (DOIs), enabling the efficient citation and linking of scholarly content in the modern publishing environment. Ensuring that historic knowledge is served up in modern discovery layers can be incredibly challenging work, and we are so grateful to the Persistent Identifier Working Group (PIWG) for their consistent efforts in the form of energy, time, and enduring patience in 2023. The working group, in collaboration with the BioStor project, has been helping BHL broaden its collection horizons and illuminate BHL dark data since 2012.

In 2023, the group piloted and brought to production the submission of improved DOI data deposits for titles that are part of a monographic series (which pose notoriously difficult bibliographic conundrums for BHL Staff). The use of a richer deposit schema allows for a more complete set of metadata to flow downstream to Crossref.

And More Data 

Since last year, BHL has been openly publishing all of its data on the Smithsonian Figshare Data Repository. The BHL Open Data Collection uses the Data Catalog Vocabulary (DCAT) to describe and document what exactly is in all that data. Due to popular demand, we’ve added two new BHL datasets for our users to experiment with:

Richard, Joel; Dearborn, Jacqueline (2023). BHL Optical Character Recognition (OCR) – Full Text Export (new). Smithsonian Libraries and Archives. Dataset.

Dearborn, Jacqueline; Lichtenberg, Mike (2023). BHL Flickr Harvest Data. Smithsonian Libraries and Archives. Dataset.
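
To give a sense of what a DCAT description of one of these datasets might look like, here is a minimal sketch that builds a JSON-LD record in Python. The property names come from the DCAT and Dublin Core Terms vocabularies, and the title, creators, publisher, and year are taken from the citation above. Everything else is illustrative: the download URL is a placeholder, the media type is assumed, and the actual records in the BHL Open Data Collection may be structured differently.

```python
import json

# Illustrative DCAT record for one dataset, expressed as JSON-LD.
dataset = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@type": "dcat:Dataset",
    "dct:title": "BHL Flickr Harvest Data",
    "dct:creator": ["Dearborn, Jacqueline", "Lichtenberg, Mike"],
    "dct:publisher": "Smithsonian Libraries and Archives",
    "dct:issued": "2023",
    "dcat:distribution": [
        {
            "@type": "dcat:Distribution",
            # Assumed file format; the real distribution may differ.
            "dcat:mediaType": "text/csv",
            # Hypothetical placeholder, not the real download location.
            "dcat:downloadURL": "https://example.org/placeholder.csv",
        }
    ],
}

print(json.dumps(dataset, indent=2))
```

A record along these lines lets a data catalog or harvester discover what a dataset contains and where its files live without having to download the data first.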

The Flickr harvest data was the product of a workshop called Transforming Biodiversity Heritage Library Images – Data Modeling with OpenRefine. This event was hosted in collaboration with Wikimedia Foundation’s Giovanna Fontenelle and Wikimedian Sandra Fauconnier as part of Wikimedia’s Image Description Month.

BHL Staff, Wikimedians, Flickr staff, and all biodiversity image enthusiasts together are beginning the process of mapping and loading BHL Flickr image data to Structured Data on Commons (SDC), a new Wikimedia Commons initiative. This project was the second top-voted priority from the BHL Wikimedia white paper entitled Unifying Biodiversity Knowledge to Support Life on a Sustainable Planet.

The Importance of Good Documentation

Fundamental to all technology projects is good documentation, and BHL is no different. The BHL Consortium’s collective endeavor to deliver over 500 years of data about biodiversity to the world relies crucially on the availability of comprehensive documentation about BHL’s extensive data ecosystem. To this end, the BHL Technical Team is working hard to ensure that BHL’s data is “useful, usable, and used” by putting greater emphasis on documentation this year. We want to make sure that all BHL users and collaborators possess the knowledge to effectively leverage BHL data for their research, apps, discovery layers, visualizations and so much more!

BHL data flows diagram listing external data entry, internal data entry, data processes, and internet consumers of data

Helping users understand how data flows in and out of the BHL data ecosystem and sharing it with the world is an overarching goal for the BHL Technical Team.

With over 17 years of development, BHL’s data model has matured significantly, enabling the harvesting, processing, and delivery of rich and robust data about our collections. The BHL Technical Team’s commitment to a high-quality, well documented data ecosystem underpins all BHL initiatives aimed at enhancing the discovery and access to BHL Partner and Contributor collections worldwide.

For a sneak peek of what is in store for the year ahead, check out the 2024 BHL Technical Priorities.

Written by JJ Dearborn

JJ Dearborn joined the Biodiversity Heritage Library as Data Manager in 2022 and works to open up BHL data to the larger biodiversity community and the world. As a longtime advocate for the free-culture movement, she has worked on open access projects for the Peabody Essex Museum, Harvard University’s Department of Organismic and Evolutionary Biology, the Smithsonian Museum of Natural History, Harvard-Smithsonian Center for Astrophysics, the City of Boston, and the State of Massachusetts.