BHL Technical Development: Year in Review

Building a More Resilient BHL: Improving Accessibility and Expanding Global Reach

What does it take to make millions of pages of biodiversity literature accessible to a global audience? For the Biodiversity Heritage Library (BHL) Technical Team, 2024 was a year of reaching transformative milestones, introducing innovations, and overcoming challenges, all aimed at strengthening BHL’s mission of advancing biodiversity research.

BHL Data Continues to Improve

This year the BHL Technical Team made dramatic strides in improving BHL’s full-text search precision and retrieval. The Team finalized a two-year OCR reprocessing project aimed at upgrading BHL’s text files, which will improve overall full-text search accuracy by approximately 30% and taxonomic name recognition by 17%. Stay tuned for an in-depth update on the OCR reprocessing project from BHL’s Technical Coordinator, Joel Richard, later this year.

Snippet of improved OCR text and a list of benefits of reprocessing OCR, such as improved spelling, improved name-finding results, and more names revealed.

Reprocessing BHL text files improves full-text search accuracy and taxonomic name recognition. Image Credit: https://doi.org/10.3897/biss.7.112436

Additionally, fruitful collaborations with the Smithsonian Libraries and Archives (SLA) and the Smithsonian Transcription Center (STC) brought another 43,000+ pages of human-transcribed text into BHL from the Smithsonian Field Books Collection, which added more than 151,000 scientific names to the BHL search index. For more on Smithsonian transcription efforts, see the blog post entitled The Power of Community Science: How Smithsonian Volunpeers Transform Scientific Field Notes.

These improvements not only expand access points across BHL’s high-value archival materials but also help to interlink those materials with taxonomic databases across the web. BHL’s handwritten materials are notoriously difficult to search because text recognition engines do not handle handwriting well. Handwritten Text Recognition (HTR) engines are changing this landscape quickly, but human-transcribed materials remain the gold standard for data quality for unique objects in BHL such as field notes, expedition logs, handwritten tables, and other valuable primary source research material.

Four images comparing original handwritten text in BHL, processed via ABBYY FineReader OCR, processed via Tesseract OCR, and processed via Google Cloud Vision HTR

Handwritten Text Recognition (HTR) engines are a vast improvement over Optical Character Recognition (OCR) engines for processing handwritten materials, as seen in this example from materials in BHL, but human transcription remains the gold standard in terms of quality. Image Credit: https://doi.org/10.3897/biss.7.112436

Lastly, BHL continues to strengthen its connections with global linked data platforms like Wikidata, which has become a rich collaborative data ecosystem that drives major search engines like Google, DuckDuckGo, and others. By adding more than 63,000 Wikidata Q IDs for BHL Titles, we continue to open up new knowledge pathways for researchers to explore and connect information in innovative ways well beyond the BHL website. To understand why persistent identifiers interlinked with knowledge bases on the web are critical for research infrastructure platforms like BHL, check out the related post: BHL is Round Tripping Persistent Identifiers with the Wikidata Query Service.
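To make the idea of linking BHL Titles to Wikidata Q IDs a little more concrete, here is a minimal sketch of how a lookup query for the Wikidata Query Service might be assembled. It is an illustration only, not BHL's actual workflow: the function name is hypothetical, the use of property P4327 (BHL bibliography ID) as the link between a BHL Title and a Wikidata item is an assumption, and the title IDs in the usage note are placeholders.

```python
def build_qid_lookup_query(bhl_title_ids, property_id="P4327"):
    """Build a SPARQL query that maps BHL title IDs to Wikidata items.

    Assumption: P4327 (BHL bibliography ID) is the Wikidata property that
    stores the BHL title ID. The `wdt:` prefix is predefined by the
    Wikidata Query Service, so it is not declared here.
    """
    # Wikidata stores external identifiers as string literals, so quote each one.
    values = " ".join(f'"{int(title_id)}"' for title_id in bhl_title_ids)
    return (
        "SELECT ?item ?bhlId WHERE {\n"
        f"  VALUES ?bhlId {{ {values} }}\n"
        f"  ?item wdt:{property_id} ?bhlId .\n"
        "}"
    )
```

Calling build_qid_lookup_query([706, 2087]) returns a query string that could be pasted into the Wikidata Query Service; each result row would pair a BHL title ID with the Q ID of the matching Wikidata item, if one exists.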

What are Virtual Items?

As transformative as BHL’s data quality gains was the introduction of “virtual items,” a feature that allows born-digital articles and e-content from data feeds such as OAI-PMH and the Crossref APIs to be grouped into cohesive BHL items. Virtual items enable modern journal articles to be presented alongside BHL’s traditionally digitized historic books and journals. To learn more about Virtual Items, please check out the BHL FAQ.

Comparison of a traditional digitized volume uploaded to BHL versus a modern born-digital virtual item uploaded to BHL

Although virtual items are generated from new data sources in BHL, the experience will hopefully be a seamless one for the average BHL user. Image Credit: BHL FAQ

Incorporating modern scholarly articles from non-traditional data sources has in the past posed challenges to BHL’s information architecture because it has meant accommodating additional publication standards, proactively and creatively sourcing more granular metadata required at the article level, and creating user interfaces capable of displaying multiple levels of resource description in an intuitive way for our users. By overcoming these challenges BHL is ensuring that the platform can accommodate not only historic literature but also modern biodiversity publications, ultimately helping bridge the gap between past and present biodiversity research.
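The grouping idea behind virtual items can be sketched in a few lines. This is a simplified illustration under stated assumptions, not BHL's implementation: the Article fields, the choice of (journal, volume) as the grouping key, and the function name are all hypothetical, and real feeds from OAI-PMH or Crossref carry far richer metadata.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    """Hypothetical article-level record harvested from a data feed."""
    title: str
    journal: str
    volume: str
    doi: str


def group_into_virtual_items(articles):
    """Group articles into virtual items keyed by (journal, volume).

    Assumption: one virtual item corresponds to one journal volume.
    Articles keep the order in which they were received.
    """
    items = defaultdict(list)
    for article in articles:
        items[(article.journal, article.volume)].append(article)
    return dict(items)
```

Given three articles, two from volume 12 of one journal and one from volume 13, the function returns two groups, each of which could then be presented to users as a single item alongside digitized volumes.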

Data flow diagram listing the data entities, data processes, and users of BHL data

The data flow journey for Virtual Items is quite different from that of traditional BHL content. For this new feature, the BHL Technical Team had to carefully consider the various data sources and how they would flow into the BHL data ecosystem and be presented to users. Image Credit: BHL Technical Team Collaborative Mapping

Facing Challenges, Strengthening Resilience

Despite major wins this past year for the BHL Technical Team, 2024 was not without its hurdles. After BHL completed several server migrations, a series of DDoS attacks in October targeting the Internet Archive (IA), a key partner and long-time host of BHL content, temporarily disrupted access to BHL materials. These events highlighted the importance of building a more resilient infrastructure for BHL, and having failsafes and additional data back-ups planned has been top of mind for the BHL Technical Team.

During the IA outage, we heard from many BHL users how the disruption affected their vital research and how grateful they were when access was restored.

“It is hard to quantify how vital a resource is until it is removed from reach! […] Thank you again for all the wonderful work done by BHL.”

“Life as I know it has come to a standstill. The @internetarchive is offline, which also affects @BioDivLibrary! What do I do!!! 😱”

“A huge amount of literature is only available through either Internet Archive itself or Biodiversity Heritage Library, which is hosted on the Internet archive. Stick in the wheel towards my work.”

“Dear Madam/Sir Really wonderful the BHL is back and provides accessibility to several hundred-year-old literature. Over the past two weeks [I have] not been able to do the majority of my work related to taxonomic curation of plant names of India. Thank you.”

IA continues to be a critical digitization and hosting partner for BHL. This year’s DDoS attacks on IA have ultimately prompted both BHL and IA to strengthen their infrastructure against future attacks.

Three racks of servers labeled with Internet Archive name and logo

Servers at the Internet Archive headquarters in San Francisco, CA. Image Credit: Jason Scott, Internet Archive | Wikimedia Commons.

Behind the Scenes: Building for the Future

In pursuit of a more resilient infrastructure for BHL, another major milestone was BHL’s inclusion in the AWS Open Data Sponsorship Program. Not only does the program provide BHL with free S3 storage for its data, but BHL’s engagement with AWS also marks the beginning of an exploration into how cloud-based technologies can transform BHL’s back-end infrastructure. While BHL currently resides on on-premises storage and systems, cloud services offer new opportunities to rethink and refactor BHL to improve scalability, performance, and cost-efficiency. The benefits of a cloud-based or hybrid BHL are myriad, including complementary storage, faster page-serving speeds in low-bandwidth regions, and greater page persistence critical for biodiversity informatics applications. For the press release on BHL’s AWS Open Data Sponsorship, see Biodiversity Heritage Library Datasets Now Openly Accessible on the Amazon Web Services Cloud.

AI generated image of a forest with animals and a cloud with the words BHL in the cloud.

Exploring BHL in the Cloud in 2025 will be a major focus for the BHL Technical Team. We are very excited to see where the journey takes us! Image Credit: Canva AI Image Generator

Looking Ahead

As BHL moves into 2025, the BHL Technical Team’s focus remains on exploring cloud-based solutions and safeguarding BHL as a critical research infrastructure that provides access to more than 62 million pages of knowledge about life on Earth.

By embracing innovation and prioritizing accessibility, BHL will continue to evolve as a cornerstone resource for biodiversity research, empowering scientists and scholars to advance biodiversity knowledge and discovery.

For more on what’s in store for the year ahead, check out the 2025 BHL Technical Priorities.

Written by JJ Dearborn

JJ Dearborn joined the Biodiversity Heritage Library as Data Manager in 2022 and works to open up BHL data to the larger biodiversity community and the world. As a longtime advocate for the free-culture movement, she has worked on open access projects for the Peabody Essex Museum, Harvard University’s Department of Organismic and Evolutionary Biology, the Smithsonian Museum of Natural History, Harvard-Smithsonian Center for Astrophysics, the City of Boston, and the State of Massachusetts.