BHL Technical Development: Year in Review
Building a More Resilient BHL: Improving Accessibility and Expanding Global Reach
What does it take to make millions of pages of biodiversity literature accessible to a global audience? For the Biodiversity Heritage Library (BHL) Technical Team, 2024 was a year of transformative milestones, innovations, and challenges overcome, all aimed at strengthening BHL’s mission of advancing biodiversity research.
BHL Data Continues to Improve
This year the BHL Technical Team made dramatic strides in improving BHL’s full-text search precision and retrieval. The Team finalized a two-year OCR reprocessing project aimed at upgrading BHL’s text files, which will improve overall full-text search accuracy by approximately 30% and taxonomic name recognition by 17%. Stay tuned for an in-depth update on the OCR reprocessing project from BHL’s Technical Coordinator, Joel Richard, later this year.

Reprocessing BHL text files improves full-text search accuracy and taxonomic name recognition. Image Credit: https://doi.org/10.3897/biss.7.112436
Additionally, fruitful collaborations with the Smithsonian Libraries and Archives (SLA) and the Smithsonian Transcription Center (STC) brought another 43,000+ pages of human-transcribed text into BHL from the Smithsonian Field Books Collection, adding more than 151,000 scientific names to the BHL search index. For more on Smithsonian transcription efforts, see the blog post The Power of Community Science: How Smithsonian Volunpeers Transform Scientific Field Notes.
These improvements not only expand access points across BHL’s high-value archival materials but also help to interlink those materials with taxonomic databases across the web. BHL’s handwritten materials are notoriously difficult to search because text recognition engines do not handle handwriting well. Handwritten Text Recognition (HTR) engines are changing this landscape quickly, but human-transcribed materials remain the gold standard for data quality for unique objects in BHL such as field notes, expedition logs, handwritten tables, and other valuable primary source research material.

Handwritten Text Recognition (HTR) engines are a vast improvement over Optical Character Recognition (OCR) engines for processing handwritten materials, as seen in this example from BHL materials, but human transcription remains the gold standard in terms of quality. Image Credit: https://doi.org/10.3897/biss.7.112436
Lastly, BHL continues to strengthen its connections with global linked data platforms like Wikidata, which has become a rich collaborative data ecosystem that drives major search engines like Google, DuckDuckGo, and others. By adding more than 63,000 Wikidata Q IDs for BHL Titles, we continue to open up new knowledge pathways for researchers to explore and connect information in innovative ways well beyond the BHL website. To understand why persistent identifiers interlinked with knowledge bases on the web are critical for research infrastructure platforms like BHL, check out the related post: BHL is Round Tripping Persistent Identifiers with the Wikidata Query Service.
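For readers curious what looking up Q IDs for BHL Titles might look like in practice, here is a minimal Python sketch against the public Wikidata Query Service. It is an illustration only, not the BHL Technical Team's actual workflow. The property ID `P4327` (assumed here to be the Wikidata property for BHL bibliography/title IDs), the helper names, and the `User-Agent` string are all assumptions; please verify the property on Wikidata before relying on it.

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# ASSUMPTION: P4327 is the Wikidata property holding BHL title IDs.
BHL_TITLE_PROPERTY = "P4327"


def build_query(bhl_title_ids, prop=BHL_TITLE_PROPERTY):
    """Build a SPARQL query that finds Wikidata items for the given BHL title IDs."""
    # int() guards against anything other than numeric IDs entering the query.
    values = " ".join(f'"{int(title_id)}"' for title_id in bhl_title_ids)
    return (
        "SELECT ?item ?bhlId WHERE { "
        f"VALUES ?bhlId {{ {values} }} "
        f"?item wdt:{prop} ?bhlId . "
        "}"
    )


def parse_bindings(result):
    """Map BHL title ID -> Wikidata Q ID from a SPARQL JSON results document."""
    mapping = {}
    for row in result["results"]["bindings"]:
        # Item values are entity URIs such as http://www.wikidata.org/entity/Q42.
        qid = row["item"]["value"].rsplit("/", 1)[-1]
        mapping[row["bhlId"]["value"]] = qid
    return mapping


def fetch(query):
    """Send the query to the Wikidata Query Service (requires network access)."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            # Hypothetical identifier; use one that describes your own tool.
            "User-Agent": "bhl-qid-example-sketch/0.1",
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling `parse_bindings(fetch(build_query([...])))` with a list of BHL title IDs would return a dictionary from each matched title ID to its Q ID. Titles without a Wikidata item are simply absent from the result.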
What are Virtual Items?
Equally transformative to BHL’s data quality gains was the introduction of “virtual items,” a feature that allows born-digital articles and e-content from data feeds like OAI-PMH and Crossref APIs to be grouped into cohesive BHL items. Virtual items enable modern journal articles to be presented alongside BHL’s traditionally digitized historic books and journals. To learn more about Virtual Items, please check out the BHL FAQ.

Although virtual items are generated from new data sources in BHL, the experience will hopefully be a seamless one for the average BHL user. Image Credit: BHL FAQ
Incorporating modern scholarly articles from non-traditional data sources has in the past posed challenges to BHL’s information architecture because it has meant accommodating additional publication standards, proactively and creatively sourcing more granular metadata required at the article level, and creating user interfaces capable of displaying multiple levels of resource description in an intuitive way for our users. By overcoming these challenges BHL is ensuring that the platform can accommodate not only historic literature but also modern biodiversity publications, ultimately helping bridge the gap between past and present biodiversity research.

The data flow journey for Virtual Items is actually quite different from traditional BHL content. For this new feature, the BHL Technical Team had to carefully consider the various data sources and how those would flow into the BHL data ecosystem and be presented to users. Image Credit: BHL Technical Team Collaborative Mapping
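To make the idea of grouping articles into virtual items a little more concrete, here is a small Python sketch. It is not BHL's actual data model or ingest code. The `Article` and `VirtualItem` classes, their fields, and the choice to group by journal and volume are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Article:
    """A born-digital article record (hypothetical fields)."""
    title: str
    journal: str
    volume: str
    doi: str
    source: str  # where the record came from, e.g. "crossref" or "oai-pmh"


@dataclass
class VirtualItem:
    """A container grouping articles that belong to the same journal volume."""
    journal: str
    volume: str
    articles: List[Article] = field(default_factory=list)


def group_into_virtual_items(articles):
    """Group article records by (journal, volume), preserving input order."""
    items = {}
    for article in articles:
        key = (article.journal, article.volume)
        if key not in items:
            items[key] = VirtualItem(journal=article.journal, volume=article.volume)
        items[key].articles.append(article)
    # Dictionaries keep insertion order, so items appear in first-seen order.
    return list(items.values())
```

In this sketch, records arriving from different feeds end up in the same container as long as they share a journal and volume, which mirrors the goal of presenting modern articles as cohesive items regardless of where the metadata came from.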
Facing Challenges, Strengthening Resilience
Despite major wins this past year for the BHL Technical Team, 2024 was not without its hurdles. After BHL completed several server migrations, a series of DDoS attacks in October targeting the Internet Archive (IA), a key partner and long-time host of BHL content, temporarily disrupted access to BHL materials. These events highlighted the importance of building a more resilient infrastructure for BHL, and planning failsafes and additional data backups has been top of mind for the BHL Technical Team.
During the IA outage, we heard from many BHL users how the disruption affected their vital research and how grateful they were when access was restored.
“It is hard to quantify how vital a resource is until it is removed from reach! […] Thank you again for all the wonderful work done by BHL.”
“Life as I know it has come to a standstill. The @internetarchive is offline, which also affects @BioDivLibrary! What do I do!!! 😱”
“A huge amount of literature is only available through either Internet Archive itself or Biodiversity Heritage Library, which is hosted on the Internet archive. Stick in the wheel towards my work.”
“Dear Madam/Sir Really wonderful the BHL is back and provides accessibility to several hundred-year-old literature. Over the past two weeks [I have] not been able to do the majority of my work related to taxonomic curation of plant names of India. Thank you.”
IA continues to be a critical digitization and hosting partner for BHL. If anything, this year’s DDoS attacks on IA have prompted BHL and IA to strengthen their infrastructure against future attacks.

Servers at the Internet Archive headquarters in San Francisco, CA. Image Credit: Jason Scott, Internet Archive | Wikimedia Commons.
Behind the Scenes: Building for the Future
In pursuit of a more resilient infrastructure for BHL, another major milestone was BHL’s inclusion in the AWS Open Data Sponsorship Program. Not only does the program provide BHL with free S3 storage for its data, but BHL’s engagement with AWS also marks the beginning of an exploration into how cloud-based technologies can transform BHL’s back-end infrastructure. While BHL currently resides on on-premises storage and systems, cloud services offer new opportunities to rethink and refactor BHL to improve scalability, performance, and cost-efficiency. The benefits of a cloud-based or hybrid BHL are myriad, including complementary storage, faster page-serving speeds in low-bandwidth regions, and greater page persistence critical for biodiversity informatics applications. For the Press Release on BHL’s AWS Open Data Sponsorship, see Biodiversity Heritage Library Datasets Now Openly Accessible on the Amazon Web Services Cloud.

Exploring BHL in the Cloud in 2025 will be a major focus for the BHL Technical Team. We are very excited to see where the journey takes us! Image Credit: Canva AI Image Generator
Looking Ahead
As BHL moves into 2025, the BHL Technical Team’s focus remains on exploring cloud-based solutions and safeguarding BHL as a critical research infrastructure that provides access to more than 62 million pages of knowledge about life on Earth.
By embracing innovation and prioritizing accessibility, BHL will continue to evolve as a cornerstone resource for biodiversity research, empowering scientists and scholars to advance biodiversity knowledge and discovery.
For more on what’s in store for the year ahead, check out the 2025 BHL Technical Priorities.