BHL Technical Development: Year in Review
Critical Upgrades
For BHL, 2022 was a year to focus on critical upgrades to the BHL platform to ensure the sustainability of our services for our global users. Although BHL’s basic technical infrastructure remains the same, reflecting years of refinement, knowledge, and reliability, a few updates were definitely in order. BHL’s major upgrades included:
- Upgraded .NET to version 6 (see What’s New in Version 6? for a rundown of the benefits for BHL)
- Upgraded Elasticsearch from 5.4.2 to 7.17 (search functions remain the same)
- Converted SOAP services to REST (two SOAP services used by the BHL website and utility applications have been converted to REST services implemented in .NET 6)
- Upgraded Global Names GNFinder tool from 0.19.5 to 1.0.0
Most of these upgrades were “behind-the-scenes” work and are not noticeable to most of our users. However, keeping up with these important enhancements is a crucial component of any technology project. To learn more about the technology stack that drives BHL, click on the above links for a deep dive into some of the products and services we utilize to make BHL a reality.
New Features and Enhancements
In 2022, BHL deployed three exciting new user-driven features on the BHL website: the BHL text source indicator, the author record sidebar, and pre-generated article PDFs.
Text Source Indicator
Many of our users would like to know where BHL’s text output comes from. The Show Text tab now tells BHL users whether the content has been manually transcribed by a human or automatically generated by a machine via an OCR engine like Tesseract or ABBYY FineReader.
The Text Source Indicator helps our users judge what quality to expect from the item they are viewing.
Many thanks to the BHL Transcription Upload Tool Working Group (TUTWG) for this important feature request and for providing the critical feedback to ensure the Text Source Indicator is useful to BHL’s global user base.
Author Record Sidebar
BHL is now exposing author data to help open up new research pathways beyond the BHL portal. The Author Record sidebar grew out of user requests from two Wikimedians, Siobhan Leachman and Andy Mabbett, who asked to see Wikidata Q numbers exposed alongside author names to further facilitate downstream name disambiguation work.
A special thanks to Siobhan, Andy, and the BHL Cataloging and Metadata Committee for their hard work and dedication to curating and disambiguating author names and helping to realize the new author records display in BHL. Look out for a forthcoming blog post from the Committee, during International Love Data Week 2023, which will detail how the group was able to harvest 88,000+ author identifiers from Wikidata and bring them back into BHL’s database.
Pre-generated Article PDFs
In 2022, the BHL Technical Team announced pre-generated PDFs for articles defined in BHL. While BHL had previously allowed users to download custom PDFs, the benefits of pre-generated article PDFs are manifold:
- No waiting.
- No selecting pages.
- The PDF contains embedded, searchable, copy-paste-able text.
- The PDF contains rich XMP-based metadata about the article.
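For readers curious about the embedded XMP metadata, here is a minimal sketch of how it could be pulled out of a downloaded PDF. It assumes the XMP packet is stored uncompressed, which is usual for PDF metadata streams but not guaranteed; the function name `extract_xmp` is hypothetical and is not part of any BHL tool.

```python
import re
from typing import Optional

# XMP is an XML packet wrapped in an <x:xmpmeta> element. PDF metadata streams
# are usually left uncompressed, so a byte-level search is often sufficient.
XMP_PATTERN = re.compile(rb"<x:xmpmeta\b.*?</x:xmpmeta>", re.DOTALL)


def extract_xmp(pdf_bytes: bytes) -> Optional[str]:
    """Return the first XMP packet found in the PDF bytes, or None if absent."""
    match = XMP_PATTERN.search(pdf_bytes)
    if match is None:
        return None
    # XMP in PDFs is normally UTF-8; replace undecodable bytes rather than fail.
    return match.group(0).decode("utf-8", errors="replace")
```

To try it, read a downloaded article PDF in binary mode and pass the bytes to `extract_xmp`; the returned XML can then be parsed with any XML library.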
The addition of these pre-generated PDFs to article landing pages in BHL also allows content aggregators like Unpaywall to find BHL’s open access versions of paywalled literature and serve BHL content to their user base via their free browser extension.
Many thanks to BHL’s Persistent Identifier Working Group (PIWG) and their ongoing communications with the Unpaywall Development Team to ensure BHL content is surfaced via the Unpaywall browser extension.
Data Quality Improvements
The BHL Technical Team and all of BHL’s Committees and Working Groups care passionately about the quality of BHL’s data. To support the various priorities around data management in BHL’s Strategic Plan, BHL announced the new position of BHL Data Manager in 2022. In this new role, the Data Manager leads the effort to develop and implement a comprehensive view of how BHL collections can be optimized to support the interoperability of BHL data in the larger biodiversity community.
Like any big data repository, BHL contains anomalies, errors, and omissions as a result of aggregating records from hundreds of contributors and technology projects distributed all over the world. The way we collectively categorize, classify, and describe our materials can vary vastly across the many collaborating organizations that make up the BHL network. In 2022, the data quality improvements of note included reprocessing BHL’s OCR files, updating the material type facet, and creating an open data collection with a new 40GB OCR text export.
Reprocessing BHL’s OCR files
BHL’s digitization partner, the Internet Archive (IA), has upgraded their OCR engine from ABBYY FineReader to the Tesseract Open Source OCR engine. BHL is working with IA to reprocess some of BHL’s oldest content with the newest available version of Tesseract OCR. Approximately 120,000 BHL items are eligible to be upgraded and at our current rate of progress of approximately 100 items per day, it will take close to two years to complete the BHL OCR reprocessing project. For more information, check out OCR Improvements: An Early Analysis by BHL’s Technical Coordinator, Joel Richard.
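As a back-of-the-envelope check on that timeline, the sketch below divides the eligible items by the daily rate. The function name and the assumption of a constant rate across the whole backlog are illustrative only, not BHL’s actual planning method. With the figures quoted above, this simple division gives a little over three years, which is longer than the two-year figure; that estimate may reflect a faster expected rate or items that had already been reprocessed.

```python
def days_to_reprocess(items_remaining: int, items_per_day: float) -> float:
    """Estimate days needed at a constant daily rate (an illustrative assumption)."""
    if items_per_day <= 0:
        raise ValueError("items_per_day must be positive")
    return items_remaining / items_per_day


# Figures from the paragraph above: ~120,000 eligible items, ~100 items per day.
days = days_to_reprocess(120_000, 100)
years = days / 365
print(f"{days:.0f} days, about {years:.1f} years")  # 1200 days, about 3.3 years
```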
Material Type Facet Update
Archival materials digitized for BHL are not like books and journals. The content is non-standard and thus often presents additional challenges for BHL staff. Before being sent for digitization, archival materials must be intellectually organized and cataloged, and they frequently must undergo extensive preservation treatment. In doing this work, BHL’s catalogers and archivists collaborate to answer some very complex questions:
- Are these materials considered published or unpublished?
- Are they in-copyright or out-of-copyright?
- How should these materials be divided and what should comprise an intellectual unit?
The answers to these questions vary and depend on the expertise, local digitization workflows, and technological constraints of each organization. The sheer diversity of cataloging and digitization methods across the BHL partner network is truly astonishing, but it does not always result in precise search and retrieval for our users.
To improve search precision, BHL has decided to classify all archival materials as “Manuscript language material (Archival material)”, which has resulted in 4,637 records being updated in the BHL database. For our users, who rely on search facets to drill down on relevant content, this ultimately means more relevant results, more archival content, and more happy BHL users! Many thanks to BHL’s Transcription Upload Tool Working Group (TUTWG) for their work and group analysis to make this major data update a reality.
Open Data Collection with new 40GB OCR Text Export
In the Fall of 2022, the BHL Technical Team worked with Keri Thompson, Data Management Specialist from Smithsonian’s OCIO’s Research Computing Office, to publish BHL data exports as FAIR data. BHL’s data has always been available for download on our Developer and Data Tools page, but as an open access leader, BHL has decided to revamp each data export to include a DOI, a data dictionary, and a multitude of citation formats. Most importantly, the data is now hosted on the Smithsonian Institution’s Figshare instance, which serves as “an open platform for hosting and sharing the raw material of Smithsonian research.” BHL’s data has also been cataloged using the W3C data description standard, Data Catalog Vocabulary (DCAT). Because it is described with DCAT, BHL data could now be harvested by federally mandated data repositories like Data.gov.
Lastly, a very exciting data export has been added to the BHL Open Data Collection. Data miners everywhere, this one’s for you:
BHL Optical Character Recognition (OCR) – Full Text Export (https://doi.org/10.25573/data.21422193.v4)
The BHL OCR full text export is our largest data export, representing the full textual corpus for all items in BHL. It is so large that we really struggled to find a place to host the file. The file is updated on a monthly basis and posted to the link above.
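Because the export has a DOI, a formatted citation can usually be requested directly from the DOI resolver through content negotiation. The sketch below only builds such a request. It assumes the DOI’s registration agency supports the `text/x-bibliography` media type, and the helper name `citation_request` is hypothetical rather than part of any BHL or Figshare API.

```python
from urllib.request import Request

# DOI of the BHL OCR full text export, as given above.
DOI_URL = "https://doi.org/10.25573/data.21422193.v4"


def citation_request(doi_url: str, style: str = "apa") -> Request:
    """Build a request asking the DOI resolver for a formatted citation.

    The Accept header follows the DOI content negotiation convention;
    whether a given style is honored depends on the registration agency.
    """
    accept = f"text/x-bibliography; style={style}"
    return Request(doi_url, headers={"Accept": accept})


req = citation_request(DOI_URL)
```

Passing `req` to `urllib.request.urlopen` and decoding the response body should return the citation as plain text, provided the resolver honors the header.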
BHL Technical Team Goes Agile
The BHL Technical Team had a transformational year. Not only did the Team complete critical upgrades, deliver exciting new features for BHL’s users, and improve overall data quality; we also adopted the popular Agile project management methodology and Scrum framework. Agile is used by 3 out of every 4 technology projects to improve team communications, find new efficiencies, adapt to change, and maximize resources.
We are still refining new workflows, but overall, the “BHL Agile experiment” has been a boon for BHL technical development. Our Team now has a bird’s-eye view of all of the work in the pipeline. Knowing where we’ve been, where we are going, and focusing on continuous iterative improvement has helped us deliver more meaningful value to BHL Partners and users while furthering our collective mission of making biodiversity literature openly available to the world.
Kudos to the entire BHL Technical Team and Secretariat for a productive 2022!
A special thanks to BHL’s Lead Developer, Mike Lichtenberg, and Technical Coordinator, Joel Richard, for all of your hard work, knowledge, and dedication to the BHL platform, data, and our users. BHL is so incredibly lucky to benefit from your many talents!
For a sneak peek of what is in store for the year ahead, check out the 2023 BHL Technical Priorities.