Biodiversity Heritage Library - Program news and collection highlights from BHL

All posts in Tech Updates

Blog Reel, Tech Updates

OCR Improvements: An Early Analysis

Optical character recognition (OCR) plays a critical part in BHL’s contributions to the scientific community. OCR in and of itself is a remarkable achievement, converting images of typewritten text to computer-readable text with “pretty good” accuracy. OCR on handwritten text is an even greater challenge and is beyond the scope of the improvements discussed here. The scientific work that BHL supports demands the best accuracy that we can provide using available tools and, let’s be honest, available budgets.

Recently, our colleagues at the Internet Archive made the transition away from the ABBYY FineReader OCR software to the Tesseract Open Source OCR engine. Over the past year or more, the OCR team at the Internet Archive has adapted and fine-tuned Tesseract to their workflows. Our first impression is that Tesseract OCR is more than “pretty good” in its ability to identify text from the page images provided to it.

The downside is that the Internet Archive has rightfully chosen not to reprocess all existing text content through the Tesseract OCR engine. Doing so would be prohibitively expensive and time-consuming: they have 35 million text-based items, and reprocessing them would take several years and use up resources that could otherwise go toward gathering new content.

However, in the interests of supporting the efforts of the BHL community, the BHL Tech Team is working with our Internet Archive partner to reprocess some of BHL’s oldest content with the newest available version of Tesseract OCR. We are currently in a testing phase, and this blog post details some of our early results.

July 19, 2022 by Joel Richard
BHL News, Blog Reel, Tech Updates

New Article PDF Content Available

The BHL Tech Team is pleased to announce a new form of content available in BHL: Article PDFs. While this may not sound like anything new (after all, we have had a tool to download PDF content for some time), this update changes both how the PDFs are created and maintained and how BHL is viewed by content aggregators on the internet, most notably Unpaywall.

March 14, 2022 by Joel Richard
BHL News, Blog Reel, Tech Updates

What Is BHL’s New Persistent Identifier Working Group DOI’ng?

In October 2020, BHL launched a new working group with a momentous goal: to make the content on BHL persistently discoverable, citable and trackable using DOIs (Digital Object Identifiers).

A DOI is like an electronic fingerprint: a unique and permanent alphanumeric string that provides a persistent link to a piece of content online. Modern publications receive a DOI at the point of publication. A DOI is a key part of a publication’s bibliographic metadata and should be included in any mention or citation of that publication. Reference lists in modern publications are filled with DOIs, allowing readers to click from publication to publication in what is, in theory, a never-ending chain of knowledge.
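
To make the "persistent link" idea concrete, here is a minimal sketch of how a DOI string can be turned into a clickable link through the public https://doi.org/ resolver. The function name `doi_to_url`, the loose validation pattern, and the sample DOI are illustrative assumptions, not part of any BHL tooling.

```python
import re
from urllib.parse import quote

# Loose shape check for modern DOIs: "10.", a 4-9 digit registrant prefix,
# a slash, then a non-empty suffix. This is a heuristic, not a full validator.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi: str) -> str:
    """Return the https://doi.org/ resolver link for a DOI string (hypothetical helper)."""
    cleaned = doi.strip()
    # Accept a few common prefixed forms and reduce them to the bare DOI.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    if not DOI_PATTERN.match(cleaned):
        raise ValueError(f"not a recognizable DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping "/" intact.
    return "https://doi.org/" + quote(cleaned, safe="/")


# "10.1234/example.5678" is a made-up DOI used only for illustration.
print(doi_to_url("doi:10.1234/example.5678"))
```

Because the resolver link is derived from the DOI alone, a citation stays valid even if the publisher later moves the content to a different address.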

May 10, 2021 by Nicole Kearney
BHL News, Blog Reel, Tech Updates

Updates to Bibliography Pages in BHL

We have updated the bibliography pages in BHL to streamline the presentation of information about and metadata export options for content in the Library.

February 11, 2021 by Grace Costantino
BHL News, Blog Reel, Tech Updates

BHL Improves the Speed and Accuracy of its Taxonomic Name Finding Services with gnfinder

BHL has deployed a new taxonomic name finding tool to improve the speed and accuracy of identifying names throughout its 58+ million pages.

BHL is now using Global Names Architecture’s (GNA) gnfinder tool to locate taxonomic names in the BHL corpus. Prior to this deployment, BHL’s name finding services were based on an index of scientific names created by GNA developers six years ago by parsing every page in BHL one by one. This took 45 days to accomplish, and the cost of repeating this process made updating or improving the index infeasible.

The gnfinder tool uses fast, scalable programming languages to significantly reduce computational time. Using Open Source applications in Go and Scala, the tool detects candidate scientific names and compares them to millions of scientific name-strings aggregated by GNA for verification. The new process decreases the time needed for name detection and name verification from 35 days to 5 hours and from 10 days to 12 hours, respectively. As a result, the entire BHL corpus can now be indexed in less than a day, compared to the 45 days needed for the previous index. Additionally, by significantly reducing computational time, implementing iterative improvements to the index is now achievable.
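
The two stages described above, detecting candidate names and then verifying them against a list of known name-strings, can be illustrated with a toy sketch. This is not gnfinder's actual algorithm or API: the regular expression, the tiny `KNOWN_NAMES` set, and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for the millions of name-strings aggregated by GNA.
KNOWN_NAMES = {"Homo sapiens", "Quercus alba", "Panthera leo"}

# Naive candidate pattern: a capitalized word (genus) followed by a
# lowercase word of two or more letters (species epithet).
CANDIDATE = re.compile(r"\b([A-Z][a-z]+) ([a-z]{2,})\b")


def find_candidates(text):
    """Stage 1: cheap detection that deliberately over-matches."""
    return [f"{genus} {epithet}" for genus, epithet in CANDIDATE.findall(text)]


def verify(candidates, known=KNOWN_NAMES):
    """Stage 2: keep only candidates present in the reference list."""
    return [name for name in candidates if name in known]


page = "The white oak, Quercus alba, grows here. Lions (Panthera leo) do not."
candidates = find_candidates(page)
verified = verify(candidates)
print(candidates)  # includes the false positive "The white"
print(verified)    # only the names found in KNOWN_NAMES
```

Splitting the work this way lets the detection step stay fast and permissive, since false positives such as "The white" are discarded by the verification step.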

July 21, 2020 by Grace Costantino
BHL News, Blog Reel, Tech Updates

Additions to Text Exports Coming Soon

The BHL website was recently updated to add new fields to its downloadable content. The TSV Data Exports are being updated to mirror this change. Please review these changes if you rely on the field order, rather than the field names, of the TSV file.
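
To see why relying on field order is fragile, consider this minimal sketch. The column names and sample rows are hypothetical and do not reflect the actual BHL export schema; the point is only that a lookup by header name survives an added column while a lookup by position does not.

```python
import csv
import io

# Hypothetical before/after exports: the "after" version gains a Volume column.
old_export = "ItemID\tTitleID\tYear\n101\t7\t1890\n"
new_export = "ItemID\tTitleID\tVolume\tYear\n101\t7\tv.2\t1890\n"


def years_by_name(tsv_text):
    """Look up the Year column by its header name."""
    # QUOTE_NONE treats quote characters as ordinary text, which is often
    # what a TSV file intends (an assumption about the export format).
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [row["Year"] for row in reader]


def years_by_position(tsv_text, index=2):
    """Look up the third column by position, ignoring the header."""
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    next(reader)  # skip the header row
    return [row[index] for row in reader]


print(years_by_name(old_export), years_by_name(new_export))
print(years_by_position(old_export), years_by_position(new_export))
```

In this sketch the name-based reader returns the year for both files, while the position-based reader silently returns the volume once the new column is inserted.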

August 27, 2019 by Joel Richard
BHL News, Blog Reel, Tech Updates

BHL Journal Articles Are Now Discoverable via Unpaywall

Unpaywall finds (legally) open access versions of paywalled literature. Thanks to the work of Richard Orr, Unpaywall’s Lead Developer, BHL is now one of the sources indexed in Unpaywall’s database. As of this week, 43,000 journal articles on the BHL website are discoverable via Unpaywall.

August 16, 2019 by Nicole Kearney and Roderic D. M. Page

Tech Updates

Keep up with all the latest technical development news from the Biodiversity Heritage Library, including announcements of new features and improvements to library services, with our Tech Blog.

About BHL

The Biodiversity Heritage Library (BHL) is the world’s largest open access digital library for biodiversity literature and archives. Headquartered at the Smithsonian Libraries and Archives in Washington, D.C., BHL operates as a worldwide consortium of natural history, botanical, research, and national libraries working together to digitize the natural history literature held in their collections and make it freely available for open access as part of a global “biodiversity community.”
