Biodiversity Heritage Library - Program news and collection highlights from BHL

All posts by Joel Richard

Blog Reel, Tech Updates

OCR Improvements: An Early Analysis

Optical character recognition (OCR) plays a critical part in BHL’s contributions to the scientific community. OCR in and of itself is a remarkable achievement, converting images of typewritten text to computer-readable text with “pretty good” accuracy. OCR on handwritten text is an even greater challenge to address and is beyond the scope of the improvements discussed here. The scientific work that BHL supports demands the best accuracy that we can provide using available tools, and let’s be honest, available budgets.

Recently, our colleagues at the Internet Archive made the transition away from the ABBYY FineReader OCR software to the Tesseract Open Source OCR engine. Over the past year or more, the OCR team at the Internet Archive has adapted and fine-tuned Tesseract to their workflows. Our first impression is that Tesseract OCR is more than “pretty good” in its ability to identify text from the page images provided to it.

The downside to this is that the Internet Archive has rightfully chosen to not re-process all existing text content through the Tesseract OCR engine. This is a prohibitively expensive and time-consuming prospect given that they have 35 million text-based items and reprocessing them would take several years and use up resources that could otherwise be used for gathering new content.

However, in the interests of supporting the efforts of the BHL community, the BHL Tech Team is working with our Internet Archive partner to reprocess some of BHL’s oldest content with the newest available version of Tesseract OCR. We are currently in a testing phase, and this blog post details some of our early results.

July 19, 2022 by Joel Richard
BHL News, Blog Reel, Tech Updates

New Article PDF Content Available

The BHL Tech Team is pleased to announce a new form of content available in BHL: Article PDFs. This may not sound like anything new; after all, we have had a tool to download PDF content for some time. However, this update changes both how the PDFs are created and maintained and how BHL is viewed by content aggregators on the internet, most notably Unpaywall.

March 14, 2022 by Joel Richard
BHL News, Blog Reel, Tech Updates

Additions to Text Exports Coming Soon

The BHL website was recently updated for new fields to download content. The TSV Data Exports are being updated to mirror this change. Please review these changes if you rely on the field order instead of the field names of the TSV file.
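
For readers who consume the TSV exports in code, one way to avoid depending on field order is to look columns up by header name. The following is a minimal Python sketch, not BHL's own tooling; the column names ItemID, Title, and Year and the sample rows are hypothetical and chosen only for illustration.

```python
import csv
import io


def read_rows_by_name(tsv_text):
    """Parse TSV text into a list of dicts keyed by the header row's field names."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary data
    )
    return list(reader)


# Hypothetical sample data; real export files have their own column names.
sample = "ItemID\tTitle\tYear\n1\tFlora\t1850\n"

rows = read_rows_by_name(sample)
print(rows[0]["Title"])  # prints "Flora"
```

Because each value is fetched by name, the lookup keeps working if a new column is added or the columns are reordered, as long as the header names themselves stay the same.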

August 27, 2019 by Joel Richard
BHL News, Blog Reel, Tech Updates

BHL Participates in the Global Names Workshop

The Global Names Project held a workshop on 17-19 June 2019 on the campus of the University of Illinois at Urbana-Champaign. The workshop was titled “Scientific names indexing and data mobilization of Biodiversity Heritage Library using tools from Global Names project” and was hosted by the Species File Group at the Illinois Natural History Survey. Eighteen people attended, representing a variety of organizations interested in BHL content: Global Names Architecture, iDigBio, TaxonWorks, UIUC Species File Group, the Illinois Library, Encyclopedia of Life, the DINA Project, the Catalogue of Life, GBIF, Species File Group Argentina, the HathiTrust Research Center, and Global Biotic Interactions.

July 15, 2019 by Joel Richard, Matt Yoder, Deborah Paul, Jorrit Poelen and Mike Lichtenberg
BHL News, Blog Reel, Tech Updates

BHL Moves to HTTPS

HTTPS? What does it mean? HTTP is the language that your browser uses to communicate with BHL, and the S stands for Secure: encrypted, unreadable, or at least much, much harder to read. The web is moving to encrypted connections across the board. In 2014 Google announced that its page rank algorithm, which decides the order of your google.com search results, would rank insecure pages slightly lower than secure pages. From security to rankings, encrypted connections are better for everyone.

October 3, 2017 by Joel Richard
BHL News, Blog Reel, Tech Updates

Information about Upcoming Changes to BHL API

The BHL API will be updated on 25 July 2016 to support changes to the BHL site. These changes make it possible to identify additional Contributors for Items and Parts of items. Some of the changes to the API may affect your existing processes. The Contributor and ContributorID elements in the result sets of API methods that return “Part” information will move: ContributorID will be included as a PartIdentifier in the Identifiers list, and Contributor will be included in a new Contributors list. These changes are being made to accommodate more than one contributor per part.
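
For client code that reads contributor information from “Part” results, one defensive approach is to accept both the old and the new shape during the transition. The following is a rough Python sketch that assumes the result has already been parsed into a dictionary. The Contributor and Contributors names come from the announcement above, but the ContributorName key inside each list entry is a hypothetical placeholder, so check the real response before relying on it.

```python
def contributor_names(part):
    """Return the contributor names for a parsed Part record as a list.

    Handles two shapes:
      * new: a "Contributors" list (the per-entry key "ContributorName"
        is an assumption made for this sketch, not a confirmed API field)
      * old: a single "Contributor" element holding one name
    """
    contributors = part.get("Contributors")
    if contributors:
        return [
            entry["ContributorName"]
            for entry in contributors
            if entry.get("ContributorName")
        ]

    legacy = part.get("Contributor")
    return [legacy] if legacy else []


# Hypothetical records, for illustration only.
old_style = {"Contributor": "Example Library"}
new_style = {
    "Contributors": [
        {"ContributorName": "Example Library"},
        {"ContributorName": "Another Library"},
    ]
}

print(contributor_names(old_style))
print(contributor_names(new_style))
```

Returning a list in both cases means downstream code does not have to care which version of the API produced the record.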
July 12, 2016 by Joel Richard

About BHL

The Biodiversity Heritage Library (BHL) is the world’s largest open access digital library for biodiversity literature and archives. Headquartered at the Smithsonian Libraries and Archives in Washington, D.C., BHL operates as a worldwide consortium of natural history, botanical, research, and national libraries working together to digitize the natural history literature held in their collections and make it freely available for open access as part of a global “biodiversity community.”
