BHL Adds Functionality Allowing Partners to Upload Crowdsourced Transcriptions of Digitized Archival Materials

The Biodiversity Heritage Library (BHL) has added functionality to allow BHL Partners to upload transcriptions in place of the automatically-generated OCR (Optical Character Recognition) for archival materials digitized in BHL. This functionality supports transcriptions generated as part of Partner crowdsourcing projects on Smithsonian Transcription Center, DigiVol, and From the Page.

Optical Character Recognition (OCR), also called text recognition, converts the text in scanned page images into machine-readable characters, which makes the text of a document searchable and available for data processing. Handwritten archival materials like correspondence and field notes are notoriously problematic for OCR software, and the resulting poor OCR output significantly hampers full-text searching of these materials.

Screenshot of a digital library book viewer with gibberish for OCR.

Example of the poor automatically-generated OCR output for handwritten correspondence. Spencer Fullerton Baird and John Torrey correspondence, 1851-1860. Contributed in BHL from the LuEsther T. Mertz Library of The New York Botanical Garden.

Crowdsourcing the transcription of archival materials has become a popular way to generate machine-readable text that enables searching and discoverability. Several BHL Partners are using crowdsourcing platforms (e.g. Smithsonian Transcription Center, DigiVol, and From the Page) to transcribe field notes, correspondence, and other archival materials that they have digitized in BHL.

Screenshot of a project in the Smithsonian Transcription Center.

Example of a field notebook being transcribed in the Smithsonian Transcription Center. This notebook, Brasil 1979, Amazonia #3, by Cleofé Calderón, is also available in BHL from the Smithsonian Institution Archives.

With this new functionality, Partners can now upload these transcriptions in place of the automatically-generated OCR, making the items full-text searchable and enabling BHL's taxonomic name recognition software to index scientific names within their pages. Because the transcribed text can be viewed alongside the digitized page image, users can also more easily read materials with difficult-to-decipher handwriting. As a result, researchers and the public can more easily explore these valuable primary source materials and find specific information within their pages.

Screenshot of a digital library book viewer with readable OCR.

The example above, with the OCR replaced by a crowdsourced transcription generated as part of The John Torrey Papers project from The New York Botanical Garden on From the Page. Spencer Fullerton Baird and John Torrey correspondence, 1851-1860. Contributed in BHL from the LuEsther T. Mertz Library of The New York Botanical Garden.

Screenshot of the BHL book viewer with a digitized field book and the transcribed text shown alongside the page in place of the OCR.

Since the transcribed text can be viewed alongside the digitized page image in the BHL book viewer, users can also more easily read archival materials. William Healey Dall’s Field Notes, 1871. Transcription generated on the Smithsonian Transcription Center. Contributed in BHL from the Smithsonian Institution Archives.

Screenshot of a book viewer with digitized archival materials and full-text search for "Yellow Palm Warbler".

Crowdsourced transcriptions allow digitized archival materials in BHL to be full-text searchable, as shown in this example searching for “Yellow Palm Warbler” within William Brewster’s 1903 journal. Transcription generated as part of the Ernst Mayr Library of Harvard University project on DigiVol. Contributed in BHL from the Ernst Mayr Library of the Museum of Comparative Zoology at Harvard University.

Screenshot of a book viewer with scientific name indexed on the page of an archival notebook.

Crowdsourced transcriptions allow BHL’s taxonomic name recognition software to index scientific names within the pages of digitized archival materials, as seen in this example in which Catharacta antarctica is indexed on a page within the first volume of the Ornithological Field Diaries of A. Graham Brown. Transcription generated as part of a project from BHL Australia and Museums Victoria on DigiVol. Contributed in BHL from Museums Victoria.

Participating Partners have begun uploading transcriptions to BHL. To date, transcriptions have been uploaded from Partner crowdsourcing projects with BHL Australia, Ernst Mayr Library of Harvard University, The New York Botanical Garden, and Smithsonian Institution Archives. This is an ongoing process, and more transcriptions will be uploaded to the Library over time.

Interested in transcribing archival materials? Several BHL Partners have active transcription projects on various crowdsourcing platforms. Follow the links below to explore the opportunities and get involved:

Written by

Grace Costantino served as the Outreach and Communication Manager for the Biodiversity Heritage Library from 2014 to 2021. In this capacity, she developed and managed BHL's communication strategy, oversaw social media initiatives, and engaged with the public to excite audiences about the wealth of biodiversity heritage available in BHL. Prior to her role as Outreach and Communication Manager, Grace served as the Digital Collections Librarian for Smithsonian Libraries and as the Program Manager for BHL.