The Power of Community Science: How Smithsonian Volunpeers Transform Scientific Field Notes

Last month, Smithsonian Libraries and Archives (SLA), Smithsonian Transcription Center (STC), and the Biodiversity Heritage Library (BHL) celebrated a significant milestone – technical staff worked collaboratively to integrate over 43,000 pages of transcription materials from STC into BHL. An additional 151,362 scientific name access points have now been added to the BHL search index for SLA archival field notes. These transcriptions enhance BHL’s full-text search, enable taxonomic name recognition, improve accessibility for vision-impaired users, and support climate research.

The Smithsonian Field Books Collection

Yellow notebook with handwritten text and hand-drawn map

Dall, William Healey. (1895). Alaska Coal Fields, Bogosloff Volcano, Corral Hollow, California, 1895 (Vol. 1, pp. 81 and 82).

The Smithsonian Field Books Collection is a set of primary source archival records selected from the Smithsonian Libraries and Archives (SLA). The material spans from the late nineteenth century, and the first comprehensive biological survey of the continental United States, to the most recently accessioned materials at the Smithsonian Institution Archives.

The collection includes personal records of naturalists and scientists such as William Healey Dall (1845 – 1927) and Cleofe Calderon (1929 – 2007) at work around the world, as well as expedition records such as those of the Western Union Telegraph Expedition (1865 – 1867) of Russian America and the United States Exploring Expedition (1838 – 1842) of the Pacific Ocean.

Naturalists and scientists in the field recorded their firsthand observations and data in a wide variety of forms, including diaries and journals, hand-drawn maps and tables of data, observation logs and specimen catalogs, correspondence and reports, manuscripts, sketches, photographs, and even audio recordings.

Recognizing the Value of Accurate Transcription

Mining the rich information embedded in these field notes depends on accurate transcription. The majority of the field notes are handwritten – even the most recent ones. Digital surrogates provide high-resolution images but offer nothing beyond visual access. Transcription, however, opens up this material to full-text searching, pattern recognition, visual accessibility aids, and more.

Beyond the handwriting itself, the field notes also contain non-textual information that can only be captured through transcription. Examples include ornithologist Martin Moynihan’s notations of bird song in remote regions of Central America and Charles Doolittle Walcott’s sketches and diagrams of sedimentary stratification where his fossil specimens were found in the canyons of the American Southwest.

Handwritten notes with a drawing of a bird

Moynihan, M. (1963). 1962-1965 Andean Birds Mixed Flocks, Colombia, (4 of 4) (p. 78).

The “Journal kept by Bailey on field trip to Wyoming and New Mexico, March 15-June 1906” by Vernon Orlando Bailey, chief naturalist with the U.S. Department of Agriculture, focuses on techniques for exterminating gray wolves, techniques that would bring the species to near extinction in the continental United States not long afterwards. The transcript includes descriptions of the sketches he included in his notes.

Screenshot of Smithsonian Transcription Center displaying handwritten journal next to transcribed text

Smithsonian Transcription Center image with corresponding transcription, including a text description of an ink drawing of a cow. https://transcription.si.edu/view/6656/EBOtg

The Smithsonian Field Book Project’s initial goal was to catalog these hidden biodiversity research materials, improving discoverability in keeping with the FAIR (Findable, Accessible, Interoperable, Reusable) Data Principles. Doing so quickly resulted in additional researcher demand for more access and usability – first to view the field notes online (digitization) and then to examine them more closely (transcription). Grant funding assisted in a major rapid-capture digitization effort. However, when it came to transcribing the digitized field notes, the Archives simply lacked the capacity to meet the level of researchers’ demand.

Turning to digital volunteers and the just-launched Smithsonian Transcription Center in 2013 changed the transcription equation beyond our best expectations both in volume and in accuracy. We quickly came to recognize these community scientists as collaborators, “volunpeers”, in the effort to advance and disseminate knowledge.

We are far from the end of this journey. More than half the collection remains to be digitized, and over two thousand digitized field notes still need transcription.

Inspired to Transcribe

These archival field notes contain vital historic biodiversity information. By transcribing the handwriting into machine-readable text, volunpeers can help inform current-day scientists and assist with their research on a multitude of topics such as climate change, the extinction crisis, or the spread of invasive species. Transcribing field notes can also take volunpeers on an adventure across time and distance, accompanying the writer on their journey. Having these adventures transcribed as machine-readable text also helps science historians by making these historic documents easier to find, search, and reuse.

Volunpeers collaborate to transcribe the pages of the field journals as accurately as possible. Multiple volunpeers work on each project page, transcribing the hard-to-read handwriting into machine-readable text. Once satisfied that the transcription is as complete as possible, a volunpeer marks the transcribed page as complete. The volunpeers then move on to the next page of the field notes until the project is finished.

Approval workflow for Smithsonian Transcription Center

Review is the second step in the transcription process.

Having many volunpeers working on one project helps ensure the quality of the transcription. What one volunpeer finds illegible, others may be able to read, especially as all the volunpeers working on a particular project become more familiar with the handwriting.
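The transcribe-then-review flow described above can be sketched as a small state machine. This is only an illustration, not the Transcription Center's actual implementation: the status names, the `Page` class, and the assumption that a reviewer can send a page back for more work are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class PageStatus(Enum):
    # Hypothetical status names, chosen for illustration only.
    IN_PROGRESS = "in progress"
    NEEDS_REVIEW = "needs review"
    COMPLETE = "complete"


@dataclass
class Page:
    number: int
    text: str = ""
    status: PageStatus = PageStatus.IN_PROGRESS
    contributors: list[str] = field(default_factory=list)

    def transcribe(self, volunpeer: str, text: str) -> None:
        """Save a volunpeer's latest version of the transcription."""
        if self.status is not PageStatus.IN_PROGRESS:
            raise ValueError("page is not open for transcription")
        self.text = text
        if volunpeer not in self.contributors:
            self.contributors.append(volunpeer)

    def submit_for_review(self) -> None:
        """A volunpeer judges the page as complete as possible."""
        if self.status is not PageStatus.IN_PROGRESS:
            raise ValueError("page is not open for transcription")
        self.status = PageStatus.NEEDS_REVIEW

    def review(self, approved: bool) -> None:
        """Second step: another volunpeer checks the transcription.

        Reopening a rejected page is an assumption of this sketch.
        """
        if self.status is not PageStatus.NEEDS_REVIEW:
            raise ValueError("page is not awaiting review")
        self.status = PageStatus.COMPLETE if approved else PageStatus.IN_PROGRESS


page = Page(number=78)
page.transcribe("volunpeer_a", "Mixed flock seen at [[illegible]]")
page.transcribe("volunpeer_b", "Mixed flock seen at forest edge")
page.submit_for_review()
page.review(approved=True)
print(page.status.value, "with contributions from", page.contributors)
```

Tracking every contributor on the page reflects the point made above: a second or third reader can often resolve what the first found illegible.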

Volunpeers have transcribed hundreds of field journals over the years. One favorite has been the field journals of Vernon Orlando Bailey, a field naturalist who journeyed throughout the U.S. Midwest studying and collecting mammals. Another volunpeer favorite has been the papers of Arctic explorer and naturalist Robert Kennicott. Read more about these volunpeer experiences on the STC blog.

The Power of the Smithsonian Transcription Center

When the first of the field books were added to the Smithsonian Transcription Center (STC), the program was still available only as a beta version. Approximately 450 volunpeers were transcribing on the site (co-author Siobhan Leachman among them), with the first 6,000 completed pages under their belts. Even during this moment of immense energy and fresh connections, it must have been difficult to imagine what STC would become. Today, a little over a decade later, more than 91,000 individual volunpeers have worked together to transcribe and review over 1.4 MILLION (!) pages of historic and scientific collections.

STC is the largest digital volunteering and crowdsourcing program at the Smithsonian Institution, and provides opportunities to engage with and contribute to digitized materials from across the full breadth of content areas represented by its museums, archives, and libraries. Through collaborative transcription and review, Smithsonian staff and digital volunteers work together to ensure that this content is more readable, accessible, and text-searchable across Smithsonian data systems and beyond.

Currently, the most active and popular projects are the Freedmen’s Bureau Transcription Project, a collaboration with the National Museum of African American History and Culture that deepens insight into the Reconstruction period and empowers African American genealogical research, and Project PHaEDRA, a collaboration with the Harvard-Smithsonian Center for Astrophysics that illuminates the work and discoveries of early women computers at the Harvard College Observatory.

If you feel inspired by this data access success story, consider joining the digital volunteer community, or sign up for our newsletter to stay up-to-date on upcoming projects.

Reusing the Liberated Data

The successful integration of transcription materials into BHL makes historical scientific data more accessible and useful. Through the collaborative efforts of dedicated volunpeers and technical staff from the Biodiversity Heritage Library, Smithsonian Libraries and Archives, and the Smithsonian Transcription Center, 151,362 scientific name access points were added to the BHL search index, greatly enhancing search capabilities over the digitized corpus of Smithsonian field notes and archives. Out of the 556 eligible items reviewed, 522 were uploaded, contributing 43,460 pages of improved OCR text. These contributions not only improve BHL’s full-text search and taxonomic name recognition services but also provide better accessibility for vision-impaired users and support ongoing biodiversity and climate change research.
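To put the figures above in proportion, here is a quick back-of-the-envelope calculation. The inputs are the numbers reported in this post; the variable names are ours, and the per-page average assumes that all of the new name access points come from the 43,460 uploaded pages, which is a simplification.

```python
# Figures as reported in this post.
eligible_items = 556
uploaded_items = 522
pages_uploaded = 43_460
name_access_points = 151_362

# Derived ratios (rough averages only).
upload_rate = uploaded_items / eligible_items
pages_per_item = pages_uploaded / uploaded_items
names_per_page = name_access_points / pages_uploaded  # assumes all names come from these pages

print(f"Items uploaded: {upload_rate:.1%}")
print(f"Average pages per uploaded item: {pages_per_item:.1f}")
print(f"Average name access points per page: {names_per_page:.1f}")
```

Under those assumptions, roughly 94% of eligible items were uploaded, at about 83 pages per item and between three and four scientific name access points per page.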

Smithsonian Transcription Data outcomes report

Special thanks go to Mike Lichtenberg, BHL’s Lead Developer and Systems Architect, and Paul Day, Lead Developer at the Smithsonian Transcription Center. As with many data improvement and platform enhancement projects, the requisite technical expertise is pivotal in ensuring the success of our collective efforts!


References and Resources

Dearborn, J., Lichtenberg, M., Richard, J. M., deVeer, J., Trizna, M., & Mika, K. 2023. [Presentation] Unearthing the Past for a Sustainable Future: Extracting and Transforming Data in the Biodiversity Heritage Library for Climate Action. Presented virtually at TDWG, Tasmania, Australia 2023. https://www.youtube.com/watch?v=8sGssyrpuJw

Trizna, M., & Dearborn, J. June 2023. [Poster] AI Models Are Getting Better at Reading Handwriting, but How Can We Find Handwritten Text to Begin With?. 7th Annual Digital Data Conference, Leveraging Digital Data for Conservation, Ecology, Systematics, and Novel Biodiversity Research, Tempe, Arizona, United States of America. https://doi.org/10.25573/data.23523495.v1

Dearborn, J., & Mika, K. June 5, 2022. [Poster] Extracting Expedition Log Data Found in the Biodiversity Heritage Library. Through the Door and Through the Web: Releasing the Power of Natural History Collections Onsite and Online, Edinburgh, Scotland, United Kingdom: Society for the Preservation of Natural History Collections (SPNHC). https://doi.org/10.5281/zenodo.6593457

Ricc Ferrante has a keen passion for making hidden primary source collections accessible and usable online. From the inception of the Field Book Project in 2009 and the development of the Smithsonian Transcription Center in 2013, to the contribution of these resources to the Biodiversity Heritage Library, he continues to look for new opportunities to advance the accessibility of these and similar collection material to the global research community. He is the Associate Director of Information Systems and Digital Lifecycle.

Siobhan Leachman is a Smithsonian Transcription volunpeer who has a special interest in transcribing field notes. She is also a BHL volunteer and Wikimedian working on improving the presence and use of Biodiversity Heritage Library content, which includes the Smithsonian field journals, in Wikipedia and Wikidata.

Emily Cain specializes in building community around and connections to museum collections and information. As Smithsonian Transcription Center's Community Manager, she oversees user needs, site content, public engagement and programming, and currently serves as Project Manager for the comprehensive rebrand and redesign of the program and website. Emily holds a BA in Anthropology from Marshall University, and an MA in Museum Studies from George Washington University.

Mike Trizna is a data scientist at the Smithsonian Institution in the OCIO Data Science Lab, where he collaborates with researchers and collections staff across the many units of the Smithsonian to pilot responsible applications of AI to their current work. Before joining the Data Science Lab, Mike worked for 12 years at the Smithsonian National Museum of Natural History as a bioinformatician.

JJ Dearborn joined the Biodiversity Heritage Library as Data Manager in 2022 and works to open up BHL data to the larger biodiversity community and the world. As a longtime advocate for the free-culture movement, she has worked on open access projects for the Peabody Essex Museum, Harvard University’s Department of Organismic and Evolutionary Biology, the Smithsonian Museum of Natural History, Harvard-Smithsonian Center for Astrophysics, the City of Boston, and the State of Massachusetts.