As I wrap up a short 10-week virtual internship with the Biodiversity Heritage Library, I believe that I will look back on this experience as one of those pivotal opportunities that shape the course of a life.
The internship was made possible by the LIS Education and Data Science for the National Digital Platform (LEADS-4-NDP) fellowship program through Drexel University and funded by the Institute of Museum and Library Services, in collaboration with BHL. LEADS-4-NDP has a mission to improve library services by providing LIS (Library and Information Science) educators and researchers with data science skills.
BHL offered an ideal data science challenge for a future LIS scholar like me to explore: a huge dataset of documents spanning five centuries, along with an engaged community of users and researchers interested in extracting new knowledge from these fascinating texts!
In particular, the summer internship focused on exploring methods of identifying references to geographic places within the text of BHL collections, along with ways to visualize these references. This project envisions eventual new library services, such as a browsable map interface where BHL users could locate complex information spatially, and the generation of new biodiversity data for research on the distribution of species over time.
As a PhD student at the University of Arizona’s School of Information, I am currently working on a dissertation extracting information from astronomy literature. Although I study data curation from a social science perspective and am particularly focused on astronomy, I wanted to enhance my technical skills in order to contribute to the evolution and relevance of LIS, and to accomplish my own research. LEADS-4-NDP and BHL provided a crash course and hands-on learning opportunity, and by participating in this program I was able to pull my eyes from the cosmos for a while and focus on the incredible biodiversity here on Earth.
Through this internship, I attended a two-day data science bootcamp, which was a priceless opportunity to quickly learn valuable techniques with a cohort of 9 other LEADS Fellows. I was also introduced to relevant researchers and projects associated with BHL. After a summer of investigation and experimentation, I created a report with recommendations for a full-scale geoparsing effort based on strategies and resources I examined. I also used Python programming tools to map and compare place names extracted from BHL text using several different human and automated methods. I evaluated and annotated a test corpus of BHL documents that could be used to develop customized Named Entity Recognition for a much larger set of documents. And I configured a software environment for future human annotators.
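As a rough illustration of what comparing place names from different extraction methods can look like, here is a minimal Python sketch. It is not the code from the internship: the function names, the normalization rule, and the sample place names are all hypothetical, and it assumes one list is treated as the human-annotated reference.

```python
def normalize(name: str) -> str:
    """Collapse whitespace and ignore case so 'Rio  Grande' matches 'rio grande'."""
    return " ".join(name.split()).casefold()


def compare_place_names(reference, candidate):
    """Compare a candidate list of place names against a reference list.

    Both arguments are iterables of strings. The reference is assumed to be
    the human-annotated list; the candidate is an automated method's output.
    """
    ref = {normalize(n) for n in reference}
    cand = {normalize(n) for n in candidate}
    shared = ref & cand

    # Precision: share of candidate names that the reference also contains.
    precision = len(shared) / len(cand) if cand else 0.0
    # Recall: share of reference names that the candidate found.
    recall = len(shared) / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if shared else 0.0

    return {
        "shared": shared,
        "missed": ref - cand,      # in the reference only
        "spurious": cand - ref,    # in the candidate only
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical sample data, invented for this example.
human = ["Tucson", "Sonoran Desert", "Rio Grande", "Arizona"]
automated = ["tucson", "Arizona", "Santa Cruz", "Rio  Grande"]

result = compare_place_names(human, automated)
print(sorted(result["missed"]), sorted(result["spurious"]))
```

Exact string matching after normalization is a deliberately simple choice. Historical texts often contain spelling variants and OCR errors, so a real comparison would likely need fuzzy matching or a gazetteer lookup on top of this.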
Although this was a virtual internship and I did not visit the Smithsonian in person – unfortunately sweltering in Tucson, Arizona, instead – my supervisor Carolyn Sheffield was extremely supportive. I looked forward to our weekly Skype conversations, where we shared ideas and brainstormed. Carolyn has been an ideal mentor, making connections with other projects and frequently sending along helpful information. I am grateful to Carolyn and to Martin Kalfatovic, along with the LEADS faculty and other Fellows, as well as everyone who kindly met with me this summer.
Overall, the internship has shown me that this is an exciting time to be involved in library science, as our knowledge of the universe and the technology to research it rapidly expand. It is fascinating to view printed words as a source of new insight; seen that way, the library itself is an observatory for collecting data and deciphering the mysteries of the natural world!