The paper and subsequent commentary have accelerated nascent efforts at macroscopic, algorithmic questioning of large historical textual data sets. Can similar methods be applied fruitfully to the BHL corpus?
Rod Page, bioinformatician and developer of BioStor, has already offered suggestive evidence that they can. He tracked a small sample of species names for the same organism through time in BHL texts and plotted the number of citations for each. The resulting graph may be a visual representation of scientific debate and usage. Many other uses are possible, including:
- Co-occurrence of place names with species
- Frequency of co-occurrence of species names, especially with key words such as host, prey, predator, and symbiont
- Tracking trends in zoological and botanical research by following methodological terminology through time
- Identification of taxonomically significant “events” in the literature based on textual cues
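As a rough illustration of the first two kinds of analysis, the Python sketch below counts how many pages mention each name per year and how often a name shares a sentence with a key word. It is not Rod Page's method or any BHL tool. The function names, the `PAGES` sample (invented text, not taken from BHL), and the simple substring matching and sentence splitting are all assumptions made for the example; real OCR text would need proper name recognition and more careful sentence handling.

```python
import re
from collections import Counter, defaultdict

# Invented toy data: (publication year, page text). Not real BHL content.
PAGES = [
    (1850, "Felis leo is the chief predator of the plains. Its prey includes antelope."),
    (1851, "Specimens of Felis leo were collected near the river."),
    (1930, "Panthera leo, formerly Felis leo, is a predator of large ungulates."),
    (1931, "Panthera leo was observed. The tick is a parasite, and the lion is its host."),
]


def mentions_by_year(pages, names):
    """For each name, count the pages per year that mention it (case-insensitive)."""
    counts = defaultdict(Counter)
    for year, text in pages:
        lowered = text.lower()
        for name in names:
            if name.lower() in lowered:
                counts[name][year] += 1
    return counts


def keyword_cooccurrence(pages, names, keywords):
    """Count sentences in which a name and a key word appear together."""
    pairs = Counter()
    for _year, text in pages:
        # Naive split on sentence-ending punctuation followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            lowered = sentence.lower()
            words = set(re.findall(r"[a-z]+", lowered))
            for name in names:
                if name.lower() not in lowered:
                    continue
                for keyword in keywords:
                    if keyword in words:
                        pairs[(name, keyword)] += 1
    return pairs


NAMES = ["Felis leo", "Panthera leo"]
by_year = mentions_by_year(PAGES, NAMES)
pairs = keyword_cooccurrence(PAGES, NAMES, ["host", "prey", "predator"])

for name in NAMES:
    print(name, sorted(by_year[name].items()))
print(dict(pairs))
```

Plotting the per-year counts for each name would give a graph of the kind described above, and the pair counts are a first approximation of the co-occurrence frequencies in the list.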
Much of the follow-on activity to the Science paper is occurring in the “Digging into the Data” program. Accordingly, on May 9, the BHL made its data available to researchers in that program.
BHL Director Tom Garnett will attend the “Digging into the Data” conference in June, where speakers, including the authors of the Science article, will address the issues and opportunities in data mining of large textual corpora. With suitable partners, we may be able to seek NSF or Google funding for the unique use case that our growing text corpus presents. The framework for a proposal would be a team of biologists and a team of computer scientists posing research questions about the BHL corpus that are amenable to algorithmic investigation. Even if funding is not forthcoming, third-party researchers who use the BHL corpus to produce scientifically or historically salient results will enhance the value and use of the BHL, which could lead to further collaborations.