An evaluation of taxonomic name finding

Starting this past June, BHL worked with Qin Wei, a Ph.D. student in Library and Information Science at the University of Illinois Urbana-Champaign, to evaluate the taxonomic name finding software and algorithms used to identify scientific names throughout the BHL corpus. This work led to some interesting findings, which were reported this week in a poster and an oral presentation at the Biodiversity Information Standards (TDWG) 2008 conference in Fremantle, Australia.

View Presentation

An evaluation of taxonomic name finding & next steps in Biodiversity Heritage Library (BHL) developments

Methodology

Scholarly volunteers manually identified scientific names on a random sample of 392 pages in BHL (0.01% of the BHL corpus at the time of the study).
Those names were then compared against the OCR text, and against the output of two name finding algorithms (TaxonFinder & FAT).
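
The comparison step can be illustrated with a short Python sketch. The post does not say exactly how matches were counted, so this assumes exact string matching on the unique names of a single page; the function name and the sample names are hypothetical.

```python
def precision_recall(gold_names, found_names):
    """Compare an algorithm's output with the manually identified names."""
    gold = set(gold_names)      # names marked by a volunteer
    found = set(found_names)    # names returned by the algorithm
    true_positives = len(gold & found)
    # Precision: share of returned names that are real names.
    precision = true_positives / len(found) if found else 0.0
    # Recall: share of real names that the algorithm returned.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical page, not taken from the study data.
gold = ["Quercus alba", "Acer rubrum", "Pinus strobus", "Ulmus americana"]
found = ["Quercus alba", "Acer rubrum", "North America"]

p, r = precision_recall(gold, found)
print(f"precision={p:.2%} recall={r:.2%}")
```

Counting every occurrence rather than unique names, or accepting near matches, would give different figures, so the choice of matching rule matters when comparing results.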

Characteristics of the sample

Number of Pages: 392
Average Number of Words per Page: 446.8
Average Number of Names per Page: 7.7
Total Number of Names: 3003
Total Number of Unique Names: 2610

OCR Errors

Of the 3,003 names, 1,056 were incorrectly transcribed by OCR, for an error rate of 35.16%.
Top OCR errors
1 Insert Space
2 Omit Space
3 e->c
4 u->I
5 u->n
6 i->l
7 c->e
8 n->v
9 l->i
10 r->i
11 u->ii
12 h->l
13 h->ii
14 e->o
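
A confusion list like this one could be turned into correction rules. The Python sketch below is only an illustration, not part of the study: it assumes each pair reads as "true character -> OCR output", leaves the two space errors out, and checks candidates against a hypothetical list of known names.

```python
# Character confusions from the list above, read as (true, OCR output).
# That reading of the arrows is an assumption.
CONFUSIONS = [
    ("e", "c"), ("u", "I"), ("u", "n"), ("i", "l"), ("c", "e"), ("n", "v"),
    ("l", "i"), ("r", "i"), ("u", "ii"), ("h", "l"), ("h", "ii"), ("e", "o"),
]

def single_edit_candidates(ocr_text):
    """Return spellings obtained by undoing exactly one confusion."""
    candidates = set()
    for true_chars, ocr_chars in CONFUSIONS:
        start = ocr_text.find(ocr_chars)
        while start != -1:
            end = start + len(ocr_chars)
            candidates.add(ocr_text[:start] + true_chars + ocr_text[end:])
            start = ocr_text.find(ocr_chars, start + 1)
    return candidates

def correct(ocr_text, known_names):
    """Return known names reachable from ocr_text by undoing one confusion."""
    return sorted(single_edit_candidates(ocr_text) & set(known_names))

known = {"Quercus alba", "Acer rubrum"}  # hypothetical reference list
print(correct("Qucrcus alba", known))    # the OCR 'c' is read back as 'e'
```

This handles only one error per name. Names with several OCR errors would need the rules applied repeatedly, which quickly multiplies the number of candidates.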

Performance of the algorithms

TaxonFinder
- Excluding names with OCR errors
  - Precision 40.32%
  - Recall 36.62%
  - F-score 38.47%
- Including names with OCR errors
  - Precision 43.77%
  - Recall 25.82%
  - F-score 34.80%
FAT
- Excluding names with OCR errors
  - Precision 28.20%
  - Recall 23.34%
  - F-score 25.77%
- Including names with OCR errors
  - Precision 32.25%
  - Recall 17.21%
  - F-score 24.73%
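
The post does not state which F-score formula was used. As a cross-check, the Python sketch below recomputes each score from the precision and recall listed above in two ways: as the usual F1 harmonic mean and as a plain arithmetic mean. The reported figures appear to agree with the arithmetic mean to within rounding. The harmonic mean comes out lower in each case, by up to about 2.3 percentage points.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (the usual F1 definition)."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F-score) in percent, copied from the list above.
reported = {
    "TaxonFinder, excluding OCR errors": (40.32, 36.62, 38.47),
    "TaxonFinder, including OCR errors": (43.77, 25.82, 34.80),
    "FAT, excluding OCR errors": (28.20, 23.34, 25.77),
    "FAT, including OCR errors": (32.25, 17.21, 24.73),
}

for label, (p, r, f_reported) in reported.items():
    print(f"{label}: reported {f_reported:.2f}, "
          f"harmonic mean {f1(p, r):.2f}, arithmetic mean {(p + r) / 2:.2f}")
```

Either way the ranking is the same: TaxonFinder scores higher than FAT, and both do worse when names with OCR errors are included.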

Considerations

Improving OCR software is out of current scope for BHL
- investigations into Tesseract may be worthwhile
Rekeying is too expensive and will not scale

Recommendations

Enhance “fuzzy” retrieval in algorithms
- Exception rules to overcome OCR errors
More work needed in this space
- More evaluations & experiments
- Robust training sets
  - reCAPTCHA for names?
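
One way to picture "fuzzy" retrieval is approximate string matching against a list of known names. The Python sketch below uses the standard library's `difflib.get_close_matches`. It is not how TaxonFinder or FAT work, and the name list, the function name, and the 0.8 similarity cutoff are assumptions chosen for illustration.

```python
import difflib

def fuzzy_lookup(candidate, known_names, cutoff=0.8):
    """Return the closest known name to an OCR'd candidate, or None."""
    matches = difflib.get_close_matches(candidate, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

known_names = ["Quercus alba", "Acer rubrum", "Pinus strobus"]  # hypothetical

# Two OCR errors (e->c and l->i) still resolve to the intended name.
print(fuzzy_lookup("Qucrcus aiba", known_names))
# Ordinary text that is not close to any known name is rejected.
print(fuzzy_lookup("North America", known_names))
```

A lower cutoff recovers more damaged names but also lets more false matches through, so the threshold would need tuning against an evaluation set such as the one described here.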

For additional information about the study:

Project wiki, including downloadable datasets
TDWG Presentation
TDWG Poster (page size)

Please post questions about this study as comments below rather than by e-mail, so that the answers are visible to a wider audience.

October 20, 2008

Written by Chris Freeland

Chris Freeland served as the BHL Technical Director from 2006-2012. He is currently the Director of the Open Libraries program at Internet Archive. In this capacity he works with libraries & publishers to digitize their collections, working towards the Archive’s mission of providing “universal access to all knowledge.”

About BHL

The Biodiversity Heritage Library (BHL) is the world’s largest open access digital library for biodiversity literature and archives. Headquartered at the Smithsonian Libraries and Archives in Washington, D.C., BHL operates as a worldwide consortium of natural history, botanical, research, and national libraries working together to digitize the natural history literature held in their collections and make it freely available for open access as part of a global “biodiversity community.”
