An evaluation of taxonomic name finding
Starting this past June, BHL worked with Qin Wei, a Ph.D. student in Library and Information Science at the University of Illinois Urbana-Champaign, to evaluate the taxonomic name finding software and algorithms used to identify scientific names throughout the BHL corpus. This work led to some interesting findings, which were reported this week in a poster and an oral presentation at the Biodiversity Information Standards (TDWG) 2008 conference in Fremantle, Australia.
Methodology
- Scholarly volunteers manually identified scientific names on a random sample of 392 pages in BHL (0.01% of the BHL corpus at the time of the study).
- The manually identified names were compared first against the OCR text, then against the output of two name finding algorithms (TaxonFinder and FAT).
Characteristics of the sample
- Number of Pages: 392
- Average Number of Words per Page: 446.8
- Average Number of Names per Page: 7.7
- Total Number of Names: 3,003
- Total Number of Unique Names: 2,610
OCR Errors
- Of the 3,003 names, 1,056 were incorrectly transcribed by OCR, for an error rate of 35.16%
- Top OCR errors
1 Insert Space
2 Omit Space
3 e->c
4 u->I
5 u->n
6 i->l
7 c->e
8 n->v
9 l->i
10 r->i
11 u->ii
12 h->l
13 h->ii
14 e->o
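The error rate above is the share of manually identified names that did not survive OCR intact. A minimal sketch of that check, assuming a name counts as correctly transcribed when it appears verbatim in the page's OCR text (the study's exact matching rule is not stated here, and the function name and sample pages are hypothetical):

```python
def ocr_error_rate(pages):
    """Count manually identified names that are missing from the OCR text.

    pages: iterable of (ocr_text, gold_names) pairs, one pair per page.
    Assumption: a name is 'correct' if it occurs verbatim in the OCR text.
    Returns (names_with_errors, total_names, error_rate).
    """
    total = 0
    errors = 0
    for ocr_text, gold_names in pages:
        for name in gold_names:
            total += 1
            if name not in ocr_text:
                errors += 1
    rate = errors / total if total else 0.0
    return errors, total, rate


# Hypothetical sample pages, not data from the study.
sample = [
    ("Specimens of Quercus alba were collected", ["Quercus alba"]),
    ("The genus Qucrcus is widespread", ["Quercus"]),  # e->c OCR error
]
errors, total, rate = ocr_error_rate(sample)
print(f"{errors}/{total} names mis-transcribed ({rate:.2%})")  # 1/2 (50.00%)

# The figures reported above: 1,056 of 3,003 names.
print(f"{1056 / 3003:.2%}")  # 35.16%
```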
Performance of the algorithms
- TaxonFinder
  - Excluding names with OCR errors
    - Precision 40.32%
    - Recall 36.62%
    - F-score 38.47%
  - Including names with OCR errors
    - Precision 43.77%
    - Recall 25.82%
    - F-score 34.80%
- FAT
  - Excluding names with OCR errors
    - Precision 28.20%
    - Recall 23.34%
    - F-score 25.77%
  - Including names with OCR errors
    - Precision 32.25%
    - Recall 17.21%
    - F-score 24.73%
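For readers less familiar with these measures: precision is the share of names returned by an algorithm that are correct, and recall is the share of manually identified names that the algorithm found. The sketch below computes both from sets of names, which is an assumption (the study may have counted every occurrence rather than unique names), and the sample names are hypothetical. The F-score is conventionally the harmonic mean of precision and recall. The F-score values reported above coincide with the arithmetic mean of the two (for example, (40.32 + 36.62) / 2 = 38.47), so the sketch prints both for comparison.

```python
def precision_recall(found, gold):
    """Precision and recall of an algorithm's names against a gold standard.

    Assumption: names are compared as sets of exact strings.
    """
    found, gold = set(found), set(gold)
    true_positives = len(found & gold)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def f_score(precision, recall):
    """Conventional F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical names, not data from the study.
gold = {"Quercus alba", "Acer rubrum", "Pinus strobus", "Betula nigra"}
found = {"Quercus alba", "Acer rubrum", "New York"}
p, r = precision_recall(found, gold)
print(f"precision={p:.2%} recall={r:.2%} F1={f_score(p, r):.2%}")

# TaxonFinder, excluding names with OCR errors (percentages from above).
print(f"harmonic mean:   {f_score(40.32, 36.62):.2f}")   # 38.38
print(f"arithmetic mean: {(40.32 + 36.62) / 2:.2f}")     # 38.47
```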
Considerations
- Improving OCR software is currently out of scope for BHL
  - Investigations into Tesseract may be worthwhile
- Rekeying is too expensive and will not scale
Recommendations
- Enhance “fuzzy” retrieval in algorithms
- Exception rules to overcome OCR errors
- More work needed in this space
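One way to read the first two recommendations is sketched below: first try reversing the common OCR substitutions listed under "Top OCR errors" (the exception rules), then fall back to approximate string matching against a list of known names (the fuzzy retrieval). This is only an illustration under stated assumptions. `KNOWN_NAMES`, `OCR_FIXES`, the function names, and the 0.85 cutoff are all hypothetical, the substitution list is a subset read as "original->OCR output", and nothing here describes how TaxonFinder or FAT actually work. The fallback uses `difflib.get_close_matches` from the Python standard library.

```python
import difflib

# (OCR output, likely original) pairs: a subset of the errors listed above,
# reversed on the assumption that "e->c" means an original "e" was read as "c".
OCR_FIXES = [
    ("c", "e"), ("n", "u"), ("l", "i"), ("e", "c"), ("v", "n"),
    ("i", "l"), ("i", "r"), ("ii", "u"), ("l", "h"), ("ii", "h"), ("o", "e"),
]

# Hypothetical reference list of accepted names.
KNOWN_NAMES = ["Quercus alba", "Acer rubrum", "Pinus strobus"]


def candidate_corrections(token):
    """Yield variants of token with one substitution reversed at one position."""
    for wrong, right in OCR_FIXES:
        start = token.find(wrong)
        while start != -1:
            yield token[:start] + right + token[start + len(wrong):]
            start = token.find(wrong, start + 1)


def resolve(token, known=KNOWN_NAMES, cutoff=0.85):
    """Return the known name that token most plausibly represents, or None."""
    if token in known:
        return token
    # Exception rules: undo a single known OCR confusion.
    for candidate in candidate_corrections(token):
        if candidate in known:
            return candidate
    # Fuzzy fallback: closest known name above the similarity cutoff.
    matches = difflib.get_close_matches(token, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(resolve("Qucrcus alba"))   # fixed by the c->e rule
print(resolve("Pinus strohus"))  # no rule applies; fuzzy match
print(resolve("New York"))       # not a name; no match
```

The cutoff trades precision against recall: a lower value recovers more damaged names but also admits more false matches, so it would need tuning against a manually checked sample such as the one used in this study.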
For additional information about the study:
- Project wiki, including downloadable datasets
- TDWG Presentation
- TDWG Poster (page size)
Please post questions about this study as comments below, where they will be more widely visible than in e-mail correspondence.