An evaluation of taxonomic name finding

Starting this past June, BHL worked with Qin Wei, a Ph.D. student in Library and Information Science at the University of Illinois Urbana-Champaign, to evaluate the taxonomic name-finding software and algorithms used to identify scientific names throughout the BHL corpus. This work led to some interesting findings, which were reported this week via poster and oral presentation at the Biodiversity Information Standards (TDWG) 2008 conference in Fremantle, Australia.


Methodology

  • Scholarly volunteers manually identified scientific names on a random sample of 392 pages in BHL (0.01% of the BHL corpus at the time of the study).
  • Those manually identified names were compared first against the OCR text, then against the output of two name-finding algorithms (TaxonFinder & FAT).

Characteristics of the sample

  • Number of Pages: 392
  • Average Number of Words per Page: 446.8
  • Average Number of Names per Page: 7.7
  • Total Number of Names: 3,003
  • Total Number of Unique Names: 2,610

OCR Errors

  • Of the 3,003 names, 1,056 were incorrectly transcribed by OCR, for an error rate of 35.16%
  • Top OCR errors
    1 Insert Space
    2 Omit Space
    3 e->c
    4 u->I
    5 u->n
    6 i->l
    7 c->e
    8 n->v
    9 l->i
    10 r->i
    11 u->ii
    12 h->l
    13 h->ii
    14 e->o
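One way to derive error labels like those in the list above is to align each manually identified name with its OCR rendering and record the differing spans. The sketch below uses Python's standard difflib for the alignment. It is not the procedure used in the study, which the post does not describe, and the name pairs are made-up illustrations rather than pages from the sample.

```python
from collections import Counter
from difflib import SequenceMatcher


def ocr_edits(truth: str, ocr: str) -> list[str]:
    """Label each difference between a true name and its OCR rendering.

    Labels follow the list above: 'Insert Space', 'Omit Space',
    or 'true->ocr' for a character-level substitution.
    """
    labels = []
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        expected, got = truth[i1:i2], ocr[j1:j2]
        if expected == "" and got == " ":
            labels.append("Insert Space")   # OCR added a space
        elif expected == " " and got == "":
            labels.append("Omit Space")     # OCR dropped a space
        else:
            labels.append(f"{expected}->{got}")
    return labels


# Hypothetical (true name, OCR text) pairs, for illustration only.
pairs = [
    ("Quercus alba", "Qucrcus alba"),
    ("Quercus alba", "Quer cus alba"),
    ("Quercus alba", "Quercusalba"),
    ("Ulmus", "Ulmiis"),
]

# Tally the labels across all pairs, most frequent first.
tally = Counter(label for truth, ocr in pairs for label in ocr_edits(truth, ocr))
for label, count in tally.most_common():
    print(label, count)
```

Ranking the tally by frequency over a real set of name pairs would produce a list in the same form as the one above.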

Performance of algorithms

  • TaxonFinder
    • Excluding names with OCR errors
      • Precision 40.32%
        Recall 36.62%
        F-score 38.47%
    • Including names with OCR errors
      • Precision 43.77%
        Recall 25.82%
        F-score 34.80%
  • FAT
    • Excluding names with OCR errors
      • Precision 28.20%
        Recall 23.34%
        F-score 25.77%
    • Including names with OCR errors
      • Precision 32.25%
        Recall 17.21%
        F-score 24.73%
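For readers who want to work with these figures, the sketch below shows how precision and recall are conventionally computed from a set of found names and a set of manually identified names, and how the balanced F-score (F1, the harmonic mean) is derived from them. Exact string matching between the two sets is an assumption here, since the post does not say how names were matched. One observation from the arithmetic: the F-score values listed above coincide, to two decimals, with the arithmetic mean of precision and recall, while the harmonic mean gives somewhat lower values. Which formula the study used is not stated.

```python
def precision_recall(found: set[str], gold: set[str]) -> tuple[float, float]:
    """Precision and recall of found names against a manually built gold set.

    Assumes exact string equality counts as a match (an illustrative choice).
    """
    true_positives = len(found & gold)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def f1(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) in percent, copied from the list above.
REPORTED = {
    ("TaxonFinder", "excluding OCR errors"): (40.32, 36.62, 38.47),
    ("TaxonFinder", "including OCR errors"): (43.77, 25.82, 34.80),
    ("FAT", "excluding OCR errors"): (28.20, 23.34, 25.77),
    ("FAT", "including OCR errors"): (32.25, 17.21, 24.73),
}

for (algorithm, condition), (p, r, reported) in REPORTED.items():
    harmonic = f1(p, r)
    arithmetic = (p + r) / 2
    print(f"{algorithm:12s} {condition:22s} "
          f"harmonic={harmonic:.2f} arithmetic={arithmetic:.2f} reported={reported:.2f}")
```

Either way the ranking is unchanged: TaxonFinder scores higher than FAT under both conditions.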

Considerations

  • Improving OCR software is currently out of scope for BHL
    • Investigations into Tesseract may be worthwhile
  • Rekeying is too expensive and will not scale

Recommendations

  • Enhance “fuzzy” retrieval in algorithms
    • Exception rules to overcome OCR errors
  • More work needed in this space
    • More evaluations & experiments
    • Robust training sets
      • reCAPTCHA for names?
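As a rough illustration of what exception rules for "fuzzy" retrieval could look like, the sketch below undoes one reported character confusion at a time in an OCR token and checks the result against a list of known names. This is not how TaxonFinder or FAT work. The tiny lexicon is a hypothetical stand-in for a real name list, and the space insertion and omission errors are not handled.

```python
# Character confusions from the OCR error list, as (true text, OCR output).
CONFUSIONS = [
    ("e", "c"), ("u", "I"), ("u", "n"), ("i", "l"), ("c", "e"), ("n", "v"),
    ("l", "i"), ("r", "i"), ("u", "ii"), ("h", "l"), ("h", "ii"), ("e", "o"),
]


def single_fix_candidates(token: str) -> set[str]:
    """Return the token plus every string obtained by undoing exactly one
    confusion at exactly one position."""
    candidates = {token}
    for true_text, ocr_text in CONFUSIONS:
        start = token.find(ocr_text)
        while start != -1:
            fixed = token[:start] + true_text + token[start + len(ocr_text):]
            candidates.add(fixed)
            start = token.find(ocr_text, start + 1)
    return candidates


def match_name(token: str, lexicon: set[str]) -> list[str]:
    """Known names reachable from the OCR token with at most one fix."""
    return sorted(single_fix_candidates(token) & lexicon)


# Hypothetical lexicon for illustration; a real one would be far larger.
lexicon = {"Quercus", "Ulmus"}

for token in ["Qucrcus", "Ulmiis", "Ulmus", "Pinus"]:
    print(token, "->", match_name(token, lexicon))
```

Limiting the search to a single fix per token keeps the candidate set small. Allowing several fixes would catch more errors but also raise the risk of false matches against a large lexicon.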

For additional information about the study:

Please post questions about this study as comments below, where they will be more visible than in e-mail correspondence.

Written by

Chris Freeland served as the BHL Technical Director from 2006-2012. He is currently the Director of the Open Libraries program at Internet Archive. In this capacity he works with libraries & publishers to digitize their collections, working towards the Archive’s mission of providing “universal access to all knowledge.”