Friday, April 10, 2009

Improved handling of diacritics in BHL searches

I wanted to let everyone know about a change that has been made to the search function of the BHL portal.

Until now, letters that include diacritics (for example, ó, ö, è, é, û) were treated differently than letters without diacritics.

This meant that to find titles, authors, or subjects that included diacritics, you had to search for an exact match on the diacritic. For example, to find all titles about "invertebrate zoology", you had to search twice: once for "invertebrate zoology" and once for "invertebrate zoölogy". (Or you had to search for something like "invertebrate zo" and hope you didn't get too much extra stuff in the search results.) Obviously, there are all sorts of problems with this limitation.

Starting immediately, searches in the BHL portal are accent-insensitive, so no distinction is made between letters with and without diacritics. This means that a search for "invertebrate zoology" will now find all nine titles that contain either "invertebrate zoology" or "invertebrate zoölogy". See the search results here: Another good example is searches for "Linne", which now return instances of both "Linne" and "Linné".
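
For readers curious about what accent-insensitive matching involves, here is a minimal sketch in Python. It is not the BHL portal's actual implementation, which this post does not describe. It assumes one common approach: decompose each character with Unicode normalization and drop the combining marks before comparing. The function names and the sample titles are made up for illustration.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining marks removed (e.g. 'ö' becomes 'o')."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold(text: str) -> str:
    """Normalize text for comparison: no diacritics, case-insensitive."""
    return strip_diacritics(text).casefold()


def search(query: str, titles: list[str]) -> list[str]:
    """Return the titles that contain the query, ignoring diacritics and case."""
    needle = fold(query)
    return [title for title in titles if needle in fold(title)]


# Hypothetical sample data, not taken from the BHL catalogue.
sample_titles = [
    "A text-book of invertebrate zoology",
    "Invertebrate zoölogy",
    "Systema naturae",
]

matches = search("invertebrate zoology", sample_titles)
```

With this sample data, `matches` holds both the "zoology" and the "zoölogy" titles, and `strip_diacritics("Linné")` gives "Linne". A database-backed search would more likely get the same effect from an accent-insensitive collation or a precomputed folded column than from folding every title at query time.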

While there is still more work to do to improve the search features, this is a good first step toward improving the quality of our search results.

Mike Lichtenberg
Missouri Botanical Garden

1 comment:

  1. Francisco Welter-Schultes, April 16, 2009 at 6:45 AM

    This is a great step forward! It really helps improve the search function at BHL enormously.
    Two other proposals:
    (1) A function is needed so that if I look for Linné I will also find Linnaeus or Linnaei. Many authors have various synonymous spellings, and the search engines of many library catalogues are able to tolerate this.
    (2) If I enter two words in the search field (example: "amoenitates variae"), the result should show all publications which contain both words, regardless of their position in the text. Current results only list those works where the two terms follow directly one after the other (test by searching for "amoenitates").