Tuesday, April 21, 2009

File formats for citation storage and distribution

We’ve been investigating options for storage and distribution of citation data in the Biodiversity Heritage Library. In particular, we are searching for an appropriate "core" format. The thought is that with an appropriately verbose, open, standard core format for our citations, we can transform that format into whatever other format we might want to support. By “verbose”, we mean a format that can support all of the information that we need to preserve. By “open”, we’re looking for a format that’s not tied exclusively to one system or vendor. And by “standard”, we’re hoping to identify a format that is widely recognized by the library community.

Some of the information found in this Wikipedia article has guided the research. Specifically, its information about which formats each of the various applications supports is useful.

Following is a brief description of the format candidates we’ve investigated, as well as our preliminary conclusions.

If you have experience with one or more of these formats and would like to help us make our decision, please post your comments below.

Mike Lichtenberg
Missouri Botanical Garden


  1. Quick (biased) comments.

    1. Endnote is not the same as RIS. RIS is defined here. Endnote supports this, but has its own native format (as well as its own XML).

    Support for RIS is essential: it makes bulk import easy and pleases users. One problem is that author names aren't atomised into components, so consuming software has to do this.

    2. COinS isn't very widely supported, but Zotero makes use of it, so won't hurt to use it (as you already do).

    3. METS/MODS/MARC don't exist outside of librarians' heads. Avoid like plague, few will use it. Librarians are not (or should not be) the target audience (see also OAI)

    4. NLM great for within document citations, doubt it's used outside publishers preparing manuscripts.

    5. RDF is great, and you missed the PRISM vocabulary (1.0 is here, 2.0 is out).
    See for one nice example. Problem is there are competing vocabularies and the winner isn't clear (although the publishing industry mostly uses Dublin Core and PRISM).

    6. UniXRef I use this as I consume CrossRef services, but not aware of anyone using it outside of CrossRef.

    7. BibTeX is great for LaTeX users like me, but probably not great as a core format

    8. OAI sucks, far too vague, allows users to store poorly formatted junk in metadata fields (just take a look at any DSpace OAI output).

    Personally, I'd pick one of Endnote XML, NLM XML, or UniXRef, and then generate alternative formats using XSLT. Make core XML, RIS, and COinS available to users.
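As a rough illustration of the RIS points in this comment, here is a minimal Python sketch. It is a stand-in for the XSLT approach suggested above, not an implementation of it. The record layout (`title`, `authors`, `journal` and so on) is a hypothetical core record invented for this example, not a BHL structure. The tags used (`TY`, `AU`, `TI`, `JO`, `VL`, `SP`, `EP`, `PY`, `ER`) are standard RIS tags, and the author splitting assumes the usual "Family, Given, Suffix" convention for `AU` values.

```python
def split_ris_author(value):
    """Split an RIS AU value ("Family, Given, Suffix") into components.

    RIS stores the whole name in one field, so a consumer has to do this.
    """
    parts = [p.strip() for p in value.split(",", 2)]
    parts += [""] * (3 - len(parts))  # pad missing given name / suffix
    family, given, suffix = parts
    return {"family": family, "given": given, "suffix": suffix}


def to_ris(record):
    """Serialise a hypothetical core record (a dict) as one RIS entry."""
    lines = ["TY  - JOUR"]
    for author in record.get("authors", []):
        name = author["family"]
        if author.get("given"):
            name += ", " + author["given"]
        lines.append("AU  - " + name)
    # Map the assumed core field names to RIS tags.
    for tag, key in [("TI", "title"), ("JO", "journal"), ("VL", "volume"),
                     ("SP", "start_page"), ("EP", "end_page"), ("PY", "year")]:
        if record.get(key):
            lines.append("%s  - %s" % (tag, record[key]))
    lines.append("ER  - ")
    return "\n".join(lines)


example = {
    "title": "On the tendency of species to form varieties",
    "authors": [{"family": "Darwin", "given": "Charles"}],
    "journal": "Journal of the Proceedings of the Linnean Society",
    "volume": "3",
    "year": "1858",
}
print(to_ris(example))
```

Keeping authors atomised in the core record means the RIS export only has to join the parts, whereas going the other way (from RIS back to components) relies on the comma convention holding.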

  2. Perhaps a simple poll when all's said and done would be a good idea?

    I have some reservations about BHL hooking up their cart to someone else's horse: you'd need to make sure you're headed in the same direction, i.e. how aligned is BHL with the types of resources that are described by each of these formats?

    I generally agree with Rod's comments. My personal prefs. are based around familiarity:

    1) Unixref
    2) PubMed
    3) Prism

    What I'm looking for in a file format for citation storage is something which is granular enough, and supports a wide range of resource types, yet is fairly readily understood by looking at a couple of examples.

    I'm glad you haven't considered RDA (have you seen the docs. for it?). NLM and NAL are "looking at it"

  3. Okay, sorry NLM > PubMed, but you get the idea.

    I'll be interested to see how many strong candidates for BHL needs there are!

  4. Rod, Tom - Exactly the feedback we wanted! Agreed, Rod, we want to pick one core format and XSLT to everything else. Do either of you have opinions on the "openness" of UniXRef or EndNote XML, or is that even a concern here given that both have a published schema?

  5. Chris,

    I suppose what I was alluding to earlier is also to do with versions of the EndNote, etc. XML - how will you determine whether it's worth migrating from one version of an XML format to another?

    I'm not sure I have anything to add about the "openness" - Thomson Reuters would be the people to talk to (assuming you haven't already). I suppose I'm personally more likely to feel uneasy about going with a corporate entity's format than with that of a group such as CrossRef or a more "public good" entity such as NLM.

    Ultimately though, it's your call, and you need to go with the best fit.

    I take it you're still implementing an OAI interface - would you XSLT to that too, or will you also support the native format you choose for storage e.g. NLM as well as OAI_DC?

  6. Tom, yes, we would support both our 'core' format and OAI_DC in OAI.

    I'm thinking that NLM is our best pick. None of the schemas covers some of the weird stuff we encounter in historic biodiversity literature, so *any* and *all* of them fall short of meeting our requirements at 100%. I'm favoring NLM because 1) it's widely used by the contemporary publishing industry, including PLoS, and bringing historic publications into a modern publishing environment is really, really cool, and 2) the GoldenGate/Plazi folks have been in discussion with NLM about a taxonomic extension to the schema that would accommodate taxonomic/nomenclatural acts. That's huge for us.

    For those reasons, I think I've convinced myself that our core format should be NLM, then XSLT to everything everyone wants.

    Have I convinced you??


  7. Hi Chris,

    Yes, I'm happy with NLM... I think they do great work, and if they're working on a "taxonomic extension to the schema that would accommodate taxonomic/nomenclatural acts" which is good for BHL, then I can't complain!!!


  8. Hi Chris,

    NLM looks very comprehensive for citations at title or item level. I would advise using METS/MODS too, as it is an open digital library standard; most of us are coming from a MARC metadata background (so it is easy to migrate to that), and the additional admin/technical data METS holds allows a more complete solution to the metadata issue.

    I also support COinS. I didn't realise we were already using this at item level (cool!) but would like to see it extended to the BHL hit list level if that's doable within the standard. Here's why: with Zotero, if you get a hit list, you could pull all the hits into Zotero in one go. Also, if configuring a federated search system (e.g. Metalib, which we are building for EDIT, a project which has connections to BHL-E), we can instantly create a combined BHL + JSTOR (or database of your choice) search, because there is metadata to grab. This could potentially be a very powerful research tool.
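For readers unfamiliar with COinS, the hit-list idea in this comment can be sketched as follows. A COinS is an empty HTML `span` with class `Z3988` whose `title` attribute carries an OpenURL ContextObject as URL-encoded key/value pairs. The citation dict layout and the function names below are assumptions made for this example, and only journal-article keys are shown; this is not a description of how BHL generates its COinS.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(citation):
    """Build one COinS span for a journal article (hypothetical dict layout)."""
    pairs = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.genre", "article"),
        ("rft.atitle", citation.get("title")),
        ("rft.jtitle", citation.get("journal")),
        ("rft.volume", citation.get("volume")),
        ("rft.spage", citation.get("start_page")),
        ("rft.date", citation.get("year")),
    ]
    pairs += [("rft.au", name) for name in citation.get("authors", [])]
    # Drop empty values, URL-encode, then HTML-escape for the attribute.
    query = urlencode([(k, v) for k, v in pairs if v])
    return '<span class="Z3988" title="%s"></span>' % escape(query, quote=True)


def coins_for_hit_list(hits):
    """Emit one span per search hit so a tool can pick up every result."""
    return "\n".join(coins_span(hit) for hit in hits)


hits = [
    {"title": "A first article", "journal": "Some Journal", "year": "1901",
     "authors": ["Smith, J."]},
    {"title": "A second article", "journal": "Another Journal", "year": "1902"},
]
print(coins_for_hit_list(hits))
```

Because each hit gets its own span, nothing in the format itself prevents use on a results page; whether a given client imports them all at once depends on that client.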

  9. NLM looks pretty good to me. WRT BibTeX - my understanding is that most programs that'll read RIS will also read BibTeX - being the oldest format. It also looks like a lot of the heavy XSLT lifting's been done by a project called BibTeXML - from this page on the project site:
    Presentation and exchange with XSLT

    XSLT specs for converting BibTeX XML markup to BibTeX (LaTeX syntax), Dublin Core, MODS, RIS, DocBook, LaTeX biblist environment, or HTML using Harvard, Chicago or APA citation styles. There is also a testbed for generating HTML with links to the bookstore.

  10. A further reason to go with the NLM format is that BHL has already digitized, and will in the future digitize, back issues of journals that are currently being published in the NLM DTD, e.g. all the BioOne journals. Making the older journal literature _easily_ interoperable with current and future literature will enable new services. If BHL supports this, content holders will be more willing to deposit content with us.

  11. Hi. Congratulations on your excellent blog. I would like to invite you to visit my blog and learn something about Brazil. Big hug

  12. Well, Endnote is not open in the sense that whoever owns Endnote (I don't know who it is nowadays) can change the format at any time---and then what, we will be using an outdated version of it? Also, it is primarily intended for use by Endnote users.

    I don't see how Dublin core is going to have enough resolution to describe the kinds of works in the BHL.

    Perhaps because I am a librarian, I don't share Rod's objections to METS/MODS/MARC. These have good resolution and there are good semantic guidelines for what each field should contain. I gather that many people don't agree with me. I suppose this is an issue for another day.

    I don't see what the BHL wants out of this file format---a core storage format, that is then transformed into others for distribution as users need it, or else a single format that is good for both storage and for distribution.

    I wonder if you might consider the bibliographic ontology?

  13. There's an issue for taxonomic literature that I haven't seen discussed yet: the requirement to be able to mark genus/species names as italic in titles. It seems trivial, but it's not. If that's not an intrinsic feature of the bibliographic format, then manual correction would be needed for publication use of any references. Ouch.

    For that reason, I'm putting in a vote for NLM as the format. It does have explicit markup for character emphasis in titles (including italics).

    After an hour of peering at METS/MODS and PRISM documentation, I can't see anything like that in either of those (but I'm definitely willing to be corrected on that).

    I'd vote against using Endnote-related formats (XML or otherwise) as a core format (though their XML format does include markup for italics). The only advantage would be easy interoperability with Endnote software itself. But that could be broken by Thomson Reuters at any time, leaving us with an "orphaned" format that would require transformation for Endnote use anyway.
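To make the italics point concrete, here is a small Python sketch of how a title carrying NLM-style `<italic>` markup could be turned into either HTML or plain text. The sample title is invented, and the two function names are hypothetical helpers for this example; only the `article-title` and `italic` element names come from the NLM tag set.

```python
import xml.etree.ElementTree as ET
from html import escape


def title_to_html(elem):
    """Render a title element as HTML, mapping <italic> to <i>."""
    out = [escape(elem.text or "")]
    for child in elem:
        inner = title_to_html(child)
        # Only <italic> is mapped here; other inline tags are flattened.
        out.append("<i>%s</i>" % inner if child.tag == "italic" else inner)
        out.append(escape(child.tail or ""))
    return "".join(out)


def title_to_text(elem):
    """Drop all inline markup, e.g. for formats that cannot carry it."""
    return "".join(elem.itertext())


title = ET.fromstring(
    "<article-title>Notes on <italic>Quercus alba</italic> in Missouri"
    "</article-title>"
)
print(title_to_html(title))
print(title_to_text(title))
```

With the markup in the core format, the plain-text export is a lossy but automatic step; without it, the italics have to be restored by hand downstream.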

  14. I'm one of the editors of the bibliography ontology RDF spec. Really depends on what your data needs are, but RDF is designed for extensibility, and bibo is a nice balance of simplicity and rigor (if I do say so myself); better than the other alternatives.

    I've posted a Python bibo/rdf-object mapping here if you or your programmers are curious.

  15. Oh, BTW, Dean raises an interesting point about inline markup. This is an issue in other fields as well.

    RDF has support for the concept in its XML Literal, but you still need a way to mark it up. As I and others have been thinking about this more on the formatting output end (I wrote and maintain the CSL schema for citation config, which is what Zotero and Mendeley use), I've tended to think a small subset of HTML ought to be enough.
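One possible reading of "a small subset of HTML" is a whitelist filter over inline tags. The sketch below uses Python's standard `html.parser`. The allowed set (`i`, `b`, `sub`, `sup`) is an assumption chosen for illustration, not something defined by CSL, RDF, or any of the formats discussed here, and the filter does not try to repair unbalanced tags.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED = {"i", "b", "sub", "sup"}  # assumed subset, for illustration only


class InlineSubset(HTMLParser):
    """Keep whitelisted inline tags (without attributes), drop the rest."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED:
            self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Text is always kept, re-escaped so it stays valid markup.
        self.parts.append(escape(data, quote=False))


def keep_inline_subset(markup):
    parser = InlineSubset()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


print(keep_inline_subset('A new <i>Carabus</i> <span class="x">species</span>'))
```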

  16. Unixref, PubMed, or Prism are all good options. OAI does not serve the purpose at all, like Rod stated.