File formats for citation storage and distribution
We’ve been investigating options for storage and distribution of citation data in the Biodiversity Heritage Library. In particular, we are searching for an appropriate “core” format. The thought is that with an appropriately verbose, open, standard core format for our citations, we can transform that format into whatever other format we might want to support. By “verbose”, we mean a format that can support all of the information that we need to preserve. By “open”, we’re looking for a format that’s not tied exclusively to one system or vendor. And by “standard”, we’re hoping to identify a format that is widely recognized by the library community.
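To make the "core format in, any format out" idea concrete, here is a minimal Python sketch. It is not BHL code: the `Citation` class, its field names, and the sample record are hypothetical stand-ins for whatever core format is eventually chosen, and the RIS and BibTeX writers cover only a handful of common tags and fields.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Citation:
    """Hypothetical 'core' record; the field names are illustrative only."""
    authors: List[str]
    title: str
    journal: str
    year: str
    volume: str = ""
    start_page: str = ""
    end_page: str = ""


def to_ris(c: Citation) -> str:
    """Render the core record as a minimal RIS journal-article entry."""
    lines = ["TY  - JOUR"]
    lines.extend(f"AU  - {name}" for name in c.authors)
    lines.append(f"TI  - {c.title}")
    lines.append(f"JO  - {c.journal}")
    lines.append(f"PY  - {c.year}")
    if c.volume:
        lines.append(f"VL  - {c.volume}")
    if c.start_page:
        lines.append(f"SP  - {c.start_page}")
    if c.end_page:
        lines.append(f"EP  - {c.end_page}")
    lines.append("ER  - ")  # end-of-record tag
    return "\n".join(lines)


def to_bibtex(c: Citation, key: str) -> str:
    """Render the same record as a minimal BibTeX @article entry."""
    if c.start_page and c.end_page:
        pages = f"{c.start_page}--{c.end_page}"
    else:
        pages = c.start_page
    fields = {
        "author": " and ".join(c.authors),
        "title": c.title,
        "journal": c.journal,
        "year": c.year,
        "volume": c.volume,
        "pages": pages,
    }
    # Skip empty fields so optional data does not produce blank entries.
    body = ",\n".join(f"  {k} = {{{v}}}" for k, v in fields.items() if v)
    return f"@article{{{key},\n{body}\n}}"


# Placeholder record, made up purely for illustration.
example = Citation(
    authors=["Author, A.", "Writer, B."],
    title="An example article title",
    journal="Example Journal",
    year="2009",
    volume="12",
    start_page="1",
    end_page="10",
)

print(to_ris(example))
print(to_bibtex(example, "author2009"))
```

The point of the sketch is only that each output format becomes one small function over the core record, so the core format has to carry at least as much detail as the richest output we want to support.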
The Wikipedia article comparing reference management software has guided this research: http://en.wikipedia.org/wiki/Comparison_of_reference_management_software. Its information about which formats each application supports has been particularly useful.
Following is a brief description of the format candidates we’ve investigated, as well as our preliminary conclusions.
- The following formats appear (at first look) to be the most open, verbose, and recognized formats.
  - METS/MODS – Library of Congress standards
    http://www.loc.gov/standards/mets/
    http://www.loc.gov/standards/mods/ – examples can be found under the “Guidance” section
  - NLM – National Library of Medicine format
    http://dtd.nlm.nih.gov/ – DTDs
    http://www.ncbi.nlm.nih.gov/staff/beck/citations/citationtags.html – examples
  - EndNote (RIS/XML) – this seems to be the most widely adopted format
    http://www.endnote.com/support/ensupport.asp – XML DTD is here
    http://refdb.sourceforge.net/manual-0.9.4/c2166.html – RIS format description
- The following format is also a possibility, but it may be overly complex for our needs.
  - RDF
    http://en.wikipedia.org/wiki/Resource_Description_Framework
- Here are other formats that we have looked at, but that appear to be deficient in one way or another.
  - UniXRef – this is the XML format CrossRef returns from their OpenURL resolver. The verbosity of this format is good; it appears that a document using this format could contain all of the information that we require. However, it is unclear how widely this format has been adopted outside of specialized custom applications.
    http://www.crossref.org/help/Content/04_Queries_and_retrieving/Bulk%20metadata%20distribution.htm – schema found by clicking on “Unified XML Schema – Overview”
  - COinS – not widely adopted
  - MARC – doesn’t support article-level metadata (pages, etc.)
  - Dublin Core – not detailed enough
  - BibTeX – not detailed enough
  - OAI outputs – only a few defined outputs, which happen to be formats that are defined elsewhere (Dublin Core, RFC1807, MARC)
  - RefWorks – too proprietary
If you have experience with one or more of these formats and would like to help us make our decision, please post your comments below.
Unixref, PubMed, and Prism are all good options. OAI does not serve the purpose at all, as Rod stated.
Oh, BTW, Dean raises an interesting point about inline markup. This is an issue in other fields as well.
RDF has support for the concept in its XML Literal, but you still need a way to mark it up. As I and others have been thinking about this more on the formatting output end (I wrote and maintain the CSL schema for citation config, which is what Zotero and Mendeley use), I've tended to think a small subset of HTML ought to be enough.
I'm one of the editors of the bibliography ontology RDF spec. Really depends on what your data needs are, but RDF is designed for extensibility, and bibo is a nice balance of simplicity and rigor (if I do say so myself); better than the other alternatives.
I've posted a Python bibo/rdf-object mapping here if you or your programmers are curious.
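For readers who haven't seen bibo, here is a rough sketch (not the mapping mentioned above) of what a description of one article could look like, written out as N-Triples from plain Python. The example URI and values are made up, and only a few bibo and Dublin Core terms are used; a real record would likely carry authors, the containing journal, dates, and more.

```python
BIBO = "http://purl.org/ontology/bibo/"
DCTERMS = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def literal(value: str) -> str:
    """Quote a plain string literal (escapes only backslashes and quotes;
    values are assumed to be single-line)."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def article_triples(uri: str, title: str, volume: str,
                    page_start: str, page_end: str) -> str:
    """Return N-Triples describing one article with a few bibo terms."""
    subject = f"<{uri}>"
    rows = [
        (RDF_TYPE, f"<{BIBO}Article>"),
        (DCTERMS + "title", literal(title)),
        (BIBO + "volume", literal(volume)),
        (BIBO + "pageStart", literal(page_start)),
        (BIBO + "pageEnd", literal(page_end)),
    ]
    return "\n".join(f"{subject} <{pred}> {obj} ." for pred, obj in rows)


# Hypothetical article URI and values, for illustration only.
triples = article_triples(
    "http://example.org/article/1",
    "An example article title",
    "12",
    "1",
    "10",
)
print(triples)
```

Because every statement is just subject, property, value, adding a field later means adding a row rather than revising a schema, which is the extensibility argument in a nutshell.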
There's an issue for taxonomic literature that I haven't seen discussed yet: the requirement to be able to mark genus/species names as italic in titles. It seems trivial, but it's not. If that's not an intrinsic feature of the bibliographic format, then manual correction would be needed for publication use of any references. Ouch.
For that reason, I'm putting in a vote for NLM as the format. It does have explicit markup for character emphasis in titles (including italics).
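To show why markup inside the title matters, here is a small Python sketch assuming an NLM-style title in which `<italic>` wraps the taxon name. The sample title and the genus name in it are invented, and the helper names are my own; the code handles only one level of inline elements.

```python
import xml.etree.ElementTree as ET
from html import escape


def title_to_html(xml_fragment: str) -> str:
    """Convert a title with <italic> children to HTML, keeping the italics."""
    root = ET.fromstring(xml_fragment)
    parts = [escape(root.text or "")]
    for child in root:
        inner = escape("".join(child.itertext()))
        # Only <italic> is mapped; any other inline element is flattened.
        parts.append(f"<i>{inner}</i>" if child.tag == "italic" else inner)
        parts.append(escape(child.tail or ""))
    return "".join(parts)


def title_to_plain(xml_fragment: str) -> str:
    """Drop all inline markup, as a format without title markup would."""
    return "".join(ET.fromstring(xml_fragment).itertext())


# Invented title; "Examplea" is a placeholder genus name.
sample = (
    "<article-title>A revision of the genus "
    "<italic>Examplea</italic> (Insecta)</article-title>"
)

print(title_to_html(sample))
print(title_to_plain(sample))
```

Going from the marked-up title to plain text is trivial, but the reverse is not: once the italics are lost, software has no reliable way to know which words were taxon names.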
After an hour of peering at METS/MODS and PRISM documentation, I can't see anything like that in either of those (but I'm definitely willing to be corrected on that).
I'd vote against using Endnote-related formats (XML or otherwise) as a core format (though their XML format does include markup for italics). The only advantage would be easy interoperability with Endnote software itself. But that could be broken by Thomson Reuters at any time, leaving us with an "orphaned" format that would require transformation for Endnote use anyway.
Well, Endnote is not open, in the sense that whoever owns Endnote (I don’t know who it is nowadays) can change the format at any time. And then what, will we be using an outdated version of it? Also, it is primarily intended for use by Endnote users.
I don’t see how Dublin core is going to have enough resolution to describe the kinds of works in the BHL.
Perhaps because I am a librarian, I don’t share Rod’s objections to METS/MODS/MARC. These have good resolution and there are good semantic guidelines for what each field should contain. I gather that many people don’t agree with me. I suppose this is an issue for another day.
I don’t see what the BHL wants out of this file format—a core storage format, that is then transformed into others for distribution as users need it, or else a single format that is good for both storage and for distribution.
I wonder if you might consider the bibliographic ontology?
Hi. Congratulations on your excellent blog. I would like to invite you to visit my blog and learn something about Brazil. Big hug.