File formats for citation storage and distribution
We’ve been investigating options for storage and distribution of citation data in the Biodiversity Heritage Library. In particular, we are searching for an appropriate “core” format. The thought is that with an appropriately verbose, open, standard core format for our citations, we can transform that format into whatever other format we might want to support. By “verbose”, we mean a format that can support all of the information that we need to preserve. By “open”, we mean a format that’s not tied exclusively to one system or vendor. And by “standard”, we mean a format that is widely recognized by the library community.
Some of the information found in this Wikipedia article has guided the research: http://en.wikipedia.org/wiki/Comparison_of_reference_management_software. Specifically, the information found there about which formats are supported in each of the various applications is useful.
Following is a brief description of the format candidates we’ve investigated, as well as our preliminary conclusions.
- The following formats appear (at first look) to be the most open, verbose, and recognized formats.
  - METS/MODS – Library of Congress standards
    - http://www.loc.gov/standards/mets/
    - http://www.loc.gov/standards/mods/ – examples can be found under the “Guidance” section
  - NLM – National Library of Medicine format
    - http://dtd.nlm.nih.gov/ – DTDs
    - http://www.ncbi.nlm.nih.gov/staff/beck/citations/citationtags.html – examples
  - EndNote (RIS/XML) – this seems to be the most widely adopted format
    - http://www.endnote.com/support/ensupport.asp – XML DTD is here
    - http://refdb.sourceforge.net/manual-0.9.4/c2166.html – RIS format description
- The following format is also a possibility, but it may be overly complex for our needs.
  - RDF
    - http://en.wikipedia.org/wiki/Resource_Description_Framework
- Here are other formats that we have looked at, but that appear to be deficient in one way or another.
  - UniXRef – this is the XML format CrossRef returns from their OpenURL resolver. The verbosity of this format is good; it appears that a document using this format could contain all of the information that we require. However, it is unclear how much this format has been adopted outside of specialized custom applications.
    - http://www.crossref.org/help/Content/04_Queries_and_retrieving/Bulk%20metadata%20distribution.htm – schema found by clicking on “Unified XML Schema – Overview”
  - COinS – not widely adopted
  - MARC – doesn’t support article-level metadata (pages, etc.)
  - Dublin Core – not detailed enough
  - BibTeX – not detailed enough
  - OAI outputs – only a few defined outputs, which happen to be formats that are defined elsewhere (Dublin Core, RFC1807, MARC)
  - RefWorks – too proprietary
If you have experience with one or more of these formats and would like to help us make our decision, please post your comments below.
A further reason to go with the NLM format is that BHL has already digitized, and will in the future digitize, back issues of journals that are currently being published in the NLM DTD, e.g. all the BioOne journals. Making the older journal literature _easily_ interoperable with current and future literature will enable new services. If BHL supports this, content holders will be more willing to deposit content with us.
NLM looks pretty good to me. WRT BibTeX – my understanding is that most programs that’ll read RIS will also read BibTeX – being the oldest format. It also looks like a lot of the heavy XSLT lifting’s been done by a project called BibTeXML – from this page on the project site:
Presentation and exchange with XSLT
XSLT specs for converting BibTeX XML markup to BibTeX (LaTeX syntax), Dublin Core, MODS, RIS, DocBook, LaTeX biblist environment, or HTML using Harvard, Chicago or APA citation styles. There is also a testbed for generating HTML with links to the Amazon.com bookstore.
NLM looks very comprehensive for citations at title or item level. I would advise using METS/MODS too, as it is an open digital library standard; most of us are coming from a MARC metadata background (so it is easy to migrate to that), and the additional admin/technical data METS holds allows a more complete solution to the metadata issue.
I also support COinS. I didn’t realise we were already using this at item level (cool!) but would like to see it extended to the BHL hit list level if that’s doable within the standard. Here’s why: with Zotero, if you get a hit list, you could pull all the hits into Zotero in one go. Also, if configuring a federated search system (e.g. Metalib, which we are building for EDIT, a project which has connections to BHL-E), we can instantly create a combined search across BHL and JSTOR (or any database of your choice), because there is metadata to grab. This could potentially be a very powerful research tool.
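To make the hit-list idea concrete, here is a minimal sketch in Python of emitting one COinS span per search hit. The dictionary keys (`atitle`, `jtitle`, and so on), the function names, and the sample record are hypothetical and are not BHL's actual data model; only the `Z3988` class and the OpenURL key/value layout come from the COinS convention.

```python
from html import escape
from urllib.parse import urlencode

# Hypothetical hit record; the field names are assumptions for illustration.
SAMPLE_HIT = {
    "atitle": "An example article title",
    "jtitle": "Example Journal of Natural History",
    "volume": "12",
    "spage": "101",
    "epage": "110",
    "date": "1899",
    "authors": ["Smith, J. A.", "Jones, M."],
}

# Simple one-to-one mapping from record fields to OpenURL KEV keys.
FIELD_MAP = [
    ("atitle", "rft.atitle"),
    ("jtitle", "rft.jtitle"),
    ("volume", "rft.volume"),
    ("spage", "rft.spage"),
    ("epage", "rft.epage"),
    ("date", "rft.date"),
]


def coins_span(hit):
    """Return a COinS <span> describing one journal-article hit."""
    pairs = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.genre", "article"),
    ]
    for field, key in FIELD_MAP:
        if hit.get(field):
            pairs.append((key, str(hit[field])))
    # rft.au may repeat, one pair per author.
    for author in hit.get("authors", []):
        pairs.append(("rft.au", author))
    # URL-encode the pairs, then HTML-escape the result for the attribute.
    title = escape(urlencode(pairs), quote=True)
    return f'<span class="Z3988" title="{title}"></span>'


def coins_for_hits(hits):
    """Return one COinS span per hit, in hit-list order."""
    return [coins_span(hit) for hit in hits]
```

Each span would sit next to its hit in the results page, so a tool that recognises COinS can collect every citation on the page in one pass.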
Yes, I’m happy with NLM… I think they do great work, and if they’re working on a “taxonomic extension to the schema that would accommodate taxonomic/nomenclatural acts” which is good for BHL, then I can’t complain!!!
Tom, yes, we would support both our ‘core’ format and OAI_DC in OAI.
I’m thinking that NLM is our best pick. None of the schemas cover some of the weird stuff we encounter in historic biodiversity literature, so *any* and *all* of them fall short of meeting 100% of our requirements. I’m favoring NLM because 1) it’s widely used by the contemporary publishing industry, including PLoS, and bringing historic publications into a modern publishing environment is really, really cool, and 2) the GoldenGate/Plazi folks have been in discussion with NLM about a taxonomic extension to the schema that would accommodate taxonomic/nomenclatural acts. That’s huge for us.
For those reasons, I think I’ve convinced myself that our core format should be NLM, then XSLT to everything everyone wants.
Have I convinced you??
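As a rough illustration of the "one core format, then transform" idea, here is a small Python sketch that maps an NLM-style citation to RIS. XSLT would be the natural tool for this; Python's standard library has no XSLT processor, so the sketch uses `xml.etree.ElementTree` to show the same field mapping. The sample record is invented, and its element names are modeled on the NLM citation tags in simplified form, not validated against the DTD. The choice of RIS tags is likewise an assumption.

```python
import xml.etree.ElementTree as ET

# Invented sample record; element names are a simplified, unvalidated
# approximation of an NLM-style journal citation.
SAMPLE = """<citation citation-type="journal">
  <person-group person-group-type="author">
    <name><surname>Smith</surname><given-names>J A</given-names></name>
    <name><surname>Jones</surname><given-names>M</given-names></name>
  </person-group>
  <article-title>An example article title</article-title>
  <source>Example Journal of Natural History</source>
  <year>1899</year>
  <volume>12</volume>
  <fpage>101</fpage>
  <lpage>110</lpage>
</citation>"""

# (XML element, RIS tag) pairs for the single-valued fields.
TAG_MAP = [
    ("article-title", "TI"),
    ("source", "JO"),
    ("year", "PY"),
    ("volume", "VL"),
    ("fpage", "SP"),
    ("lpage", "EP"),
]


def to_ris(xml_text):
    """Convert one citation element to a RIS record (journal article only)."""
    root = ET.fromstring(xml_text)
    lines = ["TY  - JOUR"]
    # One AU line per author, written as "Surname, Given names".
    for name in root.iter("name"):
        surname = name.findtext("surname", default="").strip()
        given = name.findtext("given-names", default="").strip()
        lines.append("AU  - " + ", ".join(p for p in (surname, given) if p))
    for element, ris_tag in TAG_MAP:
        value = root.findtext(element)
        if value and value.strip():
            lines.append(f"{ris_tag}  - {value.strip()}")
    lines.append("ER  - ")  # end-of-record marker
    return "\n".join(lines)
```

The same mapping table could drive other outputs, which is the appeal of keeping a single verbose core record and deriving everything else from it.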
I suppose what I was alluding to earlier is also to do with versions of the EndNote, etc. XML – how will you determine whether it’s worth migrating from one version of an XML format to another?
I’m not sure I have anything to add about the “openness” – Thomson Reuters would be the people to talk to (assuming you haven’t already). I suppose I’m personally more likely to feel uneasy about going with a corporate entity’s format than with that of a group such as CrossRef or a more “public good” entity such as NLM.
Ultimately though, it’s your call, and you need to go with the best fit.
I take it you’re still implementing an OAI interface – would you XSLT to that too, or will you also support the native format you choose for storage e.g. NLM as well as OAI_DC?
Rod, Tom – Exactly the feedback we wanted! Agreed, Rod, we want to pick one core format and XSLT to everything else. Do either of you have opinions on the “openness” of UniXRef or EndNote XML, or is that even a concern here given that both have a published schema?
Okay, sorry NLM > PubMed, but you get the idea.
I'll be interested to see how many strong candidates for BHL needs there are!
Perhaps a simple poll when all’s said and done would be a good idea?
I have some reservations about BHL hooking up their cart to someone else’s horse – you’d need to make sure you’re headed in the same direction, i.e. how aligned is BHL with the types of resources that are described by each of these formats?
I generally agree with Rod’s comments. My personal prefs. are based around familiarity:
What I’m looking for in a file format for citation storage is something which is granular enough, and supports a wide range of resource types, yet is fairly readily understood by looking at a couple of examples.
I’m glad you haven’t considered RDA (have you seen the docs. for it?). NLM and NAL are “looking at it” http://www.loc.gov/bibliographic-future/rda/
Quick (biased) comments.
1. EndNote is not the same as RIS. RIS is defined here: http://www.refman.com/support/risformat_intro.asp EndNote supports this, but has its own native format (as well as its own XML).
Support for RIS is essential (makes bulk import easy), pleases users. One problem is author names aren’t atomised into components, so consuming software has to do this.
2. COinS isn’t very widely supported, but Zotero makes use of it, so won’t hurt to use it (as you already do).
3. METS/MODS/MARC don’t exist outside of librarians’ heads. Avoid like plague, few will use it. Librarians are not (or should not be) the target audience (see also OAI)
4. NLM is great for within-document citations, but I doubt it’s used outside publishers preparing manuscripts.
5. RDF is great, and you missed the PRISM vocabulary (1.0 is here, 2.0 is out http://www.prismstandard.org).
See http://ci.nii.ac.jp/info/en/if_rdf.html for one nice example. Problem is there are competing vocabularies and the winner isn’t clear (although publishing industry mostly use Dublin Core and PRISM).
6. UniXRef: I use this as I consume CrossRef services, but I’m not aware of anyone using it outside of CrossRef.
7. BibTeX is great for LaTeX users like me, but probably not great as a core format
8. OAI sucks, far too vague, allows users to store poorly formatted junk in metadata fields (just take a look at any DSpace OAI output).
Personally, I’d pick one of EndNote XML, NLM XML, or UniXRef, and then generate alternative formats using XSLT. Make core XML, RIS, and COinS available to users.
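On the author-name problem mentioned under point 1, this is a minimal Python sketch of what consuming software ends up doing with a RIS author line. It assumes the common "Family, Given, Suffix" comma convention for `AU` values; names that do not follow it (no comma, corporate authors, particles in unexpected places) would need heuristics that are not shown. The function name and the returned keys are hypothetical.

```python
def split_ris_author(value):
    """Split a RIS AU value into family name, given names, and suffix.

    Assumes the 'Family, Given, Suffix' comma convention; anything after
    a second comma is treated as the suffix.
    """
    parts = [part.strip() for part in value.split(",", 2)]
    # Pad so that missing given names or suffix come back as empty strings.
    parts += [""] * (3 - len(parts))
    return {"family": parts[0], "given": parts[1], "suffix": parts[2]}


def authors_from_ris(record_text):
    """Collect atomised authors from the AU lines of one RIS record."""
    authors = []
    for line in record_text.splitlines():
        if line.startswith("AU  - "):
            authors.append(split_ris_author(line[len("AU  - "):]))
    return authors
```

A core format that already stores family and given names separately avoids this guesswork, since joining the parts for RIS output is trivial while splitting them again is not.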