
Friday, June 13, 2008

Updated Harvesting Process from the Internet Archive

Note: This is a revision of our previous blog post that described our process for harvesting digitized books from the Internet Archive. Their query interface changed, and we've updated our process & documentation accordingly.

Disclaimer: BHL is not directly or indirectly involved with the development of this query interface. We scan books through Internet Archive and are consumers of their services & interfaces. We have provided this documentation to help inform others of our process. Questions or comments concerning the query interface, results returned, etc., should be directed to the Internet Archive.

-Chris Freeland, BHL Technical Director


Overview
The following steps are taken to download data from Internet Archive and host it on the Biodiversity Heritage Library. Diagrams of the process are available in PDF.
  1. Get item identifiers from Internet Archive for items in the "biodiversity" collection that have been recently added/updated.
  2. For each item identifier:
    • Get the list of files (XML and images) that are available for download.
    • Download the XML and image files.
    • Download the scan data if it is not included with the other downloaded files.
    • Extract the item metadata from the XML files and store it in the import database.
    • Extract the OCR text from the XML files and store it on the file system (one file per page).
  3. For each "approved" item, clean up and transform the metadata into an "importable" format and store the results in the import database.
  4. Read all data that is ready for import and insert/update the appropriate data in the production database.
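
As a rough illustration of step 2 only, the per-item download loop might be sketched as below. This is not our production code: the function name and the three callables passed in (list_files, download, fetch_scandata_from_zip) are hypothetical stand-ins for whatever performs the real work, and the metadata and OCR extraction steps are left out.

```python
from typing import Callable, Dict, Iterable, List


def harvest_items(
    identifiers: Iterable[str],
    list_files: Callable[[str], List[str]],
    download: Callable[[str, str], None],
    fetch_scandata_from_zip: Callable[[str], None],
) -> Dict[str, List[str]]:
    """Download every listed file for each item (hypothetical helper).

    Falls back to the scandata.zip route when the item has no
    <identifier>_scandata.xml file of its own.
    """
    downloaded: Dict[str, List[str]] = {}
    for identifier in identifiers:
        files = list_files(identifier)        # file list for this item
        for name in files:
            download(identifier, name)        # XML and image files
        if f"{identifier}_scandata.xml" not in files:
            fetch_scandata_from_zip(identifier)
        downloaded[identifier] = files
    return downloaded
```

Passing the callables in keeps the loop independent of how files are actually fetched, so it can be exercised with simple stubs.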
Internet Archive Metadata Files
The following table lists the key XML files containing metadata for items hosted by Internet Archive. It is possible that one or more of these files may not exist for an item. However, most items that have been "approved" (i.e. marked as "complete" by Internet Archive) do include each of these files.

Filename | Description
*iaidentifier*_files.xml | List of files that exist for the given identifier.
*iaidentifier*_dc.xml | Dublin Core metadata. In many cases the data included here overlaps with the data in the _meta.xml file.
*iaidentifier*_meta.xml | Dublin Core metadata, as well as metadata specific to the item on IA (scan date, scanning equipment, creation date, update date, status of the item, etc.).
*iaidentifier*_metasource.xml | Identifies the source of the item; there is not much meaningful data here.
*iaidentifier*_marc.xml | MARC data for the item.
*iaidentifier*_djvu.xml | The OCR for the item, formatted as XML.
*iaidentifier*_scandata.xml | Raw data about the scanned pages. In combination with the OCR text (_djvu.xml), the page numbers and page types can be inferred from this data. This file may not exist, though in most cases it does. For the most part, only materials added to IA prior to late summer 2007 are likely to be missing this file.
scandata.zip | Raw data about the scanned pages. If there is no *iaidentifier*_scandata.xml file for an item, we look in scandata.zip (via an IA API) for a scandata.xml file, which contains the same information.



Internet Archive Services
Search for Items
Internet Archive items belong to one or more collections. To search a particular Internet Archive collection for items that have been updated between two dates, use the following query:

http://www.archive.org/advancedsearch.php
?q={0}+AND+oai_updatedate:[{1}+TO+{2}]
&fl[]=identifier&fl[]=oai_updatedate&fmt=xml&xmlsearch=Search

where

{0} = name of the Internet Archive collection; in our case, "collection:biodiversity"
{1} = start date of range of items to retrieve (YYYY-MM-DD)
{2} = end date of range of items to retrieve (YYYY-MM-DD)

To limit the item search to a particular contributing institution, modify the query as follows:

http://www.archive.org/advancedsearch.php
?q={0}+AND+oai_updatedate:[{1}+TO+{2}]+AND+contributor:(MBLWHOI Library)
&fl[]=identifier&fl[]=oai_updatedate
&rows=100000&fmt=xml&xmlsearch=Search

To limit the results of the query to a particular number of items, set the rows parameter to the maximum number of items to return (100000 in this example):

http://www.archive.org/advancedsearch.php
?q={0}+AND+oai_updatedate:[{1}+TO+{2}]
&fl[]=identifier&fl[]=oai_updatedate
&rows=100000&fmt=xml&xmlsearch=Search
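
For readers who want to assemble these URLs in code, here is a minimal Python sketch of the date-range query, with the optional rows limit. The function name and constant are our own invention for illustration; the URL layout simply follows the templates above, and nothing here contacts Internet Archive.

```python
from datetime import date
from typing import Optional

SEARCH_BASE = "http://www.archive.org/advancedsearch.php"


def build_search_url(collection: str, start: date, end: date,
                     rows: Optional[int] = None) -> str:
    """Build the updated-items query for one collection (hypothetical helper).

    collection is the {0} value, e.g. "collection:biodiversity";
    start and end are the {1} and {2} dates, written as YYYY-MM-DD.
    """
    query = (f"{collection}+AND+oai_updatedate:"
             f"[{start.isoformat()}+TO+{end.isoformat()}]")
    url = f"{SEARCH_BASE}?q={query}&fl[]=identifier&fl[]=oai_updatedate"
    if rows is not None:
        url += f"&rows={rows}"          # cap on the number of items returned
    return url + "&fmt=xml&xmlsearch=Search"


print(build_search_url("collection:biodiversity",
                       date(2008, 6, 1), date(2008, 6, 13), rows=100000))
```

The URL is built by plain string formatting rather than a generic encoder so that the brackets and plus signs stay exactly as they appear in the templates.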

To search for one particular item, use:

http://www.archive.org/advancedsearch.php
?q={0}&fl[]=identifier&fl[]=oai_updatedate
&fmt=xml&xmlsearch=Search

where

{0} = an Internet Archive item identifier

Download Files
To download a particular file for an Internet Archive item, use the following query:

http://www.archive.org/download/{0}/{1}

where

{0} = an Internet Archive item identifier
{1} = the name of the file to be downloaded
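
As a small sketch of how the download URL fits together with the metadata filenames listed earlier, the following Python builds one URL per metadata file. The helper names and the identifier "exampleitem" are made up for illustration; only the URL pattern and the filename suffixes come from this post.

```python
from typing import List
from urllib.parse import quote

DOWNLOAD_BASE = "http://www.archive.org/download"

# Suffixes of the key metadata files described in the table above.
METADATA_SUFFIXES = ("_files.xml", "_dc.xml", "_meta.xml", "_metasource.xml",
                     "_marc.xml", "_djvu.xml", "_scandata.xml")


def build_download_url(identifier: str, filename: str) -> str:
    """Return the download URL for one file of one item (hypothetical helper)."""
    return f"{DOWNLOAD_BASE}/{quote(identifier)}/{quote(filename)}"


def metadata_urls(identifier: str) -> List[str]:
    """Download URLs for each metadata file; some may not exist for an item."""
    return [build_download_url(identifier, identifier + suffix)
            for suffix in METADATA_SUFFIXES]


for url in metadata_urls("exampleitem"):   # "exampleitem" is a made-up identifier
    print(url)
```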

Downloading Files Contained In ZIP Archives
In some cases, a file cannot be downloaded directly and must instead be extracted from a ZIP archive located at Internet Archive. One example is the scandata.xml file, which in some cases must be extracted from the scandata.zip file. Doing so takes two queries. First, invoke this query to get the physical file locations (on IA servers) for the given item:

http://www.archive.org/services/find_file.php
?file={0}
&loconly=1


where

{0} = an Internet Archive item identifier

Then, invoke the second query to extract the scandata.xml file from the scandata.zip file (using the physical file locations returned by the previous query):

http://{0}/zipview.php
?zip={1}/scandata.zip
&file=scandata.xml


where

{0} = host address for the file
{1} = directory location for the file

Note that the second query can be generalized to extract the contents of other zip files hosted at Internet Archive. The format for the query, shown here for a JPEG image, is:

http://{0}/zipview.php
?zip={1}/{2}
&file={3}.jpg


where

{0} = host address for the file
{1} = directory location for the file
{2} = name of the zip archive from which to extract a file
{3} = the name of the file to extract from the zip archive
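
The two-step lookup can be sketched in Python as a pair of URL builders. This only constructs the URLs; it does not parse the response from find_file.php, whose format is not covered in this post. The function names, host, and directory values below are placeholders, and the member argument is assumed to be the full file name including its extension.

```python
def build_find_file_url(identifier: str) -> str:
    """First query: ask where an item's files physically live on IA servers."""
    return ("http://www.archive.org/services/find_file.php"
            f"?file={identifier}&loconly=1")


def build_zipview_url(host: str, directory: str,
                      zip_name: str, member: str) -> str:
    """Second query: extract one member from a zip archive on the given host.

    host and directory are the values returned by the first query.
    """
    return f"http://{host}/zipview.php?zip={directory}/{zip_name}&file={member}"


# Placeholder host and directory, for illustration only.
print(build_find_file_url("exampleitem"))
print(build_zipview_url("host.example.org", "/0/items/exampleitem",
                        "scandata.zip", "scandata.xml"))
```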


Documentation written by Mike Lichtenberg.

Wednesday, June 4, 2008

WonderFetch(tm) & IA _meta.xml fields

Overview

WonderFetch is the term used for prepopulating the Internet Archive's metadata forms (so named because it is more wonderful than regular Z39.50 fetching). Using WonderFetch, partner libraries can populate fields with data that would not normally be populated as part of the standard IA process, and then store those values in the foobar_meta.xml file alongside each scanned item in the IA repository. Part of the impetus for implementing WonderFetch was not only to automate the inclusion of volume and issue information for serials, which was important, but also to capture due diligence, rights, and licensing information related to each item. (And yes, the TM is a little joke! No rights reserved.)


How does it work?

WonderFetch simply passes a series of parameters to the IA metaform software in a URL string. If you can do a Z39.50 query against your ILS and get your data into a format that will let you generate a URL (say, an HTML page output from a database, or a spreadsheet with your item data in it), you can WonderFetch!

To create WonderFetch links from your data, just append the relevant arguments and data (listed below) to one of two base URLs.

If your books are being “loaded”, that is, if the metadata fetch occurs on a scribe2 machine and the scanner operator uses WonderFetch while at the SCRIBE machine, use http://localhost.archive.org/biblio.php? as the base URL.

If your books are being pre-loaded or batch loaded on another computer before being scanned, or your SCRIBE is using the scribe1 software, use http://www.us.archive.org/biblio.php? as the base URL. With this URL, the person scanning cannot actually start shooting the book; the link only allows them to fetch metadata and create the foobar_meta.xml and marc.xml records.

For example, at SIL the scanner loads the books at the scribe station, and we Z-fetch on our barcode number (starts with 39088) so our URLs look like:

http://localhost.archive.org/biblio.php?f=c
&b_c1=biodiversity&b_l=Smithsonian%20Institution%20Libraries
&b_p=Smithsonian&z_e=Smithsonian%20Institution
&b_v=v.%209%201907&b_cn=no
&z_c=local&z_d=39088009080136&b_ib=39088009080136
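
If your item data lives in a database or spreadsheet, a link like the one above can be generated with a few lines of Python. This is only a sketch: the function and constant names are ours, and the parameter values are the SIL example values from this post. Spaces are encoded as %20 to match the example link.

```python
from urllib.parse import quote, urlencode

SCRIBE_BASE = "http://localhost.archive.org/biblio.php"    # loading at the scribe
PRELOAD_BASE = "http://www.us.archive.org/biblio.php"      # pre-loading elsewhere


def wonderfetch_url(base: str, params: dict) -> str:
    """Append WonderFetch arguments to a base URL (hypothetical helper)."""
    # quote_via=quote encodes spaces as %20 rather than "+".
    return base + "?" + urlencode(params, quote_via=quote)


sil_params = {
    "f": "c",
    "b_c1": "biodiversity",
    "b_l": "Smithsonian Institution Libraries",
    "b_p": "Smithsonian",
    "z_e": "Smithsonian Institution",
    "b_v": "v. 9 1907",
    "b_cn": "no",
    "z_c": "local",
    "z_d": "39088009080136",
    "b_ib": "39088009080136",
}

print(wonderfetch_url(SCRIBE_BASE, sil_params))
```

Swapping SCRIBE_BASE for PRELOAD_BASE produces the pre-load form of the same link.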



Below are the arguments each field takes, along with examples. The examples given are values BHL partners will be using (for the rights statement, for example).


(For additional reference, some definitions and usage of the standard IA fields can be found in: http://www.us.archive.org/biblio?f=usage )





LIST OF FIELDS in META_XML that can be pre-populated using WonderFetch:


CALL_NUMBER

Description: Number for ZQuery. This is *not* necessarily the call number of the item. It *is* the number used to fetch the MARC record for the item via Z39.50 from your ILS. Use whatever works for you: barcode, bib number, OCLC number, call number, etc.
Prepopulatable: Yes.
WonderFetch GET arg: "&z_d="
example: &z_d=39088009080136

IDENTIFIER-BIB
Description: Unique identifier for the item in Contributor library's catalog.
Prepopulatable: Yes
WonderFetch GET arg: "&b_ib="
example: &b_ib=39088009080136

TITLE-ID
Description: Unique identifier for title in Contributor library's catalog.
Prepopulatable: Yes
WonderFetch GET arg: "&tid="
example: &tid=b12345

VOLUME
Description: Volume of Book being scanned.
Prepopulatable: Yes.
WonderFetch GET arg: &b_v=
example: &b_v=v.%209%201907 (v. 9 1907)

YEAR
Description: Year assigned to Book being scanned
Prepopulatable: Yes.
WonderFetch GET arg: &year=
example: &year=1907


COLLECTION

Description: Collection(s) into which Book will be sorted.
Prepopulatable: Yes
WonderFetch GET arg: "&b_c1=" (c1 indicates the primary collection; there are also c2 and c3)
example: &b_c1=biodiversity

CONTRIBUTOR

Description: Library contributing Book for scanning
Prepopulatable: Yes
WonderFetch GET arg: "&b_l="
example: &b_l=Smithsonian%20Institution%20Libraries

SPONSOR
Description: Organization responsible for funding scanning
Prepopulatable: Yes
WonderFetch GET arg: "&b_p="
example: &b_p=Sloan

SCANNINGCENTER
Description: Scanning Center where book was scanned.
Prepopulatable: Yes
WonderFetch GET arg: "&b_n="
example: &b_n=Boston

DUE-DILIGENCE
Description: URL to Due Diligence statement for a Book scanned while still in Copyright.
Prepopulatable: Yes
WonderFetch GET arg: &dd=
example: &dd=dd-bhl (this signifies that a due diligence statement exists at the following URL: http://www.biodiversitylibrary.org/permissions )

LICENSE-TYPE
Description: Creative Commons License assigned to Book scanned.
Prepopulatable: Yes
WonderFetch GET arg: "&lic="
examples:
&lic=by ( http://creativecommons.org/licenses/by/3.0/ )
&lic=by-nc ( http://creativecommons.org/licenses/by-nc/3.0/ )
&lic=by-nd ( http://creativecommons.org/licenses/by-nd/3.0/ )
&lic=by-sa ( http://creativecommons.org/licenses/by-sa/3.0/ )
&lic=by-nc-nd ( http://creativecommons.org/licenses/by-nc-nd/3.0/ )
&lic=by-nc-sa ( http://creativecommons.org/licenses/by-nc-sa/3.0/ )

NEGOTIATED-RIGHTS
Description: URL to Negotiated Rights for a Book scanned while still in Copyright.
Prepopulatable: Yes
WonderFetch GET arg: "&rights="
example: &rights=nr-bhl (this signifies that a statement of right to digitize exists at the following URL: http://www.biodiversitylibrary.org/permissions/ )

POSSIBLE-COPYRIGHT-STATUS
Description: Indicates copyright status; defaults to Not in Copyright.
Prepopulatable: Yes
WonderFetch GET arg: "&pcs="
To make this field blank because the item *is* in copyright but you have permission to digitize it, pass a URL-encoded space as the parameter, e.g. &pcs=%20
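
The rights-related arguments can be assembled the same way as the rest of a WonderFetch link. The Python sketch below is an illustration under our own naming; it checks the license code against the six values listed above and adds the URL-encoded space for POSSIBLE-COPYRIGHT-STATUS when asked. Whether IA itself rejects other license codes is not something this post covers, so the check is purely a local convenience.

```python
from typing import Optional
from urllib.parse import quote

# License codes listed under LICENSE-TYPE above.
CC_LICENSES = ("by", "by-nc", "by-nd", "by-sa", "by-nc-nd", "by-nc-sa")


def rights_args(lic: Optional[str] = None,
                due_diligence: Optional[str] = None,
                rights: Optional[str] = None,
                blank_copyright_status: bool = False) -> str:
    """Return the rights-related GET arguments as one string (hypothetical helper)."""
    params = {}
    if due_diligence:
        params["dd"] = due_diligence
    if lic is not None:
        if lic not in CC_LICENSES:
            raise ValueError(f"unknown license code: {lic!r}")
        params["lic"] = lic
    if rights:
        params["rights"] = rights
    if blank_copyright_status:
        params["pcs"] = " "            # becomes %20, which blanks the field
    return "".join(f"&{key}={quote(value, safe='')}"
                   for key, value in params.items())


print(rights_args(lic="by-nc", rights="nr-bhl", blank_copyright_status=True))
```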

- Keri Thompson, Smithsonian Libraries (thompsonk@si.edu)