
Harvesting Process from Internet Archive

NOTE: Internet Archive has changed their query interface and these instructions are no longer valid.

New instructions are available at:
https://blog.biodiversitylibrary.org/2008/06/updated-harvesting-process-from.html

Overview
The following steps are taken to download data from Internet Archive and host it on the Biodiversity Heritage Library. Diagrams of the process are available in PDF.

1. Get item identifiers from Internet Archive for items in the “biodiversity” collection that have been recently added/updated.
2. For each item identifier:
   a. Get the list of files (XML and images) that are available for download.
   b. Download the XML and image files.
   c. Download the scan data if it is not included with the other downloaded files.
   d. Extract the item metadata from the XML files and store it in the import database.
   e. Extract the OCR text from the XML files and store it on the file system (one file per page).
3. For each “approved” item, clean up and transform the metadata into an “importable” format and store the results in the import database.
4. Read all data that is ready for import and insert/update the appropriate data in the production database.

Internet Archive Metadata Files
The following table lists the key XML files containing metadata for items hosted by Internet Archive. It is possible that one or more of these files may not exist for an item. However, most items that have been “approved” (i.e. marked as “complete” by Internet Archive) do include each of these files.

| Filename | Description |
| --- | --- |
| _files.xml | List of files that exist for the given identifier. |
| _dc.xml | Dublin Core metadata. In many cases the data included here overlaps with the data in the _meta.xml file. |
| _meta.xml | Dublin Core metadata, as well as metadata specific to the item on IA (scan date, scanning equipment, creation date, update date, status of the item, etc.). |
| _metasource.xml | Identifies the source of the item… not much meaningful data here. |
| _marc.xml | MARC data for the item. |
| _djvu.xml | The OCR for the item, formatted as XML. |
| _scandata.xml | Raw data about the scanned pages. In combination with the OCR text (_djvu.xml), the page numbers and page types can be inferred from this data. This file may not exist, though in most cases it does. For the most part, only materials added to IA prior to late summer 2007 are likely to be missing this file. |
| scandata.xml | Raw data about the scanned pages. If there is no _scandata.xml file for an item, we look in scandata.zip (via an IA API) for this file, which contains the same information. |
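The fallback rule for scan data can be sketched in a few lines of Python. This is only an illustration, not the harvesting code itself: it assumes that each file is named by appending the suffix from the table to the item identifier (for example `<identifier>_scandata.xml`), and the function name `scandata_source` and the sample identifier are made up for the example.

```python
def scandata_source(identifier, available_files):
    """Decide where the scan data for an item should come from.

    available_files is the set of file names known to exist for the item
    (for example, names read from its _files.xml).
    Assumption: files are named "<identifier><suffix>".
    """
    direct_name = identifier + "_scandata.xml"
    if direct_name in available_files:
        # Usual case: the scan data file can be downloaded directly.
        return ("download", direct_name)
    # Fallback: scandata.xml has to be extracted from scandata.zip.
    return ("extract_from_zip", "scandata.xml")


# Hypothetical identifier and file lists, for illustration only.
print(scandata_source("exampleitem", {"exampleitem_meta.xml", "exampleitem_scandata.xml"}))
print(scandata_source("exampleitem", {"exampleitem_meta.xml"}))
```

The function only decides which route to take; the download and the ZIP extraction themselves are separate requests.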

Internet Archive Services
Search for Items
Internet Archive items belong to one or more collections. To search a particular Internet Archive collection for items that have been updated between two dates, use the following query:

http://www.archive.org/services/search.php?query={0}+AND+updatedate:[{1}+TO+{2}]&submit=submit

where

{0} = name of the Internet Archive collection; in our case, “collection:biodiversity”
{1} = start date of range of items to retrieve
{2} = end date of range of items to retrieve
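As a rough sketch, the query string can be assembled by filling the three placeholders. The function name `build_search_url` and the date format used in the example are assumptions (the post does not say which date format the service expected), and, as noted at the top of this post, the endpoint itself is no longer valid, so the code only builds the URL and does not send a request.

```python
SEARCH_URL = "http://www.archive.org/services/search.php"


def build_search_url(collection, start_date, end_date):
    """Fill the {0}, {1} and {2} placeholders of the search query.

    The values are inserted exactly as given; no URL encoding is applied,
    mirroring the query template literally.
    """
    query = "{0}+AND+updatedate:[{1}+TO+{2}]".format(collection, start_date, end_date)
    return "{0}?query={1}&submit=submit".format(SEARCH_URL, query)


# The date format below is an assumption made for illustration.
print(build_search_url("collection:biodiversity", "2008-03-01", "2008-03-14"))
```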

To limit the item search to a particular contributing institution, modify the query as follows:

http://www.archive.org/services/search.php?query={0}+AND+updatedate:[{1}+TO+{2}]+AND+contributor:(MBLWHOI Library)&submit=submit

To limit the results of the query to a particular number of items, modify the query as follows:

http://www.archive.org/services/search.php?query={0}+AND+updatedate:[{1}+TO+{2}]&limit=1000&submit=submit
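The contributor and limit variations could be handled as optional arguments, roughly as below. The function name `build_filtered_search_url` is hypothetical, the date format is an assumption, and replacing the space in the contributor name with "+" is a guess made to match the "+" separators used elsewhere in the query. The code only builds the string and does not call the (now changed) service.

```python
def build_filtered_search_url(collection, start_date, end_date,
                              contributor=None, limit=None):
    """Build the date-range search URL with optional contributor and limit."""
    query = "{0}+AND+updatedate:[{1}+TO+{2}]".format(collection, start_date, end_date)
    if contributor is not None:
        # Assumption: spaces in the contributor name are sent as "+".
        query += "+AND+contributor:({0})".format(contributor.replace(" ", "+"))
    url = "http://www.archive.org/services/search.php?query=" + query
    if limit is not None:
        # The limit parameter comes before the submit parameter.
        url += "&limit={0}".format(limit)
    return url + "&submit=submit"


print(build_filtered_search_url("collection:biodiversity", "2008-03-01", "2008-03-14",
                                contributor="MBLWHOI Library", limit=1000))
```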

To search for one particular item, use:

http://www.archive.org/services/search.php?query={0}&submit=submit

where

{0} = an Internet Archive item identifier

Download Files
To download a particular file for an Internet Archive item, use the following query:

http://www.archive.org/download/{0}/{1}

where

{0} = an Internet Archive item identifier
{1} = the name of the file to be downloaded
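A minimal sketch of the download step, using only the Python standard library, might look like this. The function names, the sample identifier, and the file name are hypothetical, and error handling (missing files, timeouts, retries) is left out.

```python
import shutil
import urllib.request


def build_download_url(identifier, filename):
    """Fill the {0} (item identifier) and {1} (file name) placeholders."""
    return "http://www.archive.org/download/{0}/{1}".format(identifier, filename)


def download_file(identifier, filename, destination_path):
    """Fetch one file for an item and write it to destination_path."""
    url = build_download_url(identifier, filename)
    with urllib.request.urlopen(url) as response, open(destination_path, "wb") as out:
        # Stream the response body to disk instead of holding it in memory.
        shutil.copyfileobj(response, out)


# Hypothetical identifier and file name, for illustration only.
print(build_download_url("exampleitem", "exampleitem_meta.xml"))
```

Keeping the URL construction separate from the network call makes the former easy to check without contacting Internet Archive.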

Downloading Files Contained In ZIP Archives
In some cases, a file cannot be downloaded directly, and may instead need to be extracted from a ZIP archive located at Internet Archive. One example of this is the scandata.xml file, which in some cases must be extracted from the scandata.zip file. To do this, two queries must be made. First invoke this query to get the physical file locations (on IA servers) for the given item:

http://www.archive.org/services/find_file.php?file={0}&loconly=1

where

{0} = an Internet Archive item identifier

Then, invoke the second query to extract the scandata.xml file from the scandata.zip file (using the physical file locations returned by the previous query):

http://{0}/zipview.php?zip={1}/scandata.zip&file=scandata.xml

where

{0} = host address for the file
{1} = directory location for the file
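The two requests can be sketched as a pair of URL builders. The format of the find_file.php response is not described in this post, so parsing it is deliberately left out: the host and directory are assumed to have been extracted already. The function names and the sample host and directory values are hypothetical.

```python
def build_find_file_url(identifier):
    """First query: ask where the files for an item physically live."""
    return ("http://www.archive.org/services/find_file.php"
            "?file={0}&loconly=1").format(identifier)


def build_scandata_zip_url(host, directory):
    """Second query: extract scandata.xml from scandata.zip at that location."""
    return ("http://{0}/zipview.php"
            "?zip={1}/scandata.zip&file=scandata.xml").format(host, directory)


# Hypothetical values standing in for what the first query would return.
print(build_find_file_url("exampleitem"))
print(build_scandata_zip_url("host.example.org", "/0/items/exampleitem"))
```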

Note that the second query can be generalized to extract the contents of other zip files hosted at Internet Archive. The format for the query is:

http://{0}/zipview.php?zip={1}/{2}&file={3}.jpg

where

{0} = host address for the file
{1} = directory location for the file
{2} = name of the zip archive from which to extract a file
{3} = the name of the file to extract from the zip archive

Documentation written by Mike Lichtenberg.


March 14, 2008

Written by Chris Freeland

Chris Freeland served as the BHL Technical Director from 2006-2012. He is currently the Director of the Open Libraries program at Internet Archive. In this capacity he works with libraries & publishers to digitize their collections, working towards the Archive’s mission of providing “universal access to all knowledge.”

1 Comment

Jonathan Rochkind, May 19, 2008 at 2:42 pm

It is incredibly frustrating that the XML response from IA has disappeared. I guess I’m glad I didn’t write any code to it yet that has now broken… but I was about to use it for something cool. Now what? Has anyone found any alternative machine-accessible interface for searching the IA corpus?


About BHL

The Biodiversity Heritage Library (BHL) is the world’s largest open access digital library for biodiversity literature and archives. Headquartered at the Smithsonian Libraries and Archives in Washington, D.C., BHL operates as a worldwide consortium of natural history, botanical, research, and national libraries working together to digitize the natural history literature held in their collections and make it freely available for open access as part of a global “biodiversity community.”
