ALCME: OAIHarvester Project


OAIHarvester is a Java application framework that harvests metadata from OAI-compliant servers.

License Information

OCLC Office of Research Public License

Applications of OAIHarvester

Given a set of URLs for OAI-compliant servers, OAIHarvester queries the servers for their most current metadata records. The harvester can be configured to perform custom functions on the harvested record set. The default function dumps the records to standard output (ORG.oclc.oai.harvester.OAIDisplayFunction). An OAIUnionCatalogFunction is also provided, which updates a PearsOAICatalog-based union catalog. The underlying database of OAIUnionCatalogFunction can be changed by extending AbstractCatalog and implementing its abstract methods with the target database engine's API. It is also possible to replace the default function with one that performs an arbitrary task on the record set by extending the OAIHarvestFunction class.

Demo

The OAIHarvester application is available here for evaluation purposes. This demo harvests the OCLC ETD repository and dumps the records to standard output. Right-click each of the following links and save the link target to a local subdirectory. Note: be sure the saved file names match those listed.

Next, obtain Xerces and Tomcat from the Apache Web site and install them. Locate xerces.jar and servlet.jar in the installed packages and either copy them to the target directory or adjust the classpath in the following command to point to their installed location.

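Finally, issue the command:

java -classpath servlet.jar:xerces.jar:pears.jar:harvester.jar ORG.oclc.oai.harvester.Harvester

The colon-separated classpath shown is for Unix-like systems; on Windows, separate the entries with semicolons.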

Configuration Mechanism

The OAIHarvester properties file defaults to the file named harvester.properties in the current directory. This default can be overridden, however, using command line switches when running the application. Regardless, this file must contain the following key=value pairs:

Target OAI Repository Specification

The list of OAI repositories to be harvested and the date of the last harvest for each are maintained via the ORG.oclc.oai.harvester.OAIServerSet interface. The default implementation of this interface is ORG.oclc.oai.harvester.SimpleOAIServerSet. To initialize it, create an XML data file containing the list of servers and an optional start date for each. The serverset.xml file should take the following form:

<HarvesterAdmin>
  <OAIServer>
    <baseURL>http://purl.org/alcme/harvestcat/servlet/OAIHandler</baseURL>
    <lastHarvestDate>2000-01-01</lastHarvestDate>
  </OAIServer>
  <OAIServer>
    <baseURL>http://purl.org/alcme/etdcat/servlet/OAIHandler</baseURL>
    <lastHarvestDate></lastHarvestDate>
  </OAIServer>
  ...
</HarvesterAdmin>
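
Before turning such a file into a serialized server set, it can help to confirm that it parses and lists the servers you expect. The following is a minimal sketch, not part of OAIHarvester: the class ServerSetCheck and its methods are hypothetical names, and it assumes a modern JDK whose built-in XML parser stands in for Xerces. It reads only the element names shown in the sample above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of OAIHarvester): lists the servers in a serverset.xml document.
public class ServerSetCheck {

    // Returns one {baseURL, lastHarvestDate} pair per <OAIServer> element.
    public static List<String[]> readServers(InputStream in) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(in);
        NodeList servers = doc.getElementsByTagName("OAIServer");
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < servers.getLength(); i++) {
            Element server = (Element) servers.item(i);
            result.add(new String[] {
                childText(server, "baseURL"),
                childText(server, "lastHarvestDate")
            });
        }
        return result;
    }

    // Text of the first descendant element with the given name, or "" if absent or empty.
    private static String childText(Element parent, String name) {
        NodeList nodes = parent.getElementsByTagName(name);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Inline copy of the sample; a real check would open serverset.xml instead.
        String xml =
            "<HarvesterAdmin>"
          + "<OAIServer>"
          + "<baseURL>http://purl.org/alcme/harvestcat/servlet/OAIHandler</baseURL>"
          + "<lastHarvestDate>2000-01-01</lastHarvestDate>"
          + "</OAIServer>"
          + "<OAIServer>"
          + "<baseURL>http://purl.org/alcme/etdcat/servlet/OAIHandler</baseURL>"
          + "<lastHarvestDate></lastHarvestDate>"
          + "</OAIServer>"
          + "</HarvesterAdmin>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        for (String[] server : readServers(in)) {
            String date = server[1].isEmpty() ? "(no start date)" : server[1];
            System.out.println(server[0] + " -> " + date);
        }
    }
}
```

To check a file on disk, pass a java.io.FileInputStream for serverset.xml to readServers in place of the inline string.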
    
Given this input file, you can execute the command:

java ORG.oclc.oai.harvester.SimpleOAIServerSet serverset.xml simpleoaiserverset.ser

The result is a serialized OAIServerSet object that can be used by the harvester by assigning the following properties in the harvester.properties file:

OAIServerSet.className=ORG.oclc.oai.harvester.SimpleOAIServerSet
SimpleOAIServerSet.collectionFileName=simpleoaiserverset.ser

If you choose to use a different implementation of the OAIServerSet, then change the className property appropriately. Once the .ser file is created, the Harvester will update the lastHarvestDate and rewrite the file.
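
A quick way to catch a misnamed or missing server-set property before running the harvester is to load the properties file and check for the two keys named above. This is only an illustrative sketch: HarvesterPropertiesCheck is a hypothetical class, not part of OAIHarvester, and it assumes the default SimpleOAIServerSet implementation (a different implementation may not use the collectionFileName key).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper (not part of OAIHarvester): reports missing server-set properties.
public class HarvesterPropertiesCheck {

    // The two property names given in the text above.
    static final String[] SERVER_SET_KEYS = {
        "OAIServerSet.className",
        "SimpleOAIServerSet.collectionFileName"
    };

    // Returns the keys that are absent or blank in the given properties source.
    public static List<String> missingKeys(Reader reader) throws IOException {
        Properties props = new Properties();
        props.load(reader);
        List<String> missing = new ArrayList<>();
        for (String key : SERVER_SET_KEYS) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Inline sample; a real check would read harvester.properties from disk.
        String sample =
            "OAIServerSet.className=ORG.oclc.oai.harvester.SimpleOAIServerSet\n"
          + "SimpleOAIServerSet.collectionFileName=simpleoaiserverset.ser\n";
        System.out.println("Missing keys: " + missingKeys(new StringReader(sample)));
    }
}
```

To check an actual file, pass a java.io.FileReader for harvester.properties to missingKeys.
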
Jeffrey A. Young
Last modified: Fri Aug 10 16:42:21 EDT 2001