[OAI-implementers] Net::OAI::Harvester

Ed Summers ehs@pobox.com
Tue, 8 Jul 2003 13:45:26 -0400


A beta version of a Perl OAI-PMH harvesting library was just uploaded to
CPAN as Net::OAI::Harvester. The idea behind Net::OAI::Harvester is to
provide an object-oriented client interface to the data found in OAI-PMH 
repositories (similar to what LWP::UserAgent does for HTTP). 

More about OAI-PMH can be found here:
     http://www.openarchives.org

And more about Net::OAI::Harvester can be found here:
     http://search.cpan.org/author/ESUMMERS/OAI-Harvester-0.1/

All six OAI-PMH verbs are supported. As an example, here is the code to 
retrieve a particular record from LC as Dublin Core and display its title.

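     use Net::OAI::Harvester;

     # point the harvester at the repository's OAI-PMH base URL
     my $harvester = Net::OAI::Harvester->new(
          baseUrl => 'http://memory.loc.gov/cgi-bin/oai2_0'
     );

     # issue a GetRecord request for one identifier, asking for Dublin Core
     my $record = $harvester->getRecord(
          identifier     => 'oai:lcoa1.loc.gov:loc.gmd/g3764s.pm003250',
          metadataPrefix => 'oai_dc'
     );

     # the metadata object exposes the Dublin Core elements as methods
     my $metadata = $record->metadata();
     print "title: ", $metadata->title(), "\n";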

Features:

- OAI-PMH responses can often be rather large XML files. Net::OAI::Harvester 
  uses stream-based parsing (XML::SAX) and serializes data as Perl objects on 
  disk (using YAML). This serialized data is then made available through
  an iterator interface, which means that you keep a relatively low
  memory footprint when doing ListRecords or ListIdentifiers requests.

- Net::OAI::Harvester includes Net::OAI::Record::OAI_DC, which is an
  XML::SAX handler for parsing and providing an object-oriented
  interface to baseline Dublin Core metadata. It also provides a
  framework for dropping in your own XML::SAX handler if you want to
  parse other types of metadata. The idea is that as people create their
  own handlers they can be easily included in the Net::OAI::Harvester
  distribution.

- If you are interested in the XML itself you can easily get a hold of the 
  temporary file that contains the full XML response, and do what you want 
  with it.

- You can easily get hold of the error code and message associated with any 
  request.
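
The iterator and error reporting described above might be used roughly
like this. It is only a sketch: collect_titles is a hypothetical helper,
and errorCode(), errorString() and next() are assumed to behave as the
comments say, so check them against the documentation for the version
you have installed.

```perl
use strict;
use warnings;

# Hypothetical helper: walk a ListRecords-style response and collect titles.
# $response is assumed to offer errorCode(), errorString() and next(), and
# each record is assumed to offer metadata()->title().
sub collect_titles {
    my ($response) = @_;

    # Assumed: errorCode() returns a false value when the request succeeded.
    if ( my $code = $response->errorCode() ) {
        die "OAI-PMH error $code: " . $response->errorString() . "\n";
    }

    my @titles;

    # next() hands back one record at a time, so only the current
    # record needs to be held in memory.
    while ( my $record = $response->next() ) {
        my $title = $record->metadata()->title();
        push @titles, $title if defined $title;
    }
    return @titles;
}

# Usage against a real repository (requires network access):
#   use Net::OAI::Harvester;
#   my $harvester = Net::OAI::Harvester->new(
#       baseUrl => 'http://memory.loc.gov/cgi-bin/oai2_0'
#   );
#   my $records = $harvester->listRecords( metadataPrefix => 'oai_dc' );
#   print "title: $_\n" for collect_titles($records);
```

Passing the response object in, rather than building the harvester inside
the helper, keeps the same code usable for any verb that returns an
iterator of records.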

Caveats:

- Net::OAI::Harvester only supports OAI-PMH v.2.

- No support for compression (yet).

- Needs more documentation and examples.

- You need to handle resumptionTokens explicitly. This means a call to 
  listRecords() will not go and grab everything, but just the first chunk. 
  However, there are infrastructure and methods to easily get at and pass
  the tokens.
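
Handling resumptionTokens by hand could look something like the loop
below. Again this is a sketch under assumptions: harvest_all is a
hypothetical helper, and it assumes that resumptionToken() returns either
undef or an object with a token() method, and that listRecords() accepts
a resumptionToken argument. Verify both against the installed version.

```perl
use strict;
use warnings;

# Hypothetical helper: keep requesting chunks until the repository stops
# issuing resumptionTokens, calling $callback once per record.
# Returns the number of records seen.
sub harvest_all {
    my ( $harvester, $prefix, $callback ) = @_;

    my $records = $harvester->listRecords( metadataPrefix => $prefix );
    my $count   = 0;

    while ($records) {
        # Assumed: errorCode() is false when the request succeeded.
        if ( my $code = $records->errorCode() ) {
            die "OAI-PMH error $code: " . $records->errorString() . "\n";
        }

        while ( my $record = $records->next() ) {
            $callback->($record);
            $count++;
        }

        # Assumed: resumptionToken() gives an object with token(), or undef.
        # OAI-PMH marks the last chunk with an empty token, so an empty
        # string also ends the loop.
        my $rt    = $records->resumptionToken();
        my $token = $rt ? $rt->token() : undef;
        last unless defined $token && length $token;

        # resumptionToken is an exclusive argument in OAI-PMH, so the
        # metadataPrefix is not repeated on follow-up requests.
        $records = $harvester->listRecords( resumptionToken => $token );
    }
    return $count;
}

# Usage (requires network access):
#   use Net::OAI::Harvester;
#   my $harvester = Net::OAI::Harvester->new(
#       baseUrl => 'http://memory.loc.gov/cgi-bin/oai2_0'
#   );
#   my $n = harvest_all( $harvester, 'oai_dc',
#       sub { print "title: ", $_[0]->metadata()->title(), "\n" } );
```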

Feedback/comments/testers would be appreciated. If you are at all interested in 
getting involved in the project please write to me directly, or (preferably) 
use perl4lib@perl.org or oai-implementers@oaisrv.nsdl.cornell.edu.

//Ed