[OAI-implementers] harvester tools

Kat Hagedorn khage at umich.edu
Mon Aug 9 11:56:01 EDT 2004

Thank you to everyone who responded about harvester tools. Each person 
seems to be using a different harvester (!), but the comments have 
given us an idea of where we want to focus our attentions. Two people 
also let me know in person about harvesters they use. Those are listed 
first below.

Thanks again,
- Kat

On Jul 20, 2004, at 2:15 PM, Kat Hagedorn wrote:

> Hello all,
> We are investigating switching to a different harvester tool and 
> thought that a good first step would be to poll this list about their 
> use of harvesters.
> If you harvest OAI records:
> 1. What harvester tool do you use? Version number?
> 2. Are you pleased with the tool? What do you like and not like about 
> it?
> Please send responses directly to me and I'll summarize for the list. 
> (Anonymously if preferred.)
> Thanks,
> - Kat
> -------------------
> Kat Hagedorn
> OAIster/Metadata Harvesting Librarian
> DLXS Bibliographic Class Coordinator
> DLXS Text Class Collections Co-coordinator
> Digital Library Production Service
> University of Michigan
> http://www.oaister.org/
> http://www.dlxs.org/
> email: khage at umich.edu
> phone: 734-615-7618
> _______________________________________________
> OAI-implementers mailing list
> List information, archives, preferences and to unsubscribe:
> http://openarchives.org/mailman/listinfo/oai-implementers


Virginia Tech Perl Harvester


Simeon's Perl Harvester
contact Simeon Warner for more info (simeon at cs.cornell.edu)


I really like the simple Perl harvester (MyOAI). I'm not sure if there
is any more development on this harvester, but it is great for
troubleshooting problems with your broker/provider, and it harvests
most provider sites. There is no GUI front end, though: it is pretty
much a Unix-type application where you have to edit configuration
files and then run it from the Unix shell. It was a big help to us
when we were setting up seven new data providers with various
problems; we were able to turn logging on and see each HTTP request
being sent for all of the OAI verbs (ListRecords, ListIdentifiers,
etc.).

I've written several harvesters now, and I'm not happy with any of
them. The problem is that so many repositories have badly encoded
characters that I can't rely on DOM or SAX during the harvesting
process without having the parser choke on the bad characters.

Harvesters are trivial to write. Thom Hickey wrote one with a single
page of Python code
(http://www.oclc.org/research/software/oai/2page.htm) and I wrote one
that is even simpler (albeit a bit longer) that I wrote in 

Because they all rely on the data being good, though, they fail way too

My advice is to find an implementation that captures the responses as 
bytes and then greps for the resumptionToken rather than relying on an 
XML parser to find it. A page or two of code is all it should take.
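
The advice above can be sketched in Python roughly as follows. This is
only an illustration, not any respondent's code: the function names
find_resumption_token and next_request_url are hypothetical, and the
pattern assumes the element is written without a namespace prefix. It
searches the raw response bytes with a regular expression, so badly
encoded characters elsewhere in the response never reach an XML parser.

```python
import re
from urllib.parse import urlencode
from xml.sax.saxutils import unescape

# Matches <resumptionToken ...>TOKEN</resumptionToken> in raw bytes.
# Assumption: no namespace prefix on the element name.
TOKEN_RE = re.compile(
    rb"<resumptionToken\b[^>]*>([^<]*)</resumptionToken\s*>"
)


def find_resumption_token(raw):
    """Return the token as a str, or None if it is absent or empty."""
    match = TOKEN_RE.search(raw)
    if match is None:
        return None
    token = match.group(1).strip()
    if not token:
        # An empty element marks the end of the list.
        return None
    # Decode only the token itself, then undo XML escaping (&amp; etc.).
    return unescape(token.decode("ascii", errors="replace"))


def next_request_url(base_url, token):
    """Build the follow-up ListRecords request for a given token."""
    query = urlencode({"verb": "ListRecords", "resumptionToken": token})
    return base_url + "?" + query
```

A harvesting loop would save each response to disk as bytes, call
find_resumption_token on it, and stop when the result is None. The
saved records can then be cleaned up and parsed separately.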


I just started using REAP from UIUC. It is Windows-based. After using it
for only two days, it seems quite capable. It may prove to be weak in
spots, but I probably won't be aware of those for a few weeks.


We used to use ARC from Old Dominion, but my digital library research 
crew now just codes up ad hoc harvesters for different applications.  
We've 
various code chunks that do various parts of the process.  We've also
experimented with Greenstone's harvester module for smaller 


I use 'harvester2' from Jeff Young (OCLC) : 

I like this tool because it is a simple library. I needed this type of 
library for my project.
But there are some problems (bugs): with 'retry later', it seems to 
retry indefinitely. With compression, if the Content-Encoding is null 
and the content is encoded, it does not detect it.
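
Both problems described above can be guarded against in a few lines.
The following Python sketch is an illustration only, not harvester2's
code: the names decode_body and retry_delay, the limit of 5 attempts,
the 60-second fallback and the 600-second cap are all assumptions. It
sniffs the gzip magic number instead of trusting the Content-Encoding
header, and it caps the number of retries.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(body):
    """Gunzip the body if it looks gzipped, whatever the headers say."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body


def retry_delay(attempt, retry_after, max_attempts=5, max_delay=600):
    """Return seconds to wait before retrying, or None to give up.

    attempt is the number of retries already made; retry_after is the
    raw Retry-After header value (a str, or None if it was missing).
    """
    if attempt >= max_attempts:
        return None  # stop instead of retrying indefinitely
    try:
        delay = int(retry_after)
    except (TypeError, ValueError):
        delay = 60  # assumed fallback for a missing or non-numeric value
    return max(0, min(delay, max_delay))
```

A caller would sleep for the returned number of seconds after a 'retry
later' response and report an error once retry_delay returns None.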


Internally, in the LANL repository infrastructure, we use OCLC's 
OAIHarvester version 1 for _big time_ harvesting of complex objects 
(not DC records but actual content represented using MPEG-21 DIDL).

We have built the OAI-PMH Federator (see our JCDL paper 
on the basis of this Harvester.

We love it.  It's faster than OAIHarvester2.  Jeff Young keeps 
supporting it, and actually implemented optimizations as a result of 
our feedback.  On demand.  What more can one ask for!?


I use Celestial (http://celestial.eprints.org/), the latest version,
and I update it when I can.

Yes. [pleased with the tool]

I like the web interface and that it is written in Perl.

I don't like:
- how it manages the '&' sign [it transforms them into "&"]
- I can't harvest a selection of sets; I need to harvest the whole site.
