[OAI-implementers] responsible harvesting (is good for both harvesters and repositories)

Simeon Warner simeon@cs.cornell.edu
Tue, 3 Sep 2002 11:24:03 -0400 (EDT)


Recent attempts to harvest from arXiv.org by tips.imag.fr
consisted of 11700 requests at a rate of about 3-4 per second, each one
receiving a 503 reply saying to wait 60s.

Eventually this site was blocked automatically at our firewall by our
robot detection script. Although 11700 requests add a bunch of junk to
our logs, that doesn't really hurt us. No 'From' address was supplied
with the requests so we have no way of contacting the operator of this
harvester. The net result of all of this is wasting a little of our time
and of the harvester operator's time, without achieving the original goal
of sharing arXiv metadata.

Lessons (see also sections 2 and 2.1 of 
http://www.openarchives.org/OAI/2.0/guidelines-harvester.htm):

1) include a 'From' header with requests
2) respect 503 replies and wait the time they ask for before retrying
3) make sure that a harvester will not blindly keep on making the same 
request for more than a few tries
4) put in a small delay between robot requests so that if there is a bug
the sequence of requests can never get too rapid.
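
For anyone who wants something concrete, the four points above might be
sketched roughly as follows in Python. This is only an illustration, not
code from any existing harvester: the contact address, the retry limit,
the delay values and the names polite_fetch and retry_wait are all
made up for the example, and it assumes the 503 reply carries its wait
time in a Retry-After header given as a number of seconds.

```python
import time
import urllib.error
import urllib.request

# Hypothetical values for illustration; substitute your own.
CONTACT = "harvester-admin@example.org"
MAX_TRIES = 3          # give up rather than repeat the same request forever
MIN_DELAY = 1.0        # seconds to pause after every request, even on success
DEFAULT_WAIT = 60.0    # used if a 503 carries no numeric Retry-After


def retry_wait(headers, default=DEFAULT_WAIT):
    """Seconds to wait after a 503, taken from Retry-After when it is a number.

    A Retry-After given as an HTTP date is not parsed here; the default
    wait is used instead.
    """
    value = headers.get("Retry-After") if headers is not None else None
    try:
        return max(float(value), MIN_DELAY)
    except (TypeError, ValueError):
        return default


def polite_fetch(url, opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url with a From header, honouring 503 and limiting retries."""
    # Lesson 1: tell the repository who to contact.
    request = urllib.request.Request(url, headers={"From": CONTACT})
    for attempt in range(1, MAX_TRIES + 1):
        try:
            with opener(request) as response:
                body = response.read()
            sleep(MIN_DELAY)   # lesson 4: never fire requests back to back
            return body
        except urllib.error.HTTPError as err:
            if err.code != 503 or attempt == MAX_TRIES:
                raise          # lesson 3: stop after a few tries
            sleep(retry_wait(err.headers))   # lesson 2: respect 503
```

The opener and sleep arguments exist only so the function can be tried
out without touching a real repository; in normal use
polite_fetch(url) is enough.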

Cheers,
Simeon.