[OAI-implementers] Re: [OAI-general] OAI interop site back up - 76 of 126 sites failing

'Alan Kent' ajk@mds.rmit.edu.au
Tue, 12 Nov 2002 11:13:18 +1100


On Mon, Nov 11, 2002 at 10:28:33AM -0500, Young,Jeff wrote:
> I see that Alan hasn't been able to do a complete harvest of our XTCat
> repository yet. I also know that he isn't alone in this failure. One
> possible reason for this might be that the XTCat repository itself is flaky.
> Another possibility, though, is that people are writing harvesters that
> ignore the potential for unavoidable problems when harvesting large
> repositories. With 4+ million records, XTCat is probably the largest
> repository out there by far. My concern, then, is that OAI seems to be fine
> for all the "toy" ;-) repositories out there now, but people planning large
> repositories should be skeptical about its efficacy.
> 
> I wrote some suggestions on the XTCat home page
> (http://alcme.oclc.org/xtcat/index.html) that I think can help, but I doubt
> that anyone has adopted any of these suggestions. These suggestions are to:
> 1) retry failed requests a few times before giving up, 2) retry later with
> the last resumptionToken rather than restart from scratch, and 3) write the
> responses to a file for later processing rather than perform synchronous
> updates to a database.
> 
> Even if XTCat is occasionally flaky, which may or may not be the case, these
> actions can greatly mitigate any problems that occur.
> 
> Jeff

I may actually turn off harvesting xtcat - I forgot it was 4,000,000
records! I think I will run out of disk space trying to harvest it. :-)

In terms of the retry suggestions, our code does remember the last
resumption token it held when a harvest failed. Currently we do not retry
immediately (which we could do); since the harvests are scheduled regularly,
we wait until the next harvest time comes along and attempt to resume from
that token. So we do (2) but not (1) at present. We don't do (3) because
we find that updating the database takes about the same time as the HTTP
request (i.e., it's not a significant bottleneck). (This is not always true
- it depends on the site.) It also avoids big temporary files (our
database compresses the records, so the database *plus* full text indexes
is often smaller than the input data). But if you have a slow database
engine I understand what you are getting at.
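
A minimal sketch of the "do (2) but not (1)" behaviour described above:
each scheduled run resumes from the last saved resumption token and simply
stops at the first failure. This is not our actual code. The names
`fetch_page`, `store_records` and `state_path` are hypothetical;
`fetch_page(token)` is assumed to return `(records, next_token)` and to
raise `OSError` on a network failure.

```python
import json
import os


def load_token(state_path):
    """Return the saved resumption token, or None if nothing is pending."""
    try:
        with open(state_path) as f:
            return json.load(f).get("resumption_token")
    except FileNotFoundError:
        return None


def save_token(state_path, token):
    """Write the token to a temp file, then rename it over the old state."""
    tmp = state_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"resumption_token": token}, f)
    os.replace(tmp, state_path)


def run_scheduled_harvest(fetch_page, store_records, state_path):
    """One scheduled run. Returns True if the harvest ran to completion."""
    token = load_token(state_path)  # None means "start a new harvest"
    while True:
        try:
            records, next_token = fetch_page(token)
        except OSError:
            # No immediate retry: the saved token lets the next
            # scheduled run pick up from the same point.
            return False
        store_records(records)  # update the database as each page arrives
        save_token(state_path, next_token)
        if next_token is None:
            return True
        token = next_token
```

The token is saved only after the records are stored, so a failure between
the two steps repeats a page rather than skipping one.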

The latest request for xtcat returned an HTML page, by the way.
In the Apache-generated HTML it had the text

    "The requested resource (/xtcat/servlet/OAIHandler) is
    not available."

On a previous day we had a DNS lookup error (alcme.oclc.org)!

Is this any criticism of xtcat? No, it is more that I agree completely that
doing long-duration things over HTTP is prone to all sorts of little
unexpected problems.
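
One of those little problems, an HTML error page arriving where an OAI
response was expected, can be caught before the body reaches the record
parser. The following is only a sketch: it assumes an OAI-PMH 2.0 data
provider (root element `OAI-PMH` in the 2.0 namespace), and the function
name `is_oai_response` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Root element of an OAI-PMH 2.0 response, in ElementTree's {namespace}tag form.
OAI_ROOT = "{http://www.openarchives.org/OAI/2.0/}OAI-PMH"


def is_oai_response(body):
    """Return True only if body is well-formed XML with an OAI-PMH root."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Typical for server-generated HTML, which is rarely well-formed XML.
        return False
    # Well-formed XHTML error pages still fail here: the root is not OAI-PMH.
    return root.tag == OAI_ROOT
```

A harvester could treat a False result like any other transport failure and
keep its last good resumption token.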

So I will look at putting (1) on the feature list (immediate retries),
but only for OAI data providers that report that they support idempotent
resumption tokens (which I assume xtcat does).
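
A rough sketch of what (1) might look like, with the retry gated on token
idempotency. Everything here is an assumption rather than existing code:
`fetch_page` is a hypothetical callable that raises `OSError` on failure,
`idempotent_tokens` is a flag the caller supplies (how a provider reports
it is left open), and the attempt count and delay are arbitrary.

```python
import time


def fetch_with_retries(fetch_page, token, idempotent_tokens,
                       attempts=3, delay=5.0, sleep=time.sleep):
    """Fetch one page, retrying at once only when re-sending is safe."""
    # Assumption: the first request carries no token, so repeating it is
    # treated as safe regardless of the idempotency flag.
    allowed = attempts if (idempotent_tokens or token is None) else 1
    for attempt in range(1, allowed + 1):
        try:
            return fetch_page(token)
        except OSError:
            if attempt == allowed:
                raise  # give up; the caller can resume at the next run
            sleep(delay)
```

When the flag is false, the function makes a single attempt and the
failure propagates, which matches waiting for the next scheduled harvest.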

But I had better find some more disk space before I try to complete the
harvest!!!! (I have around 1,100,000 records so far.)

Alan