[OAI-implementers] Better resumption mechanism - more important than ever!
Tue, 05 Mar 2002 15:24:16 -0500
this harvesting discussion is quite fascinating. here are some thoughts
on how i approach harvesting:
personally, my harvesters do not reissue resumptionTokens if something
goes wrong - rather, an error report is filed (usually by sending me
email and/or logging the problem) and harvesting is aborted. i will
someday include automatic exponential backoff/retry but the need hasn't
been great enough to warrant that just yet.
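to illustrate, here is a rough sketch (python) of the backoff/retry i have in mind. `fetch` is a hypothetical callable standing in for one OAI request - the names are illustrative, not my actual code:

```python
import time

def harvest_with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Retry a single harvest request with exponential backoff.

    `fetch` is a hypothetical callable that performs one OAI request
    and raises an exception on failure.
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                # give up: file the error report and abort, as above
                raise
            # wait 1s, 2s, 4s, ... before retrying
            time.sleep(base_delay * (2 ** attempt))
```

note that a fresh request is issued each time - the resumptionToken from the failed response is never reissued.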
a single harvesting operation is considered to be a sequence of
requests, including as many resumptions as is necessary. the "last
harvested date" is only updated when a single harvesting operation is
completed successfully. this ensures that if any archive uses internal
state and the network fails, i will not lose any records. while speed
and recovery are important to me, integrity of the data is more so since
most of my harvesters are part of hierarchical systems and i cannot
afford to have data go missing in the early stages.
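the idea can be sketched like this (python; `list_records` is a hypothetical stand-in for issuing a ListRecords request and following its resumptionToken - a toy model, not my real harvester):

```python
def harvest_all(list_records, saved_from):
    """One harvesting operation: follow resumptionTokens to completion,
    and only compute a new 'last harvested date' if everything succeeds.

    `list_records` models one ListRecords request and returns
    (records, resumption_token_or_None).
    """
    records = []
    token = None
    while True:
        batch, token = list_records(from_=saved_from, token=token)
        records.extend(batch)  # an exception anywhere aborts the run;
                               # saved_from is NOT advanced, so no
                               # records are lost on the next attempt
        if token is None:
            break
    new_from = max(r["datestamp"] for r in records) if records else saved_from
    return records, new_from
```

the caller persists `new_from` only after the function returns - a failed run leaves the old date untouched.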
as far as scheduling goes, i run independent processes for each archive
with a 2-level scheduling system (firstly, how often to check the global
schedule, and secondly, how often to harvest the individual archives).
failures are signalled by a fault report and a persistent lock on the
archive. after i investigate what caused the failure i remove the lock -
(it should be trivial to have the retry mechanism do this in future).
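a minimal sketch of the lock-on-failure idea (python; the lock filename and the `do_harvest` callable are illustrative assumptions, not my actual code):

```python
import os

def try_harvest(archive_dir, do_harvest):
    """Skip an archive that is locked from a previous failure; on a
    new failure, write a persistent lock containing the fault report.

    `do_harvest` performs the full harvesting operation for one
    archive and raises on failure.
    """
    lock = os.path.join(archive_dir, "HARVEST.lock")
    if os.path.exists(lock):
        return "locked"  # stays locked until investigated by hand
    try:
        do_harvest()
        return "ok"
    except Exception as err:
        with open(lock, "w") as f:
            f.write(f"fault: {err}\n")  # the fault report
        return "failed"
```

removing the lock file is the manual "i've investigated it" step; an automatic retry mechanism would just delete it itself.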
to state the stateless resumption problem a little differently:
is the OAI protocol idempotent ?
if a request is submitted twice, will the responses be identical ?
obviously not, since new records could have been added. maybe we need a
weaker condition - like, if req2 is issued after req1, then the response
res2 must contain at least all of the contents of the response res1.
would this work ? i don't think so - if we update a record, its
datestamp would cause it to move out of range of a from/until
specification. is this the only case where the weaker condition fails ?
if so, what if we ditch the until parameter ?
it would be nice to have a rigorous mathematical framework within which
we can reason about the stability of algorithms related to the OAI-PMH.
until i come up with one ;), i'm just sticking with my best judgement
(no repeated resumptiontokens and no until parameters).
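the weaker condition can at least be stated as a toy model (python; responses modelled as identifier -> datestamp maps, purely for illustration):

```python
def is_monotone(res1, res2):
    """Weaker-than-idempotence condition: a response res2, to a request
    issued after req1, must contain at least everything res1 did.
    Responses are modelled as dicts of identifier -> datestamp.
    """
    return set(res1) <= set(res2)

# an update moves a record's datestamp past the 'until' bound, so the
# record drops out of the later response and the condition fails:
res1 = {"oai:x:1": "2002-03-01", "oai:x:2": "2002-03-02"}
res2 = {"oai:x:2": "2002-03-02"}  # oai:x:1 was updated past 'until'
```

without an until parameter, an updated record stays in range, so the only remaining failure mode i can see is record deletion.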
Michael L. Nelson wrote:
> On Tue, 5 Mar 2002, Young,Jeff wrote:
>>I'd be very disappointed if ETDCat required custom and unique consideration
>>from harvesters merely because of its size. Partitioning the collection
>>would be a case in point. The implication seems to be that harvesters would
>>somehow know to query the list of sets and then loop through each of them.
>>How would an arbitrary harvester know to do that, and is their software even
>>capable of it without custom coding? It would also prevent me from using
>>sets for legitimate purposes since I couldn't distinguish between them.
> my original point about sets was not intended to be the primary point. if
> ETDCat already uses sets, then Alan's harvester should investigate
> harvesting by set, since it will naturally partition the collection (modulo
> Liu's point that not all records are guaranteed to be in sets).
>>I'd be happy to implement stateless resumptionTokens, but unless harvesters
>>know how to use them for recovery, why bother? How many harvesters today
>>could manage a recovery using stateless resumptionTokens? How many
>>harvesters will handle it tomorrow if OAI remains agnostic on the issue?
> I'm not sure, but I would guess this would be the default behavior. If
> the harvester chokes, I would start again where it left off. If
> successful harvesting continues, then there was a transient error. If it
> fails again, then maybe the repository has a problem.
> In 2.0, it will be even easier to determine:
> - when the resumptionToken expires
> - how big the result set is
> - and how many records the repository has transmitted so far
> Any harvester writers out there care to comment? Liu? Hussein?
>>I'm sure ETDCat needs more stress testing to minimize future failures. The
>>fact that we've discussed this before, though, indicates a recognition that
>>problems can happen. I don't blame Alan if he doesn't want to negotiate
>>special rules for harvesting ETDCat merely because the risk is proportional
>>to the size of the repository.
> (I hope no one interprets my comments as beating up on ETDCat)
> These are interesting points: to the best of my knowledge, if ETDCat has
> 4M records, it's by far the biggest OAI repository out there. My point
> is that 4M of anything is a big number, and repositories that large need
> to make sure they implement features that facilitate fault-tolerant
> harvesting. Stateless (or very long lived) resumptionTokens would appear
> to be one of those features. Also, if you have 4M records all with the
> same datestamp, this would seem to be an ideal candidate for some response
> caching techniques, which, tied with very long-lived (2-3 days?)
> resumptionTokens, would seem to make for an efficient load on your end.
>>From: Michael L. Nelson [mailto:email@example.com]
>>Sent: Tuesday, March 05, 2002 10:03 AM
>>To: 'Alan Kent'
>>Cc: OAI Implementors
>>Subject: Re: [OAI-implementers] Better resumption mechanism - more
>>important than ever!
>>actually, the way I see it, the protocol should not be complicated with
>>additional tokens and such to enforce what ETDCat (and similarly
>>large-sized DPs) should do:
>>1. partition their collection into sets
>>2. use stateless (or very long lived) resumptionTokens
>>in 2.0, resumptionTokens will have optional attributes, including
>>"expirationDate", so this will take the guesswork out of knowing how long
>>a resumptionToken will be valid.
>>IMO, introducing an optional restartToken is no different (from an
>>implementer's point of view) than making the resumptionToken last a long
>>time. at some point, you (as a harvester) are simply at the mercy of the
>>repository. new features in the protocol won't change that.
>>On Tue, 5 Mar 2002, 'Alan Kent' wrote:
>>>I just got some mail from Jeff at OCLC talking about ETDCat (hope
>>>you don't mind me quoting some of your mail Jeff). In particular,
>>>he just told me
>>> ETDCat contains a lot of records (over 4 million), all of
>>> which currently have the exact same datestamp from the initial load.
>>>He also told me that there were no sets. So basically, it's all
>>>or nothing for this site because OAI has no standard way to resume
>>>if a transfer fails.
>>>If this has happened already, I think it's likely to occur again.
>>>(That is, one very large database all with the same time stamp.)
>>>So any comments about having a single large collection like this
>>>are beside the point. The point is that OAI does not handle it well.
>>>So I would like to resurrect the discussion again, if people don't
>>>mind, on how to support restarts. My understanding of the general
>>>feeling so far is
>>>(1) Mandating support is not going to be acceptable
>>>(2) Mandating format of resumption tokens is not going to be acceptable
>>>(3) Mandating resumption tokens be long-lived (eg: can try again the
>>> following day) is not acceptable
>>>(4) In fact, mandating that resumption tokens be unique (allowing
>>> a token to be reused twice in quick succession to get the same
>>> data) is not acceptable
>>>So any proposal needs to be optionally supported.
>>>Does anyone else think that this is a major hole in OAI? I personally
>>>do. After trying to crawl sites, things go wrong. The larger the site,
>>>the greater the probability that something will go wrong. The larger
>>>the site, the greater the pain of starting all over again. I do not
>>>think it is practical for anyone to harvest ETDCat if it's really got
>>>4,000,000 records. Any fault, and start downloading that 4gb again!
>>>So I feel strongly on this one. In fact, I think this is the biggest
>>>problem OAI has.
>>>Do people think it's better to reuse resumption tokens for this purpose,
>>>or introduce a different sort of token? ETDCat, for example, I think
>>>allocates a session id in resumption tokens, meaning they cannot
>>>be reused when the session times out in the server (similar semantics
>>>anyway). This is a reasonable implementation decision to make.
>>>So maybe it's better for servers to return an additional token,
>>>which is a <restartToken> which means a client can instead of
>>>specifying from= and to= again, specify restartToken= instead where
>>>the server then automatically works out whatever other parameters
>>>it needs, creates a new session etc internally. The new 'session'
>>>(ListXXX verb) then can use resumptionTokens to manage that new session.
>>>The idea is for a <restartToken> to be long-lived. It may be less
>>>efficient to use than a resumptionToken, but its only purpose is recovery
>>>if the client fails the download. If a server does not support
>>>restartToken, it simply never returns one. Large collections *should* support them.
>>>For my harvester, I can then remember (to disk) the restartToken for
>>>every packet I get back, allowing me to recover much more easily
>>>if anything crashes. If restartTokens are too hard for someone
>>>to implement, then they simply don't. If you have a large data collection,
>>>on the other hand, to reduce network load, I think it's probably worth
>>>the extra effort of supporting restartTokens.
>>>Any comments? Better suggestions?
>>>OAI-implementers mailing list
>>Michael L. Nelson
>>NASA Langley Research Center firstname.lastname@example.org
>>MS 158, Hampton, VA 23681 http://www.ils.unc.edu/~mln/
>>+1 757 864 8511 +1 757 864 8342 (f)
hussein suleman - firstname.lastname@example.org - vtcs - http://www.husseinsspace.com