[OAI-implementers] Better resumption mechanism - more important than ever!

Simeon Warner simeon@cs.cornell.edu
Tue, 5 Mar 2002 14:48:43 -0500 (EST)


On Tue, 5 Mar 2002, Xiaoming Liu wrote:
> On Tue, 5 Mar 2002, Young,Jeff wrote:
> > I'd be happy to implement stateless resumptionTokens, but unless harvesters
> > know how to use them for recovery, why bother? How many harvesters today
> > could manage a recovery using stateless resumptionTokens? How many
> > harvesters will handle it tomorrow if OAI remains agnostic on the issue?
> 
> I guess the major issue is that harvesters should follow a certain
> policy. In the current implementation in Arc, the harvester retries each
> failed request at most three times using the same HTTP request, and then
> gives up. This policy has really helped several times, but not
> too often ;-)

Liu, your policy is the sort of thing I had imagined. However, I'm curious
about how frequently you find that a sequence of harvests fails. When I
last did an extensive harvest (last summer) I found that, provided
repositories had implemented the protocol properly, I rarely had problems
getting successful responses to complete a List request. Can you give us
some (approximate) statistics?
 
> Ideally, I guess a harvester could use an exponential backoff algorithm to
> keep trying until the resumptionToken has expired (considering that a
> time-to-live parameter will be added in 2.0). And if we implement the
> harvester with multiple processes/threads, the system should still scale
> well when several resumptionToken errors occur.
> 
> I think something like an "implementation guide" or "reference
> implementation" would help harvesters and DPs understand each other
> beyond the core protocol.

Yes, this should certainly be covered in the implementation guidelines.
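
A minimal sketch of the retry policy discussed above, in Python. It is
not taken from Arc or any other harvester: the names retry_with_backoff
and fetch_page are hypothetical, and it assumes the repository reports
when the resumptionToken expires (the "expirationDate" attribute planned
for 2.0). The same request is retried with a doubling delay until it
succeeds, the attempt limit is hit, or the next wait would outlive the
token.

```python
import time
import urllib.parse
import urllib.request
from datetime import datetime, timedelta, timezone


def retry_with_backoff(fetch, expires_at=None, base_delay=1.0, max_delay=300.0,
                       max_attempts=None, sleep=time.sleep,
                       now=lambda: datetime.now(timezone.utc)):
    """Call fetch() until it succeeds, doubling the wait after each failure.

    Gives up (re-raising the last error) when max_attempts is reached or
    when the next wait would run past expires_at, a timezone-aware datetime.
    """
    attempt = 0
    delay = base_delay
    while True:
        attempt += 1
        try:
            return fetch()
        except OSError as err:  # urllib.error.URLError is a subclass of OSError
            last_error = err
        if max_attempts is not None and attempt >= max_attempts:
            raise last_error
        if expires_at is not None and now() + timedelta(seconds=delay) >= expires_at:
            raise last_error  # the token would be dead before the next try
        sleep(delay)
        delay = min(delay * 2, max_delay)


def fetch_page(base_url, token):
    """Hypothetical helper: request the next chunk of a ListRecords response."""
    query = urllib.parse.urlencode({"verb": "ListRecords",
                                    "resumptionToken": token})
    with urllib.request.urlopen(base_url + "?" + query, timeout=60) as resp:
        return resp.read()
```

The "three tries, then give up" policy is then
retry_with_backoff(lambda: fetch_page(url, token), max_attempts=3), and
the backoff-until-expiry variant passes expires_at instead. Neither helps
if the repository rejects a resumptionToken that is presented twice.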

Cheers,
Simeon.
 

> regards,
> liu
> > 
> > I'm sure ETDCat needs more stress testing to minimize future failures. The
> > fact that we've discussed this before, though, indicates a recognition that
> > problems can happen. I don't blame Alan if he doesn't want to negotiate
> > special rules for harvesting ETDCat merely because the risk is proportional
> > to the size of the repository.
> > 
> > Jeff
> > 
> > -----Original Message-----
> > From: Michael L. Nelson [mailto:mln@ils.unc.edu]
> > Sent: Tuesday, March 05, 2002 10:03 AM
> > To: 'Alan Kent'
> > Cc: OAI Implementors
> > Subject: Re: [OAI-implementers] Better resumption mechanism - more
> > important than ever!
> > 
> > actually, the way I see it is the protocol should not be complicated with
> > additional tokens and such to enforce what ETDCat (and similarly
> > large-sized DPs) should do:
> > 
> > 1.  partition their collection into sets
> > 2.  use stateless (or very long lived) resumptionTokens
> > 
> > in 2.0, resumptionTokens will have optional attributes, including
> > "expirationDate", so this will take the guesswork out of knowing how long
> > a resumptionToken will be valid.
> > 
> > IMO, introducing an optional restartToken is no different (from an
> > implementer's point of view) than making the resumptionToken last a long
> > time.  
> > 
> > at some point, you (as a harvester) are simply at the mercy of the
> > repository.  new features in the protocol won't change that.
> > 
> > regards,
> > 
> > Michael
> > 
> > On Tue, 5 Mar 2002, 'Alan Kent' wrote:
> > 
> > > I just got some mail from Jeff at OCLC talking about ETDCat (hope
> > > you don't mind me quoting some of your mail Jeff). In particular,
> > > he just told me
> > > 
> > >     ETDCat contains a lot of records (over 4 million), all of
> > >     which currently have the exact same datestamp from the initial load.
> > > 
> > > He also told me that there were no sets. So basically, it's all
> > > or nothing for this site because OAI has no standard way to resume
> > > if a transfer fails.
> > > 
> > > If this has happened already, I think it's likely to occur again.
> > > (That is, one very large database all with the same time stamp.)
> > > So any comments about having a single large collection like this
> > > are beside the point. The point is OAI does not handle it well.
> > > 
> > > So I would like to resurrect the discussion again, if people don't
> > > mind, on how to support restarts. My understanding of the general
> > > feeling so far is
> > > 
> > > (1) Mandating support is not going to be acceptable
> > > 
> > > (2) Mandating format of resumption tokens is not going to be acceptable
> > > 
> > > (3) Mandating resumption tokens be long-lived (e.g., can try again the
> > >     following day) is not acceptable
> > > 
> > > (4) In fact, mandating that resumption tokens be unique (allowing
> > >     a token to be reused twice in quick succession to get the same
> > >     data) is not acceptable
> > > 
> > > So any proposal needs to be optionally supported.
> > > 
> > > Question time:
> > > 
> > > Does anyone else think that this is a major hole in OAI? I personally
> > > do. After trying to crawl sites, things go wrong. The larger the site,
> > > the greater the probability that something will go wrong. The larger
> > > the site, the greater the pain of starting all over again. I do not
> > > think it is practical for anyone to harvest ETDCat if it has really got
> > > 4,000,000 records. Any fault, and you start downloading that 4 GB again!
> > > So I feel strongly on this one. In fact, I think this is the biggest
> > > problem OAI has.
> > > 
> > > Do people think it's better to reuse resumption tokens for this purpose,
> > > or to introduce a different sort of token? ETDCat, for example, I think
> > > allocates a session id in resumption tokens, meaning they cannot
> > > be reused once the session times out in the server (similar semantics
> > > anyway). This is a reasonable implementation decision to make.
> > > So maybe it's better for servers to return an additional token,
> > > a <restartToken>. Instead of specifying from= and until= again, a
> > > client specifies restartToken=, and the server automatically works
> > > out whatever other parameters it needs, creates a new session, etc.
> > > internally. The new 'session' (ListXXX verb) can then use
> > > resumptionTokens to manage that new transfer.
> > > 
> > > The idea is for a <restartToken> to be long-lived. It may be less
> > > efficient to use than a resumptionToken, but its only purpose is
> > > recovery when the client's download fails. If a server does not
> > > support restartToken, it simply never returns one. Large collections
> > > *should* support restartTokens.
> > > 
> > > For my harvester, I can then remember (to disk) the restartToken for
> > > every packet I get back, allowing me to recover much more easily
> > > if anything crashes. If restartTokens are too hard for someone
> > > to implement, then they don't implement them. If you have a large
> > > data collection, on the other hand, I think it's probably worth the
> > > extra effort of supporting restartTokens to reduce network load.
> > > 
> > > Any comments? Better suggestions?
> > > 
> > > Alan
> > > _______________________________________________
> > > OAI-implementers mailing list
> > > OAI-implementers@oaisrv.nsdl.cornell.edu
> > > http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers
> > > 
> > 
> > ---
> > Michael L. Nelson
> > NASA Langley Research Center		m.l.nelson@larc.nasa.gov
> > MS 158, Hampton, VA 23681		http://www.ils.unc.edu/~mln/
> > +1 757 864 8511				+1 757 864 8342 (f)
> > 
> > 
> > 
> 
>