[OAI-implementers] Better resumption mechanism - more important than ever!

Xiaoming Liu liu_x@cs.odu.edu
Tue, 5 Mar 2002 11:36:23 -0500 (EST)


On Tue, 5 Mar 2002, Michael L. Nelson wrote:

> 
> actually, the way I see it is the protocol should not be complicated with
> additional tokens and such to enforce what ETDCat (and similarly
> large-sized DPs) should do:
> 
> 1.  partition their collection into sets


I agree with Michael on everything except the sets point, since OAI
doesn't guarantee that a harvester will get everything if it harvests
by sets.

So the only possibility may be stateless resumptionTokens (implemented
by sorting records on a fine-grained datestamp).
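
A minimal sketch of what such a stateless token could look like, assuming
records are served in (datestamp, identifier) order and the token simply
encodes the sort key of the last record sent. The identifier tie-break,
the base64/JSON token format, the page size and the function names are
all assumptions made for illustration, not anything the protocol
mandates; the in-memory list stands in for a real database query.

```python
import base64
import json

PAGE_SIZE = 2  # hypothetical; kept tiny so the example is easy to follow


def encode_token(datestamp, identifier):
    """Pack the sort key of the last record sent into an opaque token."""
    raw = json.dumps([datestamp, identifier]).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_token(token):
    """Recover the (datestamp, identifier) sort key from a token."""
    datestamp, identifier = json.loads(
        base64.urlsafe_b64decode(token.encode("ascii")))
    return datestamp, identifier


def list_records(records, token=None, page_size=PAGE_SIZE):
    """Return (page, next_token); next_token is None on the last page.

    No session is kept on the server: everything needed to continue is
    inside the token, so it can be replayed at any later time.
    """
    # Assumes datestamps share one ISO 8601 format, so string order
    # matches chronological order.
    ordered = sorted(records,
                     key=lambda r: (r["datestamp"], r["identifier"]))
    if token is not None:
        last_key = decode_token(token)
        ordered = [r for r in ordered
                   if (r["datestamp"], r["identifier"]) > last_key]
    page = ordered[:page_size]
    if len(ordered) <= page_size:
        return page, None
    last = page[-1]
    return page, encode_token(last["datestamp"], last["identifier"])
```

Because the identifier breaks ties, this ordering still works when every
record carries the same datestamp, as in the ETDCat case. A harvester
that saves the last token it received can re-send it after a crash and
continue from that point.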

regards,
liu
 


> 2.  use stateless (or very long lived) resumptionTokens
> 
> in 2.0, resumptionTokens will have optional attributes, including
> "expirationDate", so this will take the guess work out of knowing how long
> a resumptionToken will be valid.
> 
> IMO, introducing an optional restartToken is no different (from an
> implementer's point of view) than making the resumptionToken last a long
> time.  
> 
> at some point, you (as a harvester) are simply at the mercy of the
> repository.  new features in the protocol won't change that.
> 
> regards,
> 
> Michael
> 
> On Tue, 5 Mar 2002, 'Alan Kent' wrote:
> 
> > I just got some mail from Jeff at OCLC talking about ETDCat (hope
> > you don't mind me quoting some of your mail Jeff). In particular,
> > he just told me
> > 
> >     ETDCat contains a lot of records (over 4 million), all of
> >     which currently have the exact same datestamp from the initial load.
> > 
> > He also told me that there were no sets. So basically, it's all
> > or nothing for this site because OAI has no standard way to resume
> > if a transfer fails.
> > 
> > If this has happened already, I think it's likely to occur again.
> > (That is, one very large database all with the same time stamp.)
> > So any comments about having a single large collection like this
> > are beside the point. The point is OAI does not handle it well.
> > 
> > So I would like to resurrect the discussion again if people don't
> > mind on how to support restarts. My understanding of the general
> > feeling so far is
> > 
> > (1) Mandating support is not going to be acceptable
> > 
> > (2) Mandating format of resumption tokens is not going to be acceptable
> > 
> > (3) Mandating resumption tokens be long-lived (eg: can try again the
> >     following day) is not acceptable
> > 
> > (4) In fact, mandating that resumption tokens be unique (allowing
> >     a token to be reused twice in quick succession to get the same
> >     data) is not acceptable
> > 
> > So any proposal needs to be optionally supported.
> > 
> > Question time:
> > 
> > Does anyone else think that this is a major hole in OAI? I personally
> > do. After trying to crawl sites, things go wrong. The larger the site,
> > the greater the probability that something will go wrong. The larger
> > the site, the greater the pain of starting all over again. I do not
> > think it is practical for anyone to harvest ETDCat if it has really got
> > 4,000,000 records. Any fault, and start downloading that 4 GB again!
> > So I feel strongly on this one. In fact, I think this is the most
> > major problem OAI has.
> > 
> > Do people think it's better to reuse resumption tokens for this purpose,
> > or introduce a different sort of token? ETDCat for example I think
> > allocates a session id in resumption tokens, meaning they cannot
> > be reused when the session times out in the server (similar semantics
> > anyway). This is a reasonable implementation decision to make.
> > So maybe it's better for servers to return an additional token,
> > which is a <restartToken> which means a client can instead of
> > specifying from= and to= again, specify restartToken= instead where
> > the server then automatically works out whatever other parameters
> > it needs, creates a new session etc internally. The new 'session'
> > (ListXXX verb) then can use resumptionTokens to manage that new
> > transfer.
> > 
> > The idea is for a <restartToken> to be long-lived. It may be less
> > efficient to use than a resumptionToken, but its only purpose is
> > recovery when a client's download fails. If a server does not support
> > restartToken, it simply never returns one. Large collections *should*
> > support restartTokens.
> > 
> > For my harvester, I can then remember (to disk) the restartToken for
> > every packet I get back, allowing me to recover much more easily
> > if anything crashes. If restartTokens are too hard for someone
> > to implement, then they don't. If you have a large data collection
> > on the other hand, to reduce network load, I think it's probably worth
> > the extra effort of supporting restartTokens.
> > 
> > Any comments? Better suggestions?
> > 
> > Alan
> > _______________________________________________
> > OAI-implementers mailing list
> > OAI-implementers@oaisrv.nsdl.cornell.edu
> > http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers
> > 
> 
> ---
> Michael L. Nelson
> NASA Langley Research Center		m.l.nelson@larc.nasa.gov
> MS 158, Hampton, VA 23681		http://www.ils.unc.edu/~mln/
> +1 757 864 8511				+1 757 864 8342 (f)
> 
> 