[OAI-implementers] Better resumption mechanism - more important than ever!

Simeon Warner simeon@cs.cornell.edu
Tue, 5 Mar 2002 10:46:44 -0500 (EST)

I agree with Michael. A repository such as ETDCat should be willing to put
the extra effort in to make harvesting easy and this can be done without
creating any additional barrier for small repositories. I suggest 
that ETDCat should use stateless and reusable resumptionTokens.

The question I see is whether we should come up with a standard way to say
"my resumptionTokens are stateless/reusable". Since this is clearly a
repository-level property the obvious place to put such information would
be in the Identify response. My feeling is that this probably shouldn't be
part of the core protocol so a <description> block would be the best
option. Alternatively it could just become 'standard practice' for
harvesters to attempt to restart a harvest by reusing a resumptionToken
and assuming all is okay if no error is returned.
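
A rough sketch of that 'standard practice' from the harvester side, in
Python. Everything here is hypothetical: fetch() stands in for whatever
issues the ListRecords request, and TokenRejected for however the
harvester detects an OAI error response. The point is only the
try-the-old-token-first logic.

```python
class TokenRejected(Exception):
    """Raised by the (assumed) fetch function when the repository
    answers a resumptionToken request with an error."""


def harvest(fetch, saved_token=None):
    """Yield records, first trying to resume from saved_token.

    fetch(token) is an assumed callable: token=None starts a new
    harvest; it returns (records, next_token), where next_token is
    None on the final response.
    """
    if saved_token is not None:
        try:
            # Optimistically reuse the old token.
            records, next_token = fetch(saved_token)
        except TokenRejected:
            # Token was not reusable: fall back to a fresh harvest.
            records, next_token = fetch(None)
    else:
        records, next_token = fetch(None)

    while True:
        yield from records
        if next_token is None:
            return
        records, next_token = fetch(next_token)
```

If the repository's tokens are stateless, the first branch succeeds and
nothing is re-downloaded; if not, the harvester is no worse off than
before.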


[aside: in arXiv we have a few days with a significant number of updates,
something like 24k. I build resumptionTokens that just encode the from,
until, set and metadataPrefix parameters for a request to continue the
harvest. Since I limit the number of records returned to about 1000, this
means that I have to split the 24k-days. I do this by adding an index
within the datestamp, e.g.:
- usual form: 1997-04-23_2002-01-01__oai_dc
- with index: 1997-11-13#23774_2002-01-01__oai_dc
]
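
A minimal Python sketch of a stateless token in this style. The field
order (from, until, set, metadataPrefix, joined by "_") and the "#index"
suffix are read off the two examples above; the function and field names
are made up for illustration, and this is not the actual arXiv code. It
assumes set names never contain an underscore.

```python
from typing import NamedTuple, Optional


class HarvestState(NamedTuple):
    from_date: str
    until_date: str
    set_spec: str            # empty string when no set was requested
    metadata_prefix: str
    index: Optional[int]     # offset within from_date's records, if split


def encode_token(state):
    """Pack the request parameters into a self-describing token."""
    start = state.from_date
    if state.index is not None:
        start = f"{start}#{state.index}"
    return "_".join([start, state.until_date, state.set_spec,
                     state.metadata_prefix])


def decode_token(token):
    """Recover the request parameters from a token."""
    # Split on the first three underscores only, so a metadataPrefix
    # such as "oai_dc" survives intact.
    start, until_date, set_spec, metadata_prefix = token.split("_", 3)
    from_date, sep, index = start.partition("#")
    return HarvestState(from_date, until_date, set_spec, metadata_prefix,
                        int(index) if sep else None)
```

Because the token carries everything needed to continue, the server keeps
no session and the same token can be replayed at any time. Note that "#"
has to be escaped as %23 when the token is sent back in a request URL.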

On Tue, 5 Mar 2002, Michael L. Nelson wrote:
> actually, the way I see it is the protocol should not be complicated with
> additional tokens and such to enforce what ETDCat (and similarly
> large-sized DPs) should do:
> 1.  partition their collection into sets
> 2.  use stateless (or very long lived) resumptionTokens
> in 2.0, resumptionTokens will have optional attributes, including
> "expirationDate", so this will take the guess work out of knowing how long
> a resumptionToken will be valid.
> IMO, introducing an optional restartToken is no different (from an
> implementer's point of view) than making the resumptionToken last a long
> time.  
> at some point, you (as a harvester) are simply at the mercy of the
> repository.  new features in the protocol won't change that.
> regards,
> Michael
> On Tue, 5 Mar 2002, 'Alan Kent' wrote:
> > I just got some mail from Jeff at OCLC talking about ETDCat (hope
> > you don't mind me quoting some of your mail Jeff). In particular,
> > he just told me
> > 
> >     ETDCat contains a lot of records (over 4 million), all of
> >     which currently have the exact same datestamp from the initial load.
> > 
> > He also told me that there were no sets. So basically, it's all
> > or nothing for this site because OAI has no standard way to resume
> > if a transfer fails.
> > 
> > If this has happened already, I think it's likely to occur again.
> > (That is, one very large database all with the same time stamp.)
> > So any comments about having a single large collection like this
> > are beside the point. The point is OAI does not handle it well.
> > 
> > So I would like to resurrect the discussion, if people don't
> > mind, on how to support restarts. My understanding of the general
> > feeling so far is
> > 
> > (1) Mandating support is not going to be acceptable
> > 
> > (2) Mandating format of resumption tokens is not going to be acceptable
> > 
> > (3) Mandating resumption tokens be long-lived (e.g. can try again the
> >     following day) is not acceptable
> > 
> > (4) In fact, mandating that resumption tokens be unique (allowing
> >     a token to be reused twice in quick succession to get the same
> >     data) is not acceptable
> > 
> > So any proposal needs to be optionally supported.
> > 
> > Question time:
> > 
> > Does anyone else think that this is a major hole in OAI? I personally
> > do. After trying to crawl sites, things go wrong. The larger the site,
> > the greater the probability that something will go wrong. The larger
> > the site, the greater the pain of starting all over again. I do not
> > think it is practical for anyone to harvest ETDCat if it has really got
> > 4,000,000 records. Any fault, and start downloading that 4 GB again!
> > So I feel strongly on this one. In fact, I think this is the biggest
> > problem OAI has.
> > 
> > Do people think it's better to reuse resumption tokens for this purpose,
> > or introduce a different sort of token? ETDCat for example I think
> > allocates a session id in resumption tokens, meaning they cannot
> > be reused when the session times out in the server (similar semantics
> > anyway). This is a reasonable implementation decision to make.
> > So maybe it's better for servers to return an additional token,
> > a <restartToken>, so that a client can, instead of
> > specifying from= and until= again, specify restartToken=, and
> > the server then automatically works out whatever other parameters
> > it needs, creates a new session etc internally. The new 'session'
> > (ListXXX verb) then can use resumptionTokens to manage that new
> > transfer.
> > 
> > The idea is for a <restartToken> to be long-lived. It may be less
> > efficient to use than a resumptionToken, but its only purpose is
> > recovery if the client's download fails. If a server does not support
> > restartToken, it simply never returns one. Large collections *should*
> > support restartTokens.
> > 
> > For my harvester, I can then remember (to disk) the restartToken for
> > every packet I get back, allowing me to recover much more easily
> > if anything crashes. If restartTokens are too hard for someone
> > to implement, then they don't. If you have a large data collection
> > on the other hand, to reduce network load, I think it's probably worth
> > the extra effort of supporting restartTokens.
> > 
> > Any comments? Better suggestions?
> > 
> > Alan
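
The harvester-side idea in Alan's message, remembering the latest token
to disk after every response, might look roughly like this in Python.
The function names and the JSON file layout are assumptions made for
illustration; the token itself is treated as an opaque string, whether
it is a reusable resumptionToken or the proposed restartToken.

```python
import json
import os
import tempfile


def save_checkpoint(path, token):
    """Record the latest token, writing to a temporary file first so a
    crash never leaves a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"token": token}, f)
        f.flush()
        os.fsync(f.fileno())
    # Swap the new file into place in a single rename step.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved token, or None if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)["token"]
    except FileNotFoundError:
        return None
```

After a crash the harvester would call load_checkpoint() and, if it gets
a token back, present that token to the repository instead of starting
from the beginning.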