[OAI-implementers] Better resumption mechanism - more important than ever!

Tue, 5 Mar 2002 13:59:22 -0500 (EST)

On Tue, 5 Mar 2002, Young,Jeff wrote:

> I'd be very disappointed if ETDCat required custom and unique consideration
> from harvesters merely because of its size. Partitioning the collection
> would be a case in point. The implication seems to be that harvesters would
> somehow know to query the list of sets and then loop through each of them.
> How would an arbitrary harvester know to do that, and is their software even
> capable of it without custom coding? It would also prevent me from using
> sets for legitimate purposes since I couldn't distinguish between them.

my original point about sets was not intended to be the primary point.  if
ETDCat already uses sets, then Alan's harvester should investigate
harvesting by set, since that will naturally partition the collection
(modulo Liu's point that not all records are guaranteed to be in sets).
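
A minimal sketch of what "harvesting by set" could look like on the
harvester side, assuming the setSpecs have already been read from a
ListSets response. The base URL and setSpec values are hypothetical, and
records that belong to no set would not be reached by any of these
requests.

```python
from urllib.parse import urlencode


def requests_by_set(base_url, set_specs, metadata_prefix="oai_dc"):
    """Build one ListRecords request URL per set.

    base_url and set_specs are illustrative inputs; a real harvester
    would take the setSpecs from the repository's ListSets response.
    """
    urls = []
    for spec in set_specs:
        query = urlencode({
            "verb": "ListRecords",
            "metadataPrefix": metadata_prefix,
            "set": spec,
        })
        urls.append(f"{base_url}?{query}")
    return urls
```

Each URL starts an independent, smaller harvest, so a failure only costs
the set that was in progress.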

> 
> I'd be happy to implement stateless resumptionTokens, but unless harvesters
> know how to use them for recovery, why bother? How many harvesters today
> could manage a recovery using stateless resumptionTokens? How many
> harvesters will handle it tomorrow if OAI remains agnostic on the issue?
> 

I'm not sure, but I would guess this would be the default behavior.  If
the harvester chokes, I would start again where it left off.  If
successful harvesting continues, then there was a transient error.  If it
fails again, then maybe the repository has a problem.
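
A rough sketch of that retry logic, assuming the repository's
resumptionTokens stay valid long enough to be retried. fetch_page is a
hypothetical callable (not part of any OAI library) that issues the
ListRecords request for a given token and returns the records plus the
next token; the checkpoint file name and retry count are likewise made up.

```python
import json
import os
import tempfile


def save_checkpoint(path, token):
    """Atomically write the most recent resumptionToken to disk."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"resumptionToken": token}, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved resumptionToken, or None if there is no checkpoint."""
    try:
        with open(path) as f:
            return json.load(f)["resumptionToken"]
    except FileNotFoundError:
        return None


def harvest(fetch_page, checkpoint_path, handle_records, max_retries=2):
    """Harvest page by page, resuming from the last saved token.

    fetch_page(token) -> (records, next_token); token None means
    "start from the beginning", next_token None means "list complete".
    """
    token = load_checkpoint(checkpoint_path)
    failures = 0
    while True:
        try:
            records, next_token = fetch_page(token)
        except IOError:
            failures += 1
            if failures > max_retries:
                raise  # repeated failure: probably a repository problem
            continue   # possibly transient: retry the same token
        failures = 0
        handle_records(records)
        if next_token is None:
            if os.path.exists(checkpoint_path):
                os.remove(checkpoint_path)
            return
        save_checkpoint(checkpoint_path, next_token)
        token = next_token
```

If the process dies between handling a page and saving the checkpoint,
that page is delivered again on restart, so handle_records should
tolerate duplicates.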

In 2.0, it will be even easier to determine:  

- when the resumptionToken expires
- how big the result set is
- and how many records the repository has transmitted so far
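
As an illustration, a harvester could read those three values along
these lines. Only "expirationDate" is named in this thread; the sketch
assumes the other two attributes are called completeListSize and cursor
and that responses use the http://www.openarchives.org/OAI/2.0/
namespace, as in the published 2.0 specification. It also assumes
second-granularity UTC timestamps.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace assumed from the published OAI-PMH 2.0 specification.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def token_info(response_xml):
    """Extract the resumptionToken and its optional attributes.

    Returns None when the response carries no resumptionToken element.
    """
    root = ET.fromstring(response_xml)
    elem = root.find(f".//{OAI_NS}resumptionToken")
    if elem is None:
        return None

    def as_int(name):
        value = elem.get(name)
        return int(value) if value is not None else None

    expires = elem.get("expirationDate")
    if expires is not None:
        # Assumes the YYYY-MM-DDThh:mm:ssZ form of the timestamp.
        expires = datetime.strptime(
            expires, "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)

    return {
        "token": (elem.text or "").strip() or None,  # empty on last page
        "expires": expires,
        "complete_list_size": as_int("completeListSize"),
        "cursor": as_int("cursor"),
    }
```

With complete_list_size and cursor in hand, a harvester can report
progress and decide whether a restart is worth attempting before the
token expires.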

Any harvester writers out there care to comment?  Liu?  Hussein?

> I'm sure ETDCat needs more stress testing to minimize future failures. The
> fact that we've discussed this before, though, indicates a recognition that
> problems can happen. I don't blame Alan if he doesn't want to negotiate
> special rules for harvesting ETDCat merely because the risk is proportional
> to the size of the repository.

(I hope no one interprets my comments as beating up on ETDCat)

These are interesting points:  to the best of my knowledge, if ETDCat has
4M records, it's by far the biggest OAI repository out there.  My point
is that 4M of anything is a big number, and repositories that large need
to make sure they implement features that facilitate fault-tolerant
harvesting.  Stateless (or very long lived) resumptionTokens would appear
to be one of those features.  Also, if you have 4M records all with the
same datestamp, this would seem to be an ideal candidate for some response
caching techniques, which combined with very long lived (2-3 days?)
resumptionTokens would seem to make for an efficient load on your end.
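
One way a "stateless" resumptionToken could be built, as a sketch only:
the token itself carries the request arguments and the position in the
list, so the repository keeps no session. The field names, the
base64/JSON encoding and the tiny page size are all assumptions for
illustration, and the scheme only works if the repository returns
records in a stable order.

```python
import base64
import json


def encode_token(metadata_prefix, offset, set_spec=None, from_=None, until=None):
    """Pack the request arguments and list position into an opaque token."""
    state = {"p": metadata_prefix, "o": offset,
             "s": set_spec, "f": from_, "u": until}
    raw = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return base64.urlsafe_b64encode(raw.encode("utf-8")).decode("ascii")


def decode_token(token):
    """Recover the request state from a token made by encode_token."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    return json.loads(raw.decode("utf-8"))


def list_page(records, token=None, metadata_prefix="oai_dc", page_size=2):
    """Return (page, next_token) using only the token to find the position.

    records stands in for the repository's stably ordered result list.
    """
    state = decode_token(token) if token else {"p": metadata_prefix, "o": 0}
    start = state["o"]
    page = records[start:start + page_size]
    end = start + len(page)
    next_token = encode_token(state["p"], end) if end < len(records) else None
    return page, next_token
```

Because the same token always maps to the same slice, a token can be
replayed after a failure, and responses keyed by token are natural
candidates for caching while the datestamps do not change.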

regards,

Michael

> 
> Jeff
> 
> -----Original Message-----
> From: Michael L. Nelson [mailto:mln@ils.unc.edu]
> Sent: Tuesday, March 05, 2002 10:03 AM
> To: 'Alan Kent'
> Cc: OAI Implementors
> Subject: Re: [OAI-implementers] Better resumption mechanism - more
> important than ever!
> 
> 
> 
> actually, the way I see it is the protocol should not be complicated with
> additional tokens and such to enforce what ETDCat (and similarly
> large-sized DPs) should do:
> 
> 1.  partition their collection into sets
> 2.  use stateless (or very long lived) resumptionTokens
> 
> in 2.0, resumptionTokens will have optional attributes, including
> "expirationDate", so this will take the guess work out of knowing how long
> a resumptionToken will be valid.
> 
> IMO, introducing an optional restartToken is no different (from an
> implementer's point of view) than making the resumptionToken last a long
> time.  
> 
> at some point, you (as a harvester) are simply at the mercy of the
> repository.  new features in the protocol won't change that.
> 
> regards,
> 
> Michael
> 
> On Tue, 5 Mar 2002, 'Alan Kent' wrote:
> 
> > I just got some mail from Jeff at OCLC talking about ETDCat (hope
> > you don't mind me quoting some of your mail Jeff). In particular,
> > he just told me
> > 
> >     ETDCat contains a lot of records (over 4 million), all of
> >     which currently have the exact same datestamp from the initial load.
> > 
> > He also told me that there were no sets. So basically, it's all
> > or nothing for this site because OAI has no standard way to resume
> > if a transfer fails.
> > 
> > If this has happened already, I think it's likely to occur again.
> > (That is, one very large database all with the same time stamp.)
> > So any comments about having a single large collection like this
> > are beside the point. The point is OAI does not handle it well.
> > 
> > So I would like to resurrect the discussion again, if people don't
> > mind, on how to support restarts. My understanding of the general
> > feeling so far is
> > 
> > (1) Mandating support is not going to be acceptable
> > 
> > (2) Mandating format of resumption tokens is not going to be acceptable
> > 
> > (3) Mandating resumption tokens be long-lived (e.g., can try again the
> >     following day) is not acceptable
> > 
> > (4) In fact, mandating that resumption tokens be unique (allowing
> >     a token to be reused twice in quick succession to get the same
> >     data) is not acceptable
> > 
> > So any proposal needs to be optionally supported.
> > 
> > Question time:
> > 
> > Does anyone else think that this is a major hole in OAI? I personally
> > do. After trying to crawl sites, things go wrong. The larger the site,
> > the greater the probability that something will go wrong. The larger
> > the site, the greater the pain of starting all over again. I do not
> > think it is practical for anyone to harvest ETDCat if it has really got
> > 4,000,000 records. Any fault, and start downloading that 4 GB again!
> > So I feel strongly on this one. In fact, I think this is the most
> > major problem OAI has.
> > 
> > Do people think it's better to reuse resumption tokens for this purpose,
> > or introduce a different sort of token? ETDCat for example I think
> > allocates a session id in resumption tokens, meaning they cannot
> > be reused when the session times out in the server (similar semantics
> > anyway). This is a reasonable implementation decision to make.
> > So maybe it's better for servers to return an additional token,
> > which is a <restartToken> which means a client can instead of
> > specifying from= and to= again, specify restartToken= instead where
> > the server then automatically works out whatever other parameters
> > it needs, creates a new session etc internally. The new 'session'
> > (ListXXX verb) then can use resumptionTokens to manage that new
> > transfer.
> > 
> > The idea is for a <restartToken> to be long-lived. It may be less
> > efficient to use than a resumptionToken, but its only purpose is
> > if the client fails the download. If a server does not support
> > restartToken, it simply never returns one. Large collections *should*
> > support restartTokens.
> > 
> > For my harvester, I can then remember (to disk) the restartToken for
> > every packet I get back, allowing me to recover much more easily
> > if anything crashes. If restartTokens are too hard for someone
> > to implement, then they don't. If you have a large data collection
> > on the other hand, to reduce network load, I think it's probably worth
> > the extra effort of supporting restartTokens.
> > 
> > Any comments? Better suggestions?
> > 
> > Alan
> > _______________________________________________
> > OAI-implementers mailing list
> > OAI-implementers@oaisrv.nsdl.cornell.edu
> > http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers
> > 
> 
> ---
> Michael L. Nelson
> NASA Langley Research Center		m.l.nelson@larc.nasa.gov
> MS 158, Hampton, VA 23681		http://www.ils.unc.edu/~mln/
> +1 757 864 8511				+1 757 864 8342 (f)
> 

---
Michael L. Nelson
NASA Langley Research Center		m.l.nelson@larc.nasa.gov
MS 158, Hampton, VA 23681		http://www.ils.unc.edu/~mln/
+1 757 864 8511				+1 757 864 8342 (f)