[OAI-implementers] Better resumption mechanism - more important than ever!

Young,Jeff jyoung@oclc.org
Tue, 5 Mar 2002 13:36:49 -0500


I'd be very disappointed if ETDCat required custom and unique consideration
from harvesters merely because of its size. Partitioning the collection
would be a case in point. The implication seems to be that harvesters would
somehow know to query the list of sets and then loop through each of them.
How would an arbitrary harvester know to do that, and is their software even
capable of it without custom coding? It would also prevent me from using
sets for legitimate purposes since I couldn't distinguish between them.

I'd be happy to implement stateless resumptionTokens, but unless harvesters
know how to use them for recovery, why bother? How many harvesters today
could manage a recovery using stateless resumptionTokens? How many
harvesters will handle it tomorrow if OAI remains agnostic on the issue?
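
To make the stateless option concrete, here is a minimal Python sketch. It is not ETDCat's actual implementation. It assumes the repository can return records in a stable order (for example by identifier, since the datestamps are all identical). The names encode_token, decode_token, list_page and PAGE_SIZE are hypothetical. The token carries the whole request state, so the server keeps no session and the same token can be replayed after a failure:

```python
import base64
import json

PAGE_SIZE = 500  # hypothetical batch size per response


def encode_token(state):
    """Pack the full request state (offset plus any from/until/set) into the token."""
    raw = json.dumps(state, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_token(token):
    """Recover the request state from a token; no server-side session is consulted."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    return json.loads(raw.decode("utf-8"))


def list_page(records, token=None, **request):
    """Return (page, next_token) for a stably ordered list of records.

    `records` stands in for the repository. A real server would also apply
    the from/until/set values carried in the state; that is omitted here.
    """
    state = decode_token(token) if token else dict(request, offset=0)
    offset = state["offset"]
    page = records[offset:offset + PAGE_SIZE]
    next_offset = offset + len(page)
    if next_offset >= len(records):
        return page, None  # list complete, no further token
    return page, encode_token(dict(state, offset=next_offset))
```

Because nothing expires on the server, a harvester that saved the last token it received can pick up from that page after a crash, which is exactly the recovery behaviour harvesters would need to know about.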

I'm sure ETDCat needs more stress testing to minimize future failures. The
fact that we've discussed this before, though, indicates a recognition that
problems can happen. I don't blame Alan if he doesn't want to negotiate
special rules for harvesting ETDCat merely because the risk is proportional
to the size of the repository.

Jeff

-----Original Message-----
From: Michael L. Nelson [mailto:mln@ils.unc.edu]
Sent: Tuesday, March 05, 2002 10:03 AM
To: 'Alan Kent'
Cc: OAI Implementors
Subject: Re: [OAI-implementers] Better resumption mechanism - more
important than ever!



actually, the way I see it is the protocol should not be complicated with
additional tokens and such to enforce what ETDCat (and similarly
large-sized DPs) should do:

1.  partition their collection into sets
2.  use stateless (or very long lived) resumptionTokens

in 2.0, resumptionTokens will have optional attributes, including
"expirationDate", so this will take the guesswork out of knowing how long
a resumptionToken will be valid.
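
As a rough illustration of how a harvester might use that attribute, here is a small Python sketch. The function name token_expiry and the sample fragment are hypothetical, the element is shown without a namespace for brevity, and the timestamp format (UTC, "YYYY-MM-DDThh:mm:ssZ") is an assumption:

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def token_expiry(token_xml):
    """Return the token's expiry as an aware UTC datetime, or None if not stated."""
    elem = ET.fromstring(token_xml)
    expires = elem.get("expirationDate")  # optional attribute, may be absent
    if expires is None:
        return None
    # Assumed format: UTC timestamp with a trailing "Z".
    parsed = datetime.strptime(expires, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


# Hypothetical fragment of a ListRecords response.
fragment = '<resumptionToken expirationDate="2002-03-06T00:00:00Z">abc123</resumptionToken>'
expiry = token_expiry(fragment)
now = datetime(2002, 3, 5, 18, 0, 0, tzinfo=timezone.utc)
can_retry_later = expiry is None or now < expiry
```

When the attribute is missing, the harvester is back to guessing, so the sketch treats "no expiry stated" as "worth trying" rather than as a guarantee.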

IMO, introducing an optional restartToken is no different (from an
implementer's point of view) than making the resumptionToken last a long
time.  

at some point, you (as a harvester) are simply at the mercy of the
repository.  new features in the protocol won't change that.

regards,

Michael

On Tue, 5 Mar 2002, 'Alan Kent' wrote:

> I just got some mail from Jeff at OCLC talking about ETDCat (hope
> you don't mind me quoting some of your mail Jeff). In particular,
> he just told me
> 
>     ETDCat contains a lot of records (over 4 million), all of
>     which currently have the exact same datestamp from the initial load.
> 
> He also told me that there were no sets. So basically, it's all
> or nothing for this site because OAI has no standard way to resume
> if a transfer fails.
> 
> If this has happened already, I think it's likely to occur again.
> (That is, one very large database all with the same time stamp.)
> So any comments about having a single large collection like this
> are beside the point. The point is OAI does not handle it well.
> 
> So I would like to resurrect the discussion, if people don't
> mind, on how to support restarts. My understanding of the general
> feeling so far is
> 
> (1) Mandating support is not going to be acceptable
> 
> (2) Mandating format of resumption tokens is not going to be acceptable
> 
> (3) Mandating resumption tokens be long-lived (e.g. can try again the
>     following day) is not acceptable
> 
> (4) In fact, mandating that resumption tokens be unique (allowing
>     a token to be reused twice in quick succession to get the same
>     data) is not acceptable
> 
> So any proposal needs to be optionally supported.
> 
> Question time:
> 
> Does anyone else think that this is a major hole in OAI? I personally
> do. After trying to crawl sites, things go wrong. The larger the site,
> the greater the probability that something will go wrong. The larger
> the site, the greater the pain of starting all over again. I do not
> think it is practical for anyone to harvest ETDCat if it has really got
> 4,000,000 records. Any fault, and you start downloading that 4 GB again!
> So I feel strongly on this one. In fact, I think this is the most
> serious problem OAI has.
> 
> Do people think it's better to reuse resumption tokens for this purpose,
> or introduce a different sort of token? ETDCat for example I think
> allocates a session id in resumption tokens, meaning they cannot
> be reused when the session times out in the server (similar semantics
> anyway). This is a reasonable implementation decision to make.
> So maybe it's better for servers to return an additional token,
> a <restartToken>, so that a client can, instead of specifying
> from= and until= again, specify restartToken= and the server then
> automatically works out whatever other parameters it needs,
> creates a new session etc. internally. The new 'session'
> (ListXXX verb) can then use resumptionTokens to manage that new
> transfer.
> 
> The idea is for a <restartToken> to be long-lived. It may be less
> efficient to use than a resumptionToken, but its only purpose is
> recovery when a client's download fails. If a server does not support
> restartToken, it simply never returns one. Large collections *should*
> support restartTokens.
> 
> For my harvester, I can then remember (to disk) the restartToken for
> every packet I get back, allowing me to recover much more easily
> if anything crashes. If restartTokens are too hard for someone
> to implement, then they don't have to. If you have a large data collection
> on the other hand, to reduce network load, I think it's probably worth
> the extra effort of supporting restartTokens.
> 
> Any comments? Better suggestions?
> 
> Alan
> _______________________________________________
> OAI-implementers mailing list
> OAI-implementers@oaisrv.nsdl.cornell.edu
> http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers
> 
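
The harvester-side checkpointing Alan describes above could be sketched roughly as follows in Python. restartToken is only a proposal in this thread, so everything here is hypothetical: the fetch callable stands in for the HTTP requests, and the sketch assumes that the restartToken delivered with a packet resumes the transfer at the packet that follows it.

```python
import json
import os
import tempfile


def save_checkpoint(path, restart_token):
    """Write the latest restartToken atomically so a crash cannot leave a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"restartToken": restart_token}, f)
    os.replace(tmp, path)  # atomic rename over the old checkpoint


def load_checkpoint(path):
    """Return the saved restartToken, or None when starting from scratch."""
    try:
        with open(path) as f:
            return json.load(f)["restartToken"]
    except FileNotFoundError:
        return None


def harvest(fetch, path, handle):
    """Run (or resume) a harvest.

    fetch(restart_token) is a hypothetical callable yielding
    (records, restart_token) packets, starting from the beginning when
    restart_token is None. handle(records) stores one packet of records.
    """
    token = load_checkpoint(path)
    for records, restart_token in fetch(token):
        handle(records)
        # Checkpoint only after the packet has been handled successfully.
        if restart_token is not None:
            save_checkpoint(path, restart_token)
```

Saving the token only after a packet has been handled means a crash re-fetches at most one packet rather than the whole collection. A server that never returns a restartToken simply leaves the checkpoint file unwritten.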

---
Michael L. Nelson
NASA Langley Research Center		m.l.nelson@larc.nasa.gov
MS 158, Hampton, VA 23681		http://www.ils.unc.edu/~mln/
+1 757 864 8511				+1 757 864 8342 (f)

