[OAI-implementers] Better resumption mechanism - more important than ever!

Xiaoming Liu liu_x@cs.odu.edu
Tue, 5 Mar 2002 21:53:54 -0500 (EST)


I think there are two questions here.

1) Could the resumptionToken (in your case restartToken) be re-used? 

I agree that the retry algorithm is theoretically unsafe in the current
protocol, thanks. However, the same question also applies to the
"restartToken" and must be addressed before we discuss question 2. If
tokens cannot be re-used, the harvester has to start from scratch. OAI
1.1 does not give a clear answer to this question; hopefully it will be
answered in 2.0.

2) If it is legal to re-use, should we introduce a restartToken concept?

My personal opinion is that a restartToken would add too much
complexity, and it is not necessary.

In your case, I imagine it could be done with the current OAI
resumptionToken: call the two tokens you propose alan_restartToken and
alan_resumptionToken respectively, and let

oai_resumptionToken=alan_restartToken + alan_resumptionToken

The data provider (DP) can always parse the oai_resumptionToken. In
most cases the session is still valid and the DP just uses the
alan_resumptionToken part; if anything goes wrong and the DP needs to
redo the query, it is free to use the alan_restartToken part. The
harvester never knows what happens behind the scenes. In this scenario,
the time-to-live could be a month, or a year ;-)
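A minimal sketch of this combined-token idea (hypothetical Python; the
helper names, the "!" separator, and the example token contents are
illustrative only, not part of any OAI specification):

```python
import base64

SEP = "!"  # assumed separator; the restart part must not itself contain it

def make_token(restart_part: str, resumption_part: str) -> str:
    """Pack both parts into one opaque OAI resumptionToken string."""
    raw = restart_part + SEP + resumption_part
    return base64.urlsafe_b64encode(raw.encode()).decode()

def parse_token(token: str) -> tuple[str, str]:
    """Recover the long-lived restart part and the short-lived session part."""
    raw = base64.urlsafe_b64decode(token.encode()).decode()
    restart_part, resumption_part = raw.split(SEP, 1)
    return restart_part, resumption_part

token = make_token("from=1981&metadataPrefix=oai_dc", "session42:offset=1000")
assert parse_token(token) == ("from=1981&metadataPrefix=oai_dc",
                              "session42:offset=1000")
```

The harvester sees only the opaque string; the DP decides at each
request whether the session part is still usable or whether it must
fall back to redoing the query from the restart part.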

Another small doubt.

>precise date stamps (lets say every ETDCat record has a different
>stamp), because results are not guaranteed to come back in sorted
>order, I cannot restart using from=. I must start again from scratch.

I think it is perfectly correct to restart using from=. For example, if
you finished everything <=1980 and the system crashed while you were
processing "from=1981", it is correct to issue "from=1981" again,
because the protocol guarantees you already have everything before 1981.
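That restart rule could be sketched like this (hypothetical harvester
code; the function name and year-granularity datestamps are
illustrative):

```python
def restart_from(last_completed_year: int) -> str:
    """Build the from= argument for re-issuing the interrupted request."""
    return "from=%d" % (last_completed_year + 1)

# Finished everything <= 1980, crashed while harvesting 1981:
# simply ask for 1981 onward again.
assert restart_from(1980) == "from=1981"
```

Records from the interrupted year may be received twice, but nothing
already harvested before that year needs to be fetched again.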


On Wed, 6 Mar 2002, 'Alan Kent' wrote:

> On Tue, Mar 05, 2002 at 10:02:43AM -0500, Michael L. Nelson wrote:
> > actually, the way I see it is the protocol should not be complicated with
> > additional tokens and such to enforce what ETDCat (and similarly
> > large-sized DPs) should do:
> > 
> > 1.  partition their collection into sets
> I am sorry, but I agree with others here that sets are not the
> solution. How are the sets going to be created? Are they going
> to have any semantics (or just 1,000 records per set)? What if I
> do want semantics for my sets, but one set has 1,000,000
> records? What happens when people start creating even bigger
> collections? Etc. I think sets can be useful, but I would not
> *rely* on them as solving the problem.
> > 2.  use stateless (or very long lived) resumptionTokens
> > 
> > in 2.0, resumptionTokens will have optional attributes, including
> > "expirationDate", so this will take the guess work out of knowing how long
> > a resumptionToken will be valid.
> > 
> > IMO, introducing an optional restartToken is no different (from an
> > implementer's point of view) than making the resumptionToken last a long
> > time.  
> I am going to play devil's advocate a bit here - I think it's worth
> teasing out arguments a bit more to make sure they are solid.
> There is a difference, but is the difference worth the complexity
> to the protocol? That is a different question.
> For example, if I was going to build a data supplier implementation
> (I am actually thinking about how it would be done), then I would
> layer it on top of Z39.50 - because that is what our database server
> uses. Z39.50 has a result set concept. So I would do a search,
> then the resumptionToken would be the result set name. If I had
> to make resumptionTokens unique (not currently required I believe),
> then I would add the offset into the result set. Since result
> sets are stored in the server, I might use a timeout of 10 minutes,
> maybe an hour, certainly not a few days. Each result set uses up
> memory in the server! Note that because I have a Z39.50 result,
> I don't need to worry about updates of data in the server.
> My result set won't change in size during the transfer, so I can
> implement idempotent resumptionTokens easily.
> So how to support restarting if something goes wrong? Well, I could
> implement a restartToken which encoded the original request and
> the OAI record identifier I was up to. Note, I would not store the
> result set index. I have to redo the query, the database may have
> changed, so the old index is no longer guaranteed to be correct.
> (I would sort the result sets in the server to make my implementation
> easy). My restart query would be the old from/until stuff, plus an
> additional 'id >= id-from-restartToken' so the new result set would
> be smaller.
> How long would my restartToken be valid for? I could say months
> or years. How long would my resumptionToken be valid for? minutes
> or hours, not days. Remember that if a transfer fails, my data
> provider code is not sure how long before (or if) the client is
> going to retry. If the harvester says 'help! I need human
> intervention', then the delay could be significant.
> So my *personal* feeling is restartTokens should have a lifetime of
> at least a week. Certainly multiple days. I think this
> might be too much of a restriction on resumption tokens.
> Some other points worth noting:
> * If a server does support long term resumption tokens, then they
>   can return exactly the same string for both resumption tokens
>   and restart tokens. So implementation is not that much harder.
> * It is reasonable for a request using a restart token to return
>   a different set of records (due to database updates) than the
>   old request. It is also reasonable for a server not to return
>   a restart token for every response - it could, for example,
>   return a restart token every time the day or year changes in
>   returned records (if the implementation returns them in order)
>   allowing the harvester to avoid doing *all* the work again,
>   even if some effort is repeated. (ie: more flexibility).
> * Is enhancing the Identify verb response (in a standard way) a
>   good model to move to? It is a real option, and a reasonable one.
>   But so far OAI has not required harvesters to do this sort of
>   look into what the server provides. Do people want to start now?
> (Philosophical question here worth asking.) Using restartToken
> does not require examining Identify responses.
> * For small servers, they do not have to implement restartToken at
>   all. In that case, harvesters just redo the whole request.
>   So this is not mandatory additional code to write.
> * For people who have written code to implement a data provider,
> how much of a burden is there for resumptionTokens to be valid
>   for a long period of time? (eg: a week). Would a separate
>   restartToken be any use?
> * For data provider programmers again, if the data provider server
>   goes down (eg: shut down nightly for backups or something), will
>   it be easy to make resumption tokens survive across such events?
> * Has OAI 2.0 decreed that resumptionTokens can be reused? (Idempotent)
>   If not, then they cannot be used to recover - unless again something
>   is added to Identify for harvesters to say 'oh, I can try a reload'.
> Taking my horns off for a moment, I also agree that keeping the protocol
> simple is a very good thing.
> But I am not (yet) convinced (oh dear, those horns don't come off that
> easily do they >;^) that forcing resumptionTokens to have a longer
> life is actually simplifying the job of implementors. And I don't think
> short life resumptionTokens (less than a few days at least) will solve
> the restart problem. Semantically, to me resumptionTokens are used as
> a protocol mechanism to link multiple packets into a single request.
> RestartTokens are used to recover after a failure by starting a
> completely new request.
> > at some point, you (as a harvester) are simply at the mercy of the
> > repository.  new features in the protocol won't change that.
> That is true, but that does not mean to me that the protocol cannot
> be improved to make the protocol more robust. With OAI as it is,
> I am not going to try to crawl ETDCat any more. Even with more
> precise date stamps (let's say every ETDCat record has a different
> stamp), because results are not guaranteed to come back in sorted
> order, I cannot restart using from=. I must start again from scratch.
> I think the real question is will data provider implementers be
> happy with resumptionTokens lasting for a week. For me *personally*,
> it will be easier having two separate tokens. But I think it's wrong
> to design the protocol around my intended implementation (which does
> not even - and may never - exist! :-)
> Alan
> _______________________________________________
> OAI-implementers mailing list
> OAI-implementers@oaisrv.nsdl.cornell.edu
> http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers