[OAI-implementers] Better resumption mechanism - more important than ever!

Michael L. Nelson mln@ils.unc.edu
Tue, 5 Mar 2002 22:42:26 -0500 (EST)


(combining the last 2 emails into 1 response)

>> I guess this is a major issue that harvester should follow a certain
>> policy. In current implementation in Arc, for each failed request, the
>> harvester will try at most three times using the same http request...

>The problem I have with this is in the current OAI spec as I read
>it, this is not a safe thing to do. OAI does not mandate that
>resumptionToken's change. That is, resumptionToken *could* be
>a session id, with no cursor information in it. Each request
>using the same resumptionToken therefore is permitted to return
>the next N records.

>So I could implement a data provider where the above algorithm is
>wrong and will silently deliver incorrect results.

in OAI 2.0, resumptionTokens are slated to have the optional attributes
of:

- resumeAfter (how long to wait before issuing the next request)
- expirationDate (TTL for the resumptionToken)
- completeListSize (total # of records in the complete list)
- cursor (number of items returned so far)

granted, a repository does not have to implement these, but given the
concerns we are talking about, it behooves them to.

so if you re-issue a resumptionToken, and completeListSize or cursor are
not what you expect, then something has gone wrong.  and if the
resumptionToken is past its expiration, the repository will respond with a
badResumptionToken error.
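
A minimal harvester-side sketch of that check, assuming the attributes are
delivered on a bare <resumptionToken> element and that expirationDate is a
UTC timestamp like 2002-03-12T00:00:00Z (both are assumptions, as are the
function names; the 2.0 details are not final):

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def parse_resumption_token(xml_text):
    """Parse a bare <resumptionToken> element into (token, attributes)."""
    elem = ET.fromstring(xml_text)

    def as_int(name):
        value = elem.get(name)
        return int(value) if value is not None else None

    expires = elem.get("expirationDate")
    if expires is not None:
        # Assumed format: UTC timestamp such as 2002-03-12T00:00:00Z.
        expires = datetime.strptime(expires, "%Y-%m-%dT%H:%M:%SZ")
        expires = expires.replace(tzinfo=timezone.utc)
    attrs = {
        "cursor": as_int("cursor"),
        "completeListSize": as_int("completeListSize"),
        "expirationDate": expires,
    }
    return (elem.text or "").strip(), attrs


def is_expired(attrs, now):
    """True if the token carries an expirationDate that has passed."""
    expires = attrs["expirationDate"]
    return expires is not None and now >= expires


def reissue_problems(first, second):
    """Compare attributes seen on the first response and on a re-issue."""
    problems = []
    for name in ("cursor", "completeListSize"):
        a, b = first[name], second[name]
        # Attributes are optional, so only compare when both are present.
        if a is not None and b is not None and a != b:
            problems.append(f"{name} changed from {a} to {b}")
    return problems
```

A harvester could call is_expired() before retrying at all, and treat a
non-empty reissue_problems() result as a reason to stop and start over.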

>Does OAI 2.0 say that resumptionToken's must be unique within
>a download? And that reusing an old resumptionToken must be
>supported (or rejected with an error)? If not guaranteed by
>the spec, then I would not want to write a harvester relying
>on it. I would rather spend the effort and get the spec right
>rather than having to come to agreements with individual data
>providers.

I don't think the spec currently requires that a repository reject expired
resumptionTokens, but a harvester would be wise not to use them if they
are expired.  it's like drinking milk a day or two after the expiration
date:  it's *probably* ok, but you gotta be pretty thirsty to do it.

>One approach discussed was of course to add something to the
>Identify response allowing servers to advertise 'its safe
>to reuse resumptionTokens'. More in a later mail.

agreed; see below in my message also.

>(I am not proposing anything in this mail - just saying I believe
>the above retry algorithm is theoretically unsafe.)

> On Tue, Mar 05, 2002 at 10:02:43AM -0500, Michael L. Nelson wrote:
> > actually, the way I see it is the protocol should not be complicated with
> > additional tokens and such to enforce what ETDCat (and similarly
> > large-sized DPs) should do:
> > 
> > 1.  partition their collection into sets
> 
> I am sorry, but I agree with others here that sets are not the
> solution. How are the sets going to be created? Are they going
> to have any semantics (or just 1,000 records per set)? What if I
> do want semantics for my sets, but one set does have a 1,000,000
> records? 

1M < 4M

> What happens when people start creating even bigger
> collections? Etc. I think sets can be useful, but I would not
> *rely* on them as solving the problem.
> 

I never meant to imply that they should be relied on, only that if they
exist, they can make harvesting easier.  I'll defer further set discussion
since the main topic is resumptionToken mechanics (sets can chew up a
whole other thread ;-)

> > 2.  use stateless (or very long lived) resumptionTokens
> > 
> > in 2.0, resumptionTokens will have optional attributes, including
> > "expirationDate", so this will take the guess work out of knowing how long
> > a resumptionToken will be valid.
> > 
> > IMO, introducing an optional restartToken is no different (from an
> > implementer's point of view) than making the resumptionToken last a long
> > time.  
> 
> I am going to play devil's advocate a bit here - I think it's worth
> teasing out arguments a bit more to make sure they are solid.
> 
> There is a difference, but is the difference worth the complexity
> to the protocol? That is a different question.

I'll rephrase my answer:  the repository can implement it so there is no
difference.

> 
> For example, if I was going to build a data supplier implementation
> (I am actually thinking about how it would be done), then I would
> layer it on top of Z39.50 - because that is what our database server
> uses. Z39.50 has a result set concept. So I would do a search,
> then the resumptionToken would be the result set name. If I had
> to make resumptionTokens unique (not currently required I believe),
> then I would add the offset into the result set. Since result
> sets are stored in the server, I might use a timeout of 10 minutes,
> maybe an hour, certainly not a few days. Each result set uses up
> memory in the server! Note that because I have a Z39.50 result,

this is an artifact of your implementation... write the result set out to
disk and set the expirationDate to a few days.  add a reasonable response
caching algorithm, and you could end up with a huge performance
win.  Depending on the DP accession rate, harvesting patterns, etc., your
mileage could vary, but I suspect it would be very good.

> I don't need to worry about updates of data in the server.
> My result set won't change in size during the transfer, so I can
> implement idempotent resumptionTokens easily.
> 
> So how to support restarting if something goes wrong? Well, I could
> implement a restartToken which encoded the original request and
> the OAI record identifier I was up to. Note, I would not store the
> result set index. I have to redo the query, the database may have
> changed, so the old index is no longer guaranteed to be correct.
> (I would sort the result sets in the server to make my implementation
> easy). My restart query would be the old from/until stuff, plus an
> additional 'id >= id-from-restartToken' so the new result set would
> be smaller.

I think this is all doable with resumptionTokens.  As Liu suggests in his
email, you can embed your restartTokens in your resumptionTokens.
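
A rough sketch of such a stateless token, assuming the DP can re-run the
query sorted by identifier.  The token simply encodes the original from/until
request plus the first identifier not yet delivered, so resuming and
restarting use the same code path.  All names here (encode_token,
list_identifiers, the in-memory "store" dict) are made up for illustration:

```python
import base64
import json


def encode_token(request, next_id):
    """Pack the original request and the next identifier into a token."""
    payload = json.dumps({"req": request, "next": next_id}, sort_keys=True)
    return base64.urlsafe_b64encode(payload.encode("utf-8")).decode("ascii")


def decode_token(token):
    """Recover (request, next_id) from a token made by encode_token."""
    state = json.loads(base64.urlsafe_b64decode(token.encode("ascii")))
    return state["req"], state["next"]


def list_identifiers(store, request=None, token=None, page_size=2):
    """Return (page, next_token); store maps identifier -> datestamp."""
    start_id = ""
    if token is not None:
        request, start_id = decode_token(token)
    # Redo the query every time: from/until plus 'id >= id from token'.
    # ISO datestamps and identifiers are compared as plain strings.
    matching = sorted(
        ident
        for ident, stamp in store.items()
        if request["from"] <= stamp <= request["until"] and ident >= start_id
    )
    page, rest = matching[:page_size], matching[page_size:]
    next_token = encode_token(request, rest[0]) if rest else None
    return page, next_token
```

Because nothing is kept on the server, re-issuing the same token against an
unchanged collection returns the same page; if records were added or deleted
in the meantime, the page can differ, as with the restartToken described
above.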

> 
> How long would my restartToken be valid for? I could say months
> or years. How long would my resumptionToken be valid for? minutes
> or hours, not days. Remember that if a transfer fails, my data
> provider code is not sure how long before (or if) the client is
> going to retry. If the harvester says 'help! I need human
> intervention', then the delay could be significant.
> 
> So my *personal* feeling is restartToken's should have a life in
> terms of at least a week. Certainly multiple days. I think this
> might be too much of a restriction on resumption tokens.
> 
> 
> Some other points worth noting:
> 
> * If a server does support long term resumption tokens, then they
>   can return exactly the same string for both resumption tokens
>   and restart tokens. So implementation is not that much harder.
> 
> * It is reasonable for a request using a restart token to return
>   a different set of records (due to database updates) than the
>   old request. It is also reasonable for a server not to return
>   a restart token for every response - it could, for example,
>   return a restart token every time the day or year changes in
>   returned records (if the implementation returns them in order)
>   allowing the harvester to avoid doing *all* the work again,
>   even if some effort is repeated. (ie: more flexibility).
> 
> * Is enhancing the Identify verb response (in a standard way) a
>   good model to move to? It is a real option, and a reasonable one.
>   But so far OAI has not required harvesters to do this sort of
>   look into what the server provides. Do people want to start now?
>   (Philosophical question here worth asking.) Using restartToken
>   does not require examining Identify responses.

2.0 will already have more machine processable information in the Identify
response.  I'm not sure there is a good way around it, and since that
door is already open, if you want to provide hints about how your
resumptionTokens are used/implemented, that's surely ok.

> 
> * For small servers, they do not have to implement restartToken at
>   all. In that case, harvesters just redo the whole request.
>   So this is not mandatory additional code to write.
> 
> * For people who have written code to implement a data provider,
>   how much of a burden is there for resumptionToken's to be valid
>   for a long period of time? (eg: a week). Would a separate
>   restartToken be any use?

my DPs use stateless resumptionTokens, so resuming and restarting are
similar.

> 
> * For data provider programmers again, if the data provider server
>   goes down (eg: shut down nightly for backups or something), will
>   it be easy to make resumption tokens survive across such events?

I would implement stateful resumptionTokens in a disk cache, so recovery
would not be a problem.

> 
> * Has OAI 2.0 decreed that resumptionToken's can be reused? (Idempotent)
>   If not, then they cannot be used to recover - unless again something
>   is added to Identify for harvesters to say 'oh, I can try a reload'.
> 

good point...  perhaps if a DP sets an expirationDate, it should
idempotently (?!) honor the resumptionToken until that date...  hmm...  

this would be a good thing when the response to the harvester is lost, and
the harvester reissues a request with the same resumptionToken...  hmm...
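
One way a DP might do this for stateful tokens, sketched under the assumption
that it keeps each issued token's response until the expirationDate and
answers repeats from that copy.  The class and method names are hypothetical,
and a real DP would likely back this with disk rather than a dict:

```python
import time


class BadResumptionToken(Exception):
    """Stands in for the protocol's badResumptionToken error."""


class TokenStore:
    def __init__(self, ttl_seconds, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # token -> (expiry time, response)

    def issue(self, token, response):
        """Remember the response this token should return until it expires."""
        expires = self._clock() + self._ttl
        self._entries[token] = (expires, response)
        return expires

    def resume(self, token):
        """Return the same response on every call until the token expires."""
        entry = self._entries.get(token)
        if entry is None:
            raise BadResumptionToken(token)
        expires, response = entry
        if self._clock() >= expires:
            del self._entries[token]
            raise BadResumptionToken(token)
        return response
```

With this, a harvester that lost a response can re-send the same token and
get the identical chunk back, right up until the advertised expiration.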

> 
> Taking my horns off for a moment, I also agree that keeping the protocol
> simple is a very good thing.
> 
> But I am not (yet) convinced (oh dear, those horns don't come off that
> easily do they >;^) that forcing resumptionTokens to have a longer
> life is actually simplifying the job of implementors. And I don't think
> short life resumptionTokens (less than a few days at least) will solve
> the restart problem.  Semantically, to me resumptionToken's are used as
> a protocol mechanism to link multiple packets into a single request.
> RestartTokens are used to recover after a failure by starting a
> completely new request.

We allow the DP to choose the syntax of the resumptionToken, and as this
discussion has revealed, there is some wiggle room in allowing them to
choose (perhaps "nudge" is better) the semantics as well.  Ultimately, I
think this is a good thing.  Most DPs won't need all of this.  But the
ones that want to implement a certain policy or effect can do so.

> 
> > at some point, you (as a harvester) are simply at the mercy of the
> > repository.  new features in the protocol won't change that.
> 
> That is true, but that does not mean to me that the protocol cannot
> be improved to make the protocol more robust. With OAI as it is,
> I am not going to try and crawl ETDCat any more. Even with more
> precise date stamps (let's say every ETDCat record has a different
> stamp), because results are not guaranteed to come back in sorted
> order, I cannot restart using from=. I must start again from scratch.

but if their resumptionTokens had a long life, and were idempotent within
that lifetime you would not have to start from scratch.  2.0 will allow
the specification of the former, and we should probably discuss the latter
some more.

> 
> I think the real question is will data provider implementers be
> happy with resumptionTokens lasting for a week. For me *personally*,
> it will be easier having two separate tokens. But I think it's wrong
> to design the protocol around my intended implementation (which does
> not even - and may never - exist! :-)

you better build your system after all this!  ;-)

seriously, you bring up a lot of good points.  a lot of this exchange
should probably be reflected in the implementation guide that will
accompany the protocol doc.

regards,

Michael

> 
> Alan

---
Michael L. Nelson
NASA Langley Research Center		m.l.nelson@larc.nasa.gov
MS 158, Hampton, VA 23681		http://www.ils.unc.edu/~mln/
+1 757 864 8511				+1 757 864 8342 (f)