[OAI-implementers] Better resumption mechanism - more important than ever!

Wed, 6 Mar 2002 18:11:38 +1100

On Tue, Mar 05, 2002 at 10:42:26PM -0500, Michael L. Nelson wrote:
> >Does OAI 2.0 say that resumptionToken's must be unique within
> >a download? And that reusing an old resumptionToken must be
> >supported (or rejected with an error)? If not guaranteed by
> >the spec, then I would not want to write a harvester relying
> >on it. I would rather spend the effort and get the spec right
> >rather than having to come to agreements with individual data
> >providers.
> 
> I don't think the spec currently requires that a repository reject expired
> resumptionTokens, but a harvester would be wise not to use them if they
> are expired.  it's like drinking milk a day or two after the expiration
> date:  it's *probably* ok, but you gotta be pretty thirsty to do it.

This is not quite what I meant. It's more that if my resumption token is
just a result set name and the DP remembers where the transfer is
up to, then reusing the same token would get the next N records
rather than repeating the previous batch. The current spec allows this.
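
To make the distinction concrete, here is a rough Python sketch of the
two behaviours. The class names, the "name:offset" token layout and the
batch size are all made up for illustration; nothing here comes from
the spec.

```python
class StatefulDP:
    """Token is only a result set name; the server remembers the position."""

    def __init__(self, records, batch=2):
        self.records = records
        self.batch = batch
        self.position = {}  # result set name -> next offset to send

    def list_records(self, token):
        start = self.position.get(token, 0)
        # Side effect on every call: replaying the token skips ahead.
        self.position[token] = start + self.batch
        return self.records[start:start + self.batch]


class IdempotentDP:
    """Token carries the offset itself, so a replay returns the same batch."""

    def __init__(self, records, batch=2):
        self.records = records
        self.batch = batch

    def list_records(self, token):
        # Hypothetical token layout: "<result set name>:<offset>".
        name, offset = token.rsplit(":", 1)
        start = int(offset)
        end = start + self.batch
        next_token = f"{name}:{end}" if end < len(self.records) else None
        return self.records[start:end], next_token
```

With the first class a harvester that resends "rs1" after a dropped
connection silently loses a batch; with the second it gets the same
batch back.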

I think OAI 2.0 should allow a DP to advertise (via Identify?) that
its resumptionTokens can be reused (are idempotent) for retries.
That would satisfy me.

> > There is a difference, but is the difference worth the complexity
> > to the protocol? That is a different question.
> 
> I'll rephrase my answer:  the repository can implement it so there is no
> difference.

I agree that a DP can implement idempotent resumption tokens, but how
does a harvester know that the DP has implemented them? Either OAI 2.0
must mandate it (possibly overly restrictive for smaller repositories),
or DPs must be able to advertise it in a standard way, such as in the
Identify response.
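
As a sketch of why the harvester needs to know, here is roughly how
the retry logic would branch. The tokens_idempotent flag stands for
whatever the DP would advertise; no such field exists in the spec
today, and fetch(), harvest() and make_flaky_fetch() are hypothetical
names of mine.

```python
def harvest(fetch, first_token, tokens_idempotent, max_retries=3):
    """Collect all records. fetch(token) -> (records, next_token) and
    raises OSError if the transfer fails part way."""
    records, token = [], first_token
    while token is not None:
        for attempt in range(max_retries + 1):
            try:
                batch, next_token = fetch(token)
                break
            except OSError:
                if not tokens_idempotent:
                    # Resending the token might skip or repeat records.
                    raise RuntimeError("cannot retry safely; restart harvest")
                if attempt == max_retries:
                    raise
        records.extend(batch)
        token = next_token
    return records


def make_flaky_fetch(pages, fail_on):
    """Fake DP for trying this out: pages maps token -> (records,
    next_token), and the first request for fail_on fails."""
    failed = set()

    def fetch(token):
        if token == fail_on and token not in failed:
            failed.add(token)
            raise OSError("connection dropped")
        return pages[token]

    return fetch
```

Without the advertised flag the only safe choice is the RuntimeError
branch, i.e. starting the whole download again.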

So not much needs to be done, but something does need to be done.
The 1.1 spec at present is not enough.

> this is an artifact of your implementation... write the result set out to
> disk and set the expirationDate to a few days.  add a reasonable response
> caching algorithm, and you could end up with a huge performance
> win.  Depending on the DP accession rate, harvesting patterns, etc., your
> mileage could vary, but I suspect it would be very good.

I would never write the result set out to disk. For a very large
result set (e.g. 10,000,000 records), I would have to fetch all the
records (lots of disk accesses) to get their OAI identifiers, then start
transferring. Then how long should the temporary file be kept around?
How many people might be doing transfers at the same time?
(A Z39.50 result set is not a client side data structure, but
a server side data structure by the way.)

But Liu had a good solution - just store both what I called resumptionToken
and restartToken in the resumptionToken, i.e. the result set name and
query. If the result set has timed out, use the query part and build
it up again. It's up to me to get it correct. I personally would
have problems with cursors and list sizes (I would not support them
because if I redid the query, the result set size may change and
so both the size and cursor would be invalidated). But I can munge
my own DP implementation stuff in there to do something pretty similar
(my own internal concept of a 'cursor').
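
A minimal sketch of that combined token, assuming a JSON payload
wrapped in base64. The field names, the DataProvider class and the
run_query callable are my own inventions for illustration, and "pos"
plays the part of my internal 'cursor'.

```python
import base64
import json


def encode_token(result_set, query, position):
    raw = json.dumps({"rs": result_set, "q": query, "pos": position})
    return base64.urlsafe_b64encode(raw.encode()).decode()


def decode_token(token):
    return json.loads(base64.urlsafe_b64decode(token.encode()))


class DataProvider:
    def __init__(self, run_query, batch=2):
        self.run_query = run_query  # hypothetical: query -> list of records
        self.batch = batch
        self.result_sets = {}       # name -> records; entries may time out
        self.counter = 0

    def _build(self, query):
        self.counter += 1
        name = f"rs{self.counter}"
        self.result_sets[name] = self.run_query(query)
        return name

    def list_records(self, query=None, token=None):
        if token is None:
            name, pos = self._build(query), 0
        else:
            t = decode_token(token)
            name, query, pos = t["rs"], t["q"], t["pos"]
            if name not in self.result_sets:
                # Result set timed out: redo the query. If the data has
                # changed meanwhile, pos may no longer line up exactly.
                name = self._build(query)
        records = self.result_sets[name]
        end = pos + self.batch
        next_token = encode_token(name, query, end) if end < len(records) else None
        return records[pos:end], next_token
```

Because the position travels in the token rather than living on the
server, replaying an old token returns the same batch whether or not
the result set is still around.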

> 2.0 will already have more machine processable information in the Identify
> response.  I'm not sure there is a good way around it, and since that
> door is already open, if you want to provide hints about how your
> resumptionTokens are used/implemented, that's surely ok.

Ok, then I think if DPs advertise a little more about the idempotency of
their resumptionTokens, everything is fine. Implementors for large
repositories should try to give resumption tokens a long time-to-live,
but no protocol change is required.

> but if their resumptionTokens had a long life, and were idempotent within
> that lifetime you would not have to start from scratch.  2.0 will allow
> the specification of the former, and we should probably discuss the latter
> some more.

Agreed. The simplest solution is (as above) to allow a server to advertise
that its resumptionTokens are idempotent.

> you better build your system after all this!  ;-)

*-)

One problem is I don't have any data to export - only data that other
people have made available. The other problem relates to the number of
hours in the day. I still want to put my harvested collection up
for public access too, if I can scrounge up the disk space.

> seriously, you bring up a lot of good points.  a lot of this exchange
> should probably be reflected in the implementation guide that will
> accompany the protocol doc.

I think the conclusions, such as 'advertise idempotency, and make resumption
tokens long-lived to handle the case where a harvester hits a problem and
waits for a human to try and keep going', are worth documenting, not the rest.
There are always the mail archives.

Alan