[OAI-implementers] resumptionToken Implementation

Tue Sep 28 18:32:42 EDT 2004

After reading the couple of responses, my first inclination is to:

1) Store the parameters and regenerate the query when the resumption 
token is received.

I think I will also store the total number of results from when the
query was first run. If the current total differs from that number, the
DP invalidates the token and tells the harvester. This is the best way
I can think of to deal with this issue.
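
A minimal sketch of that idea, under stated assumptions: the token
itself carries the original parameters, the offset, and the total seen
on the first request, so the DP keeps no per-harvester state. RECORDS,
run_query, list_records and the token layout are hypothetical names for
illustration, not taken from any existing implementation.

```python
import base64
import json

# Hypothetical in-memory repository: (internal id, last-modified date) pairs.
RECORDS = [(i, "2004-09-%02d" % (i % 28 + 1)) for i in range(1, 251)]
CHUNK_SIZE = 100


class BadResumptionToken(Exception):
    """Raised when a token is unreadable or no longer matches the result set."""


def run_query(params):
    # Regenerate the result list from the stored parameters, sorted by id.
    lo, hi = params["from"], params["until"]
    return sorted(r for r in RECORDS if lo <= r[1] <= hi)


def encode_token(params, offset, total):
    # Assumed token layout: base64-encoded JSON of the query state.
    state = {"params": params, "offset": offset, "total": total}
    return base64.urlsafe_b64encode(json.dumps(state).encode()).decode()


def decode_token(token):
    try:
        return json.loads(base64.urlsafe_b64decode(token.encode()))
    except ValueError:
        raise BadResumptionToken("unreadable token")


def list_records(params=None, token=None):
    if token is not None:
        state = decode_token(token)
        params, offset, expected = state["params"], state["offset"], state["total"]
    else:
        offset, expected = 0, None
    results = run_query(params)
    # Invalidate the token if the total changed since the first request.
    if expected is not None and len(results) != expected:
        raise BadResumptionToken("result set changed since the first request")
    chunk = results[offset:offset + CHUNK_SIZE]
    next_offset = offset + len(chunk)
    next_token = None
    if next_offset < len(results):
        next_token = encode_token(params, next_offset, len(results))
    return chunk, next_token
```

One limitation of comparing totals only: a change that removes one
record from the result set and adds another leaves the count the same,
so this check would not notice it.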

Any other suggestions would certainly be entertained.

On Sep 28, 2004, at 3:03 PM, Tansley, Robert wrote:

>> Sample case:
>>
>> Harvester issues a query. DP sends back 100 out of 10,000 results.
>> Harvester then begins to request the consecutive chunks. Given that the
>> total data set is 10,000, this will probably take a while. Before the
>> entire result set is transferred, the DP updates its repository, which
>> shuffles the order in which the results are returned. Objects that were
>> transferred previously are now kicked back to a later position, so they
>> are included in a chunk later requested by the harvester.
>>
>> Does the DP now invalidate the resumptionToken or does it assume the
>> Harvester will de-dupe objects on its side?
>>
>> What about the new objects that have been added and are in chunks of
>> the resultset already transferred? Is it assumed that they will be
>> caught the next time around given that the modifydate SHOULD be later
>> than the last harvest date? Or is it the harvester's responsibility to
>> straighten this all out?
>
> DSpace uses the 'stateless' approach - see 
> http://dspace.org/technology/system-docs/application.html#oai and 
> scroll down a bit.  The sorting is done by (internal database) ID so 
> de-duping shouldn't be an issue for the harvester.  However, your 
> corner case may just cause a problem, or weird side-effect.
>
> Say you're harvesting date range X-Y.  When you first issue the 
> request, a certain set of items have a 'last modified' date within 
> that range, so DSpace returns a load, and a resumption token.  If some 
> items are then modified so that their 'last modified' date is outside 
> the date range X-Y, they'll drop off that list, so suddenly item Z 
> that was result number 101 of those items is now result number 99, and 
> the next harvest request will miss it, since DSpace will think that Z 
> was already served up in the first request.
>
> DSpace would probably work OK in the scenario you've mentioned if the 
> date range specified is X-(present) or no date range; results are 
> sorted by ID so the order wouldn't change, new items would appear at 
> the end of the list and updated items wouldn't have 'moved'.
>
> Deleted items might be a bit yucky though...
>
> Maybe you could 'freeze' a result set when a harvest comes in, but 
> that may not scale up when your repository is huge and the number of 
> harvests is large (caching dozens of 100,000-long result sets?)
>
> Solutions on a postcard to...
>
>  Robert Tansley / Digital Media Systems Programme / HP Labs
>   http://www.hpl.hp.com/personal/Robert_Tansley/
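
The date-range corner case Robert describes can be reproduced in a few
lines. This is only a toy simulation, not DSpace code: it assumes
results are sorted by ID and paged by offset, uses plain day numbers in
place of real dates, and all names (in_range, page, simulate) are made
up for the illustration.

```python
CHUNK_SIZE = 100


def in_range(items, lo, hi):
    # IDs of items whose 'last modified' day falls in [lo, hi], sorted by ID.
    return sorted(i for i, modified in items.items() if lo <= modified <= hi)


def page(items, lo, hi, offset):
    # Stateless paging: re-run the query, then skip 'offset' results.
    return in_range(items, lo, hi)[offset:offset + CHUNK_SIZE]


def simulate():
    # 150 items, all last modified on day 5; harvest range is days 1-10.
    items = {i: 5 for i in range(1, 151)}
    first = page(items, 1, 10, 0)

    # Two already-served items are modified; their date leaves the range.
    items[10] = 20
    items[20] = 20

    second = page(items, 1, 10, CHUNK_SIZE)
    harvested = set(first) | set(second)
    # Items still in range that no chunk ever delivered.
    return [i for i in in_range(items, 1, 10) if i not in harvested]


missed = simulate()
```

Here items 101 and 102 slide down to result numbers 99 and 100 after the
update, so the second request starts past them and they end up in
missed.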