[OAI-implementers] resumptionToken Implementation
robert.tansley at hp.com
Tue Sep 28 18:03:23 EDT 2004
> Sample case:
> Harvester issues a query. DP sends back 100 out of 10,000 results.
> Harvester then begins to request the consecutive chunks. Given that the
> total data set is 10,000, this will probably take a while. Before the
> entire result set is transferred, the DP updates its repository, which
> shuffles the order in which the results are returned. Objects that were
> transferred previously are now kicked back to a later position, so they
> are included in a chunk later requested by the harvester.
> Does the DP now invalidate the resumptionToken or does it assume the
> Harvester will de-dupe objects on its side?
> What about the new objects that have been added and are in chunks of
> the result set already transferred? Is it assumed that they will be
> caught the next time around, given that the modifydate SHOULD be later
> than the last harvest date? Or is it the harvester's responsibility to
> straighten this all out?
DSpace uses the 'stateless' approach - see http://dspace.org/technology/system-docs/application.html#oai and scroll down a bit. The sorting is done by (internal database) ID, so de-duping shouldn't be an issue for the harvester. However, your corner case may just cause a problem or a weird side effect.
Say you're harvesting date range X-Y. When you first issue the request, a certain set of items has a 'last modified' date within that range, so DSpace returns a load of them and a resumption token. If some of the items already returned are then modified so that their 'last modified' date is outside the date range X-Y, they'll drop off that list, so suddenly item Z, which was result number 101 of those items, is now result number 99, and the next harvest request will miss it, since DSpace will think that Z was already served up in the first request.
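The skipped-item effect can be reproduced with a toy model. The sketch below is not DSpace code. It assumes a stateless token that carries nothing but an offset into the ID-sorted list of items matching the date range, and a chunk size of 100 as in the sample case. The names `Item`, `list_records` and `CHUNK` are made up for illustration.

```python
from dataclasses import dataclass

CHUNK = 100  # records per response (assumed, as in the sample case)

@dataclass
class Item:
    id: int             # internal database ID, used as the sort key
    last_modified: int  # simplified timestamp

def list_records(items, start, end, offset):
    """Stateless paging: re-run the query each time, then skip `offset` rows."""
    matching = sorted(
        (i for i in items if start <= i.last_modified <= end),
        key=lambda i: i.id,
    )
    page = matching[offset:offset + CHUNK]
    next_offset = offset + CHUNK if offset + CHUNK < len(matching) else None
    return page, next_offset  # next_offset stands in for the resumptionToken

# 300 items, all last modified at time 5, harvested with the range 0..10.
repo = [Item(id=n, last_modified=5) for n in range(1, 301)]

first, token = list_records(repo, 0, 10, 0)  # serves IDs 1..100

# Before the next request, two already-served items are modified and
# their dates move outside the range.
repo[9].last_modified = 50   # ID 10
repo[19].last_modified = 50  # ID 20

second, token = list_records(repo, 0, 10, token)  # now serves IDs 103..202

harvested = {i.id for i in first} | {i.id for i in second}
missed = [n for n in (101, 102) if n not in harvested]
print(missed)  # [101, 102]
```

Two items dropping out of the range shifts everything after them up by two places, so the items that were results 101 and 102 now sit at positions 99 and 100 and are never served.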
DSpace would probably work OK in the scenario you've mentioned if the date range specified is X-(present) or no date range; results are sorted by ID so the order wouldn't change, new items would appear at the end of the list and updated items wouldn't have 'moved'.
Deleted items might be a bit yucky though...
Maybe you could 'freeze' a result set when a harvest comes in, but that may not scale when your repository is huge and the number of harvests is large (caching dozens of 100,000-long result sets?).
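For what it's worth, the 'freeze' idea might look roughly like the sketch below. This is only an illustration of the trade-off, not something DSpace does: it assumes the provider keeps the full list of matching IDs in memory for each in-progress harvest, keyed by a random token. `SnapshotStore` and its methods are hypothetical names, and token expiry is left out.

```python
import uuid

CHUNK = 100  # records per response (assumed)

class SnapshotStore:
    """Keeps one frozen ID list per in-progress harvest."""

    def __init__(self):
        self._snapshots = {}  # token -> (frozen ID list, next offset)

    def start(self, matching_ids):
        """Freeze the result set and return the first page plus a token."""
        token = uuid.uuid4().hex
        self._snapshots[token] = (list(matching_ids), 0)
        return self.resume(token)

    def resume(self, token):
        """Serve the next page from the frozen list, ignoring later changes."""
        ids, offset = self._snapshots[token]
        page = ids[offset:offset + CHUNK]
        if offset + CHUNK >= len(ids):
            del self._snapshots[token]  # harvest complete, free the memory
            return page, None
        self._snapshots[token] = (ids, offset + CHUNK)
        return page, token

    def cached_ids(self):
        """Total number of IDs held across all open harvests."""
        return sum(len(ids) for ids, _ in self._snapshots.values())

store = SnapshotStore()
live_ids = list(range(1, 301))
first, token = store.start(live_ids)

# Items leaving the live result set do not affect the frozen copy.
live_ids.remove(10)
live_ids.remove(20)

second, token = store.resume(token)
print(second[0], store.cached_ids())  # 101 300
```

The second page starts at ID 101 as the harvester expects, but the store is holding all 300 IDs for this one harvest. The memory held grows with the number of open harvests times the size of each result set, which is exactly the scaling worry.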
Solutions on a postcard to...
Robert Tansley / Digital Media Systems Programme / HP Labs