[OAI-implementers] resumptionToken Implementation
tdb01r at ecs.soton.ac.uk
Wed Sep 29 06:49:05 EDT 2004
Matthew Cockerill wrote:
> BioMed Central similarly uses a stateless approach for resumption tokens,
> and I too have been concerned about long-term scalability using
> (a) the stateless approach:
> Retrieving items 999,900 to 1,000,000 of an ordered set from a database
> tends to be a very expensive operation, and making 10,000 such 100-item
> requests in order to retrieve a full listing from an OAI-enabled database
> containing a million records is clearly vastly more expensive (in terms of
> resources) than, say downloading a compressed file containing the data for
> all 1 million records in one go.
Celestial maintains a "cursor" column consisting of the concatenated
datestamp and id (it actually uses only the last three digits of the id,
to keep the cursor shorter - I've made the assumption that no more than
1000 records can be stored within a second).
This allows a daterange/resumption token to be efficiently handled.
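For anyone who hasn't seen the technique, here's a minimal sketch of that
kind of cursor column (schema, names, and the SQLite backend are my own
invention for illustration - Celestial's real code will differ):

```python
import sqlite3

# Hypothetical schema, purely to show the cursor mechanics.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, datestamp TEXT)")
db.executemany("INSERT INTO records VALUES (?, ?)",
               [(i, "2004-09-%02d" % (i % 28 + 1)) for i in range(1, 51)])

def make_cursor(datestamp, rec_id):
    # Concatenate the datestamp with the last three digits of the id,
    # assuming no more than 1000 records share one datestamp.
    return "%s%03d" % (datestamp, rec_id % 1000)

def page(cursor, limit=10):
    # One indexed range scan resumes the listing: everything with a
    # cursor value strictly greater than the one last handed out.
    rows = db.execute(
        "SELECT id, datestamp FROM records "
        "WHERE datestamp || printf('%03d', id % 1000) > ? "
        "ORDER BY datestamp, id LIMIT ?", (cursor, limit)).fetchall()
    token = make_cursor(rows[-1][1], rows[-1][0]) if rows else None
    return rows, token

rows, token = page("")  # empty cursor starts from the beginning
```

Each page costs one seek plus a short scan, rather than re-counting from
row zero the way an OFFSET-style resumption token does.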
Things get slow when that result set has to be filtered for Sets. The
N:M nature of Sets makes it a real pain in the ass.
It is essential that a harvester requests changes since when it
*started* its harvest, and not when it finished. Strictly, the
harvester needs to use the OAI request timestamp returned by the
repository in the first response.
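In code, that rule is just: remember the responseDate from the *first*
response of a harvest and use it as the `from` argument next time. The
element name comes from OAI-PMH 2.0; the harvester plumbing around it is
hypothetical:

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def next_from(first_response_xml):
    # Use the repository's own clock (responseDate) from the first
    # response, not the local clock and not the time the harvest
    # finished - records modified while paging through would otherwise
    # be skipped on the next incremental harvest.
    root = ET.fromstring(first_response_xml)
    return root.findtext(OAI + "responseDate")

sample = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    '<responseDate>2004-09-29T10:49:05Z</responseDate>'
    '</OAI-PMH>')
print(next_from(sample))  # 2004-09-29T10:49:05Z
```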
All the best,
> (b) a stateful approach
> Caching lots of resultsets in the middle tier doesn't really seem easily
> scalable to very large sets, since cached resultsets tend to be inherently
> memory-resident. A database temporary table for each new request could be
> used, but would create its own resource issues.
> I guess that the best that can be done is to sort items by a unique, ordered
> accession number/id (which doesn't change if an item is updated) and to use
> this value as the resumption token, rather than using "offset within the
> ordered set" as the resumption token. This should help both reliability and
> performance, since appropriate relational database indexes allow
> set=xxxx and accessionnumber>yyyy
> to be tuned pretty effectively, in a way that
> set=xxxx and offset>yyyyyy
> cannot be.
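The difference comes down to the shape of the SQL the resumption token
produces. A quick sketch (table, column, and index names invented; SQLite
used just to show the query plan):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (accession INTEGER PRIMARY KEY, set_spec TEXT)")
db.execute("CREATE INDEX idx_set_acc ON items (set_spec, accession)")

# Offset-style resumption: the database must generate and discard
# 999,900 rows before returning the 100 you asked for.
offset_q = ("SELECT accession FROM items WHERE set_spec = ? "
            "ORDER BY accession LIMIT 100 OFFSET 999900")

# Keyset-style resumption: one indexed seek past the last-seen
# accession number, regardless of how deep into the set you are.
keyset_q = ("SELECT accession FROM items "
            "WHERE set_spec = ? AND accession > ? "
            "ORDER BY accession LIMIT 100")

plan = db.execute("EXPLAIN QUERY PLAN " + keyset_q, ("x", 0)).fetchall()
# SQLite reports an index search (a seek) rather than a full scan.
```

The keyset form is also why an unchanging accession number matters: if the
token were an offset, any insert or delete between requests shifts every
subsequent row.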
> Matthew Cockerill Ph.D.
> Technical Director
> BioMed Central ( http://www.biomedcentral.com/ )
> 34-42, Cleveland Street
> W1T 4LB
> Tel 020 7631 9127
> Fax: 020 7580 1938
> Email: matt at biomedcentral.com
>>DSpace uses the 'stateless' approach - see
>>and scroll down a bit. The sorting is done by (internal
>>database) ID so de-dupping shouldn't be an issue for the
>>harvester. However your corner case may just cause a
>>problem, or weird side-effect.
>>Say you're harvesting date range X-Y. When you first issue
>>the request, a certain set of items have a 'last modified'
>>date within that range, so DSpace returns a load, and a
>>resumption token. If some items are then modified so that
>>their 'last modified' date is outside the date range X-Y,
>>they'll drop off that list, so suddenly item Z that was
>>result number 101 of those items is now result number 99, and
>>the next harvest request will miss it, since DSpace will
>>think that Z was already served up in the first request.
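That corner case is easy to reproduce in a few lines (a toy model of
offset-based resumption over a date-filtered list, not DSpace's actual
code):

```python
# 200 records, all initially dated inside the harvested range.
records = [("rec%03d" % i, "2004-09-%02d" % (i % 20 + 1)) for i in range(200)]

def list_records(frm, until, offset, page=100):
    # Re-evaluate the date filter on every request, as a stateless
    # repository must, then apply the offset.
    hits = [r for r in records if frm <= r[1] <= until]
    return hits[offset:offset + page]

first = list_records("2004-09-01", "2004-09-20", 0)
# Between the two requests, two already-served records are re-dated
# outside the range...
records = [(rid, "2004-10-01") if rid in ("rec000", "rec001") else (rid, d)
           for rid, d in records]
second = list_records("2004-09-01", "2004-09-20", 100)
# ...so the items that were at positions 100-101 slide back into the
# first page's window and are never returned at all.
harvested = {rid for rid, _ in first + second}
```

Running this, rec100 and rec101 are missing from `harvested`, exactly the
"result 101 becomes result 99" situation described above.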
>>DSpace would probably work OK in the scenario you've
>>mentioned if the date range specified is X-(present) or no
>>date range; results are sorted by ID so the order wouldn't
>>change, new items would appear at the end of the list and
>>updated items wouldn't have 'moved'.
>>Deleted items might be a bit yucky though...
>>Maybe you could 'freeze' a result set when a harvest comes
>>in, but that may not scale up when your repository is huge
>>and the number of harvests is large (caching dozens of
>>100,000-long result sets?)
>>Solutions on a postcard to...