[OAI-implementers] resumptionToken Implementation
tdb01r at ecs.soton.ac.uk
Wed Sep 29 06:49:05 EDT 2004
Matthew Cockerill wrote:
> BioMed Central similarly uses a stateless approach for resumption tokens,
> and I too have been concerned about long-term scalability using
> (a) the stateless approach:
> Retrieving items 999,900 to 1,000,000 of an ordered set from a database
> tends to be a very expensive operation, and making 10,000 such 100-item
> requests in order to retrieve a full listing from an OAI-enabled database
> containing a million records is clearly vastly more expensive (in terms of
> resources) than, say downloading a compressed file containing the data for
> all 1 million records in one go.
Celestial maintains a "cursor" column consisting of the concatenated
datestamp and id (it actually uses only the last three digits of the id,
to keep the cursor shorter - I've made the assumption that no more than
1000 records can be stored within a second).
This allows a daterange/resumption token to be efficiently handled.
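For anyone who hasn't seen the technique, here's a minimal sketch of that
kind of cursor column (schema, names, and the SQLite backend are my own
invention for illustration - Celestial's real code will differ):

```python
import sqlite3

# Hypothetical schema, purely to show the cursor mechanics.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, datestamp TEXT)")
db.executemany("INSERT INTO records VALUES (?, ?)",
               [(i, "2004-09-%02d" % (i % 28 + 1)) for i in range(1, 51)])

def make_cursor(datestamp, rec_id):
    # Concatenate the datestamp with the last three digits of the id,
    # assuming no more than 1000 records share one datestamp.
    return "%s%03d" % (datestamp, rec_id % 1000)

def page(cursor, limit=10):
    # One indexed range scan resumes the listing: everything with a
    # cursor value strictly greater than the one last handed out.
    rows = db.execute(
        "SELECT id, datestamp FROM records "
        "WHERE datestamp || printf('%03d', id % 1000) > ? "
        "ORDER BY datestamp, id LIMIT ?", (cursor, limit)).fetchall()
    token = make_cursor(rows[-1][1], rows[-1][0]) if rows else None
    return rows, token

rows, token = page("")  # empty cursor starts from the beginning
```

Each page costs one seek plus a short scan, rather than re-counting from
row zero the way an OFFSET-style resumption token does.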
Things get slow when that result set has to be filtered for Sets. The
N:M nature of Sets makes it a real pain in the ass.
It is essential that a harvester requests changes since when it
*started* its harvest, and not when it finished. Strictly, the
harvester needs to use the OAI request timestamp returned by the
repository in the first response.
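In code, that rule is just: remember the responseDate from the *first*
response of a harvest and use it as the `from` argument next time. The
element name comes from OAI-PMH 2.0; the harvester plumbing around it is
hypothetical:

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def next_from(first_response_xml):
    # Use the repository's own clock (responseDate) from the first
    # response, not the local clock and not the time the harvest
    # finished - records modified while paging through would otherwise
    # be skipped on the next incremental harvest.
    root = ET.fromstring(first_response_xml)
    return root.findtext(OAI + "responseDate")

sample = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    '<responseDate>2004-09-29T10:49:05Z</responseDate>'
    '</OAI-PMH>')
print(next_from(sample))  # 2004-09-29T10:49:05Z
```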
All the best,
> (b) a stateful approach
> Caching lots of resultsets in the middle tier doesn't really seem easily
> scalable to very large sets, since cached resultsets tend to be inherently
> memory-resident. A database temporary table for each new request could be
> used, but would create its own resource issues.
> I guess that the best that can be done is to sort items by a unique, ordered
> accession number/id (which doesn't change if an item is updated) and to use
> this value as the resumption token, rather than using "offset within the
> ordered set" as the resumption token. This should help both reliability and
> performance, since appropriate relational database indexes allow
> set=xxxx and accessionnumber>yyyy
> to be tuned pretty effectively, in a way that
> set=xxxx and offset>yyyyyy
> cannot be.
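The difference comes down to the shape of the SQL the resumption token
produces. A quick sketch (table, column, and index names invented; SQLite
used just to show the query plan):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (accession INTEGER PRIMARY KEY, set_spec TEXT)")
db.execute("CREATE INDEX idx_set_acc ON items (set_spec, accession)")

# Offset-style resumption: the database must generate and discard
# 999,900 rows before returning the 100 you asked for.
offset_q = ("SELECT accession FROM items WHERE set_spec = ? "
            "ORDER BY accession LIMIT 100 OFFSET 999900")

# Keyset-style resumption: one indexed seek past the last-seen
# accession number, regardless of how deep into the set you are.
keyset_q = ("SELECT accession FROM items "
            "WHERE set_spec = ? AND accession > ? "
            "ORDER BY accession LIMIT 100")

plan = db.execute("EXPLAIN QUERY PLAN " + keyset_q, ("x", 0)).fetchall()
# SQLite reports an index search (a seek) rather than a full scan.
```

The keyset form is also why an unchanging accession number matters: if the
token were an offset, any insert or delete between requests shifts every
subsequent row.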
> Matthew Cockerill Ph.D.
> Technical Director
> BioMed Central ( http://www.biomedcentral.com/ )
> 34-42, Cleveland Street
> W1T 4LB
> Tel 020 7631 9127
> Fax: 020 7580 1938
> Email: matt at biomedcentral.com
>>DSpace uses the 'stateless' approach - see
>>and scroll down a bit. The sorting is done by (internal
>>database) ID so de-dupping shouldn't be an issue for the
>>harvester. However your corner case may just cause a
>>problem, or weird side-effect.
>>Say you're harvesting date range X-Y. When you first issue
>>the request, a certain set of items have a 'last modified'
>>date within that range, so DSpace returns a load, and a
>>resumption token. If some items are then modified so that
>>their 'last modified' date is outside the date range X-Y,
>>they'll drop off that list, so suddenly item Z that was
>>result number 101 of those items is now result number 99, and
>>the next harvest request will miss it, since DSpace will
>>think that Z was already served up in the first request.
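That corner case is easy to reproduce in a few lines (a toy model of
offset-based resumption over a date-filtered list, not DSpace's actual
code):

```python
# 200 records, all initially dated inside the harvested range.
records = [("rec%03d" % i, "2004-09-%02d" % (i % 20 + 1)) for i in range(200)]

def list_records(frm, until, offset, page=100):
    # Re-evaluate the date filter on every request, as a stateless
    # repository must, then apply the offset.
    hits = [r for r in records if frm <= r[1] <= until]
    return hits[offset:offset + page]

first = list_records("2004-09-01", "2004-09-20", 0)
# Between the two requests, two already-served records are re-dated
# outside the range...
records = [(rid, "2004-10-01") if rid in ("rec000", "rec001") else (rid, d)
           for rid, d in records]
second = list_records("2004-09-01", "2004-09-20", 100)
# ...so the items that were at positions 100-101 slide back into the
# first page's window and are never returned at all.
harvested = {rid for rid, _ in first + second}
```

Running this, rec100 and rec101 are missing from `harvested`, exactly the
"result 101 becomes result 99" situation described above.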
>>DSpace would probably work OK in the scenario you've
>>mentioned if the date range specified is X-(present) or no
>>date range; results are sorted by ID so the order wouldn't
>>change, new items would appear at the end of the list and
>>updated items wouldn't have 'moved'.
>>Deleted items might be a bit yucky though...
>>Maybe you could 'freeze' a result set when a harvest comes
>>in, but that may not scale up when your repository is huge
>>and the number of harvests is large (caching dozens of
>>100,000-long result sets?)
>>Solutions on a postcard to...