[OAI-implementers] List Id's for multiple sets

Tim Brody tdb198@ecs.soton.ac.uk
Fri, 9 Feb 2001 09:07:31 +0000 (GMT)


On Thu, 8 Feb 2001, deridder wrote:

>   This is looking more complicated than I expected.  With no dates
> specified, and no sets specified, the list could be enormous;  and as more
> and more sets are added, the resumption tokens could get pretty hairy too.

Excuse my ignorance if this is already obvious to you:

(as suggested by Chris Gutteridge, this is how I have implemented
resumptionTokens)

Initial request:
Build a temporary table of all the identifiers that match the request.
This CAN get huge, but if you want harvesters to get all of your repository
there isn't much choice (indeed, I would argue this is more efficient
than enumerating over sets).
Output the first 400 records (or whatever) from the temporary table, using
the identifiers as an index into your database/file system. The
resumptionToken will be the name of your temporary table plus an encoded
string that tells you what the metadataPrefix is (required for ListRecords).

Temporary table is:
pos	int, auto_increment
id	char(64) ... this is the OAI identifier/your archive identifier; if
	you use the OAI identifier as your index, ListIdentifiers needs only
	the temporary table
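
For illustration only, here is a rough sketch of the initial request in
Python with sqlite3. It is not the code running on my server: the
"records" table, the "tmp_" naming scheme and the colon-separated token
layout (table name, last position sent, metadataPrefix) are all
assumptions made up for the example.

```python
import sqlite3
import uuid

PAGE_SIZE = 400  # "or whatever"; tune to taste


def start_list_request(conn, metadata_prefix):
    """Snapshot all matching identifiers, return (first_page, token)."""
    # Hypothetical naming scheme; the table name becomes part of the token.
    table = "tmp_" + uuid.uuid4().hex
    conn.execute(
        f"CREATE TABLE {table} "
        "(pos INTEGER PRIMARY KEY AUTOINCREMENT, id CHAR(64))"
    )
    # 'records' is an assumed archive table; a real query would also
    # apply any from/until/set restrictions given in the request.
    conn.execute(
        f"INSERT INTO {table} (id) "
        "SELECT identifier FROM records ORDER BY identifier"
    )
    rows = conn.execute(
        f"SELECT pos, id FROM {table} ORDER BY pos LIMIT ?", (PAGE_SIZE,)
    ).fetchall()
    last = conn.execute(f"SELECT MAX(pos) FROM {table}").fetchone()[0]
    ids = [row[1] for row in rows]
    token = None
    if rows and rows[-1][0] < last:
        # Assumed token layout: table name, last position sent, prefix.
        token = f"{table}:{rows[-1][0]}:{metadata_prefix}"
    return ids, token
```

If everything fits in one page the token comes back as None, so the
response simply carries no resumptionToken.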

Latter requests:
Fetch the next batch of identifiers from the temporary table by selecting
the rows where "pos > start", start being the last position already sent.
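
A sketch of how a follow-up request might look, again in Python with
sqlite3 and again only an illustration. The token layout
("table:position:metadataPrefix") and the "tmp_" plus 32 hex digits
table naming are assumptions of the example, not part of OAI. Because the
table name arrives from the client, it is checked against a pattern
before being put into the SQL.

```python
import re
import sqlite3

PAGE_SIZE = 400
# Assumed naming scheme for temporary tables: "tmp_" + 32 hex digits.
TABLE_NAME = re.compile(r"tmp_[0-9a-f]{32}")


def resume_list_request(conn, token):
    """Return (ids, next_token, metadata_prefix) for a resumptionToken."""
    table, start, metadata_prefix = token.split(":", 2)
    if not TABLE_NAME.fullmatch(table):
        # Never interpolate an unchecked client string into SQL.
        raise ValueError("bad resumptionToken")
    start = int(start)
    rows = conn.execute(
        f"SELECT pos, id FROM {table} WHERE pos > ? ORDER BY pos LIMIT ?",
        (start, PAGE_SIZE),
    ).fetchall()
    last = conn.execute(f"SELECT MAX(pos) FROM {table}").fetchone()[0]
    ids = [row[1] for row in rows]
    next_token = None
    if rows and rows[-1][0] < last:
        next_token = f"{table}:{rows[-1][0]}:{metadata_prefix}"
    return ids, next_token, metadata_prefix
```

Since pos is the primary key, the "pos > start" lookup is an index range
scan, which is why the later requests are cheap.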

To manage the temporary tables I have another table, the temp index, which
stores the table names and the last time they were accessed. Whenever a
query is started I remove old temporary tables and their associated
entries in the temp index. To make the resumptionToken even simpler you
could store the metadataPrefix in the index ...
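
The housekeeping could look something like the following sketch (Python
with sqlite3). The "temp_index" table name, its columns and the one-day
expiry are assumptions for the example; pick whatever lifetime suits your
harvesters.

```python
import sqlite3
import time

MAX_IDLE_SECONDS = 24 * 60 * 60  # assumed expiry, purely illustrative


def ensure_index(conn):
    """Create the temp index if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS temp_index ("
        "table_name TEXT PRIMARY KEY, "
        "metadata_prefix TEXT, "
        "last_access REAL)"
    )


def touch(conn, table_name, metadata_prefix, now=None):
    """Record (or refresh) the last access time of a temporary table."""
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO temp_index VALUES (?, ?, ?)",
        (table_name, metadata_prefix, now),
    )


def purge_stale(conn, now=None):
    """Drop temporary tables idle for longer than MAX_IDLE_SECONDS."""
    now = time.time() if now is None else now
    cutoff = now - MAX_IDLE_SECONDS
    stale = [
        row[0]
        for row in conn.execute(
            "SELECT table_name FROM temp_index WHERE last_access < ?",
            (cutoff,),
        )
    ]
    for name in stale:
        # Names come from our own index, never straight from a client.
        conn.execute(f'DROP TABLE IF EXISTS "{name}"')
    conn.execute("DELETE FROM temp_index WHERE last_access < ?", (cutoff,))
    return stale
```

Calling purge_stale() at the start of every query, and touch() whenever a
token is used, gives the behaviour described above without needing a
separate cron job.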

The initial request can be very slow, as it has to enumerate over your
entire archive, but subsequent requests are very quick. Each harvester (if
it is well behaved) will only need to do this once; subsequent queries
should use "from" to grab only the latest data.
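
From the harvester's side, an incremental request is just the same verb
with a "from" date added. A small sketch in Python; the base URL in the
usage line is a placeholder and the function name is made up for the
example.

```python
from datetime import date
from urllib.parse import urlencode


def incremental_request_url(base_url, metadata_prefix, last_harvest):
    """Build a ListRecords URL that asks only for records since last_harvest."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": last_harvest.isoformat(),  # YYYY-MM-DD
    }
    return base_url + "?" + urlencode(params)


# Placeholder base URL, for illustration only.
url = incremental_request_url("http://example.org/oai", "oai_dc", date(2001, 2, 8))
```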

e.g. (liable to be broken and knackered as is my wont)
http://cite-base.ecs.soton.ac.uk/cgi-bin/oai/OAI-script?debug=1&verb=ListRecords&metadataPrefix=oai_dc

As an aside, I have tried to write my OAI "bits" as a separate, non
archive-specific library - would people be interested in access to this (I
cannot guarantee its correctness or robustness, only that it supports the
bits of OAI that I've needed)?

All the best,
Tim Brody
Computer Science, University of Southampton
email: tdb198@soton.ac.uk
Web: http://www.ecs.soton.ac.uk/~tdb198/