FW: [OAI-implementers] Open Archives Initiative Protocol for Metadata Harvesting Version 2 news

Martin Vesely <Martin.Vesely@cern.ch>
Thu, 7 Feb 2002 23:14:30 +0100 (CET)


The described way of caching data is very similar to how the OAI flow
control is done in our repository. But still, I do not see how we can
get rid of resumption tokens.

Since HTTP is request-response, the server cannot send the cached data
without a further request. That subsequent request has to identify
itself, so the server knows which data are being requested. A resumption
token is nothing but such an identifier for the server, and for the
purpose of identification it is necessary.
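
To illustrate the point, here is a minimal sketch of a server-side
cache in which the token does nothing more than identify which cached
result list, and which position in it, a follow-up request refers to.
The class TokenStore, its method names and the page size are all
hypothetical; this is not taken from our repository or from the
protocol document.

```python
import secrets


class TokenStore:
    """Hypothetical in-memory cache: token -> (cached result list, next offset)."""

    def __init__(self, page_size=100):
        self.page_size = page_size
        self._pending = {}

    def first_page(self, results):
        # Compute (and cache) the whole result list, then return its first part.
        return self._page(list(results), 0)

    def next_page(self, token):
        # The token only tells the server which cached list to continue.
        # An unknown or already used token raises KeyError.
        results, offset = self._pending.pop(token)
        return self._page(results, offset)

    def _page(self, results, offset):
        end = offset + self.page_size
        chunk = results[offset:end]
        token = None
        if end < len(results):
            # More data remains: issue a fresh identifier for the next request.
            token = secrets.token_hex(8)
            self._pending[token] = (results, end)
        return chunk, token
```

A token of None marks the last part of the response. Without some such
identifier in the follow-up request, the server has no way to tell
which cached list is meant.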

I think that a harvesting transaction inevitably involves such an amount
of data that it can hardly be delivered within one HTTP request-response
session. Therefore there has to be some mechanism to enable the transfer
in smaller parts over a sequence of sessions.

There is always the option of issuing ListIdentifiers followed by a
sequence of GetRecord requests, which is the way to avoid resumption
tokens for those who have this preference.
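
As a rough sketch of that alternative, the request URLs could be built
as below. The base URL, the example identifier and the helper function
names are made up for illustration, and the sketch assumes that the
identifier list itself fits into a single response and that the
repository accepts a metadataPrefix of oai_dc.

```python
from urllib.parse import urlencode

BASE_URL = "http://example.org/oai"  # hypothetical repository base URL


def list_identifiers_url(base_url, metadata_prefix="oai_dc"):
    """URL for one ListIdentifiers request (assumed to return the full list)."""
    query = urlencode({"verb": "ListIdentifiers",
                       "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def get_record_urls(base_url, identifiers, metadata_prefix="oai_dc"):
    """One GetRecord URL per identifier, to be requested one after another."""
    urls = []
    for identifier in identifiers:
        query = urlencode({"verb": "GetRecord",
                           "identifier": identifier,
                           "metadataPrefix": metadata_prefix})
        urls.append(base_url + "?" + query)
    return urls
```

The harvester would fetch the first URL, extract the identifiers from
the response, and then walk through the GetRecord URLs at its own pace.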

A connected issue: Could the protocol be made stateless by a statement
about the OAI transaction? If so, this statement should be added to the
protocol. For example: An OAI-transaction is composed of an OAI-request
and a full OAI-response. Until the last package of the OAI-response is
delivered, the OAI-transaction is considered unfinished.

This is the same way stateless HTTP deals with the underlying packets.
This way the harvester cannot supply any part of the response before it
has the complete harvest available.
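
On the harvester side, such a rule could look roughly like the sketch
below. The function fetch_page is a hypothetical callable standing in
for one HTTP request; it takes the previous token (or None for the
first request) and returns one part of the response plus the next
token, with None meaning the last part has been delivered.

```python
def harvest(fetch_page):
    """Collect all parts of one OAI-transaction before handing anything on."""
    records = []
    token = None
    while True:
        chunk, token = fetch_page(token)
        records.extend(chunk)
        if token is None:
            # Last package delivered: only now is the transaction finished.
            return records
```

If any request fails, the exception propagates and nothing is returned,
so the caller never sees a partial harvest.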


CERN Document Server ** <http://cds.cern.ch/> ** <cds.support@cern.ch>
Room: Bldg 510-1-015 ** Voice: +41-22-7673527 ** Fax: +41-22-7678142

On Wed, 6 Feb 2002, Walter Underwood wrote:

>--On Wednesday, February 6, 2002 2:18 PM -0500 Simeon Warner <simeon@cs.cornell.edu> wrote:
>> My main objection to including an option for harvesters to specify the
>> maximum number of records they wish to get in a reply is that this will
>> force ALL repositories to implement resumptionTokens.  Currently, small
>> repositories (say a few thousand records) can happily ignore that part of
>> the spec.
>I suggest getting rid of resumption tokens to make it
>simpler for all sizes of repositories.
>A very simple server can always calculate the entire result list,
>then send the portion requested, for example, records 21-30.
>Cache the result list to speed things up.
>Internally, this is much easier to implement than resumption
>tokens. Caching is independent of the correctness of the list,
>so the two are loosely coupled. For simple databases, the slow
>part is getting data from disk, and the existing OS file cache
>will already provide the most important level of caching.
>Large systems will probably use commercial databases, which
>provide additional levels of caching.
>A repository with only a few thousand records could load them
>into memory at startup and reboot when there is a change.
>1K per record, 10K records is only 10Meg. No caching needed.
>Walter R. Underwood
>Senior Staff Engineer
>Inktomi Enterprise Search