[OAI-implementers] Support for Tim Cole's comments

Caroline Arms caar@loc.gov
Tue, 12 Feb 2002 13:13:01 -0500 (EST)

I would just like to endorse Tim Cole's comments about appreciating the
flexibility of the resumptionToken.  Our implementation is somewhat
similar to his in that there is no database management system, but static
files of records, which are updated infrequently.  In our case, a whole
"set" is likely to be updated at once, which means that fine granularity
of timestamps will not help with the issue of dividing the response into
chunks.  We have been using fairly small response sets for records and
expiring the resumptionTokens fairly quickly, in order to avoid problems
from major updates between issue of the token and its use.  We have no way
to ensure that the order would be the same after the update.

We would certainly be interested in hearing from harvesters if our chunks
are annoyingly small and if our short expiry times are causing problems.  
We made implementation decisions for these based on no information and
would be happy to reconsider based on real experience.
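For readers curious what a short-expiry policy might look like in practice:
one way to expire tokens quickly without keeping per-harvester state is to
embed the issue time in the token itself and reject anything issued too long
ago or before the last file update.  This is only an illustrative sketch, not
our actual code; all names and the TTL value are hypothetical.

```python
# Hypothetical sketch of a self-expiring resumptionToken.  The token carries
# its own issue time, so the server needs no stored state: it can reject
# tokens that are stale or that predate the last update of the static files
# (after which the record ordering is no longer guaranteed).
import time

TOKEN_TTL_SECONDS = 3600  # hypothetical short expiry window


def issue_token(cursor):
    """Embed the issue time alongside the harvest position."""
    return "%d:%d" % (int(time.time()), cursor)


def accept_token(token, last_update_time, now=None):
    """Return the cursor if the token is still valid, else None.

    None corresponds to an OAI-PMH badResumptionToken error response.
    """
    now = time.time() if now is None else now
    issued, cursor = (int(part) for part in token.split(":"))
    if now - issued > TOKEN_TTL_SECONDS:
        return None  # token expired
    if issued < last_update_time:
        return None  # files changed since issue; start the harvest over
    return cursor
```

The same check also handles the "major update between issue of the token and
its use" problem: any update invalidates all outstanding tokens at once.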

    Caroline Arms                                    caar@loc.gov
    National Digital Library Program
    Information Technology Services
PS I will be away for a week.  I'll deal with any responses to this
message on my return.

On Fri, 8 Feb 2002, Tim Cole wrote:

> Not to curtail the very interesting technical back and forth, but...
> The flexible and naive nature of the resumptionToken parameter, and the fact
> that the OAI-PMH doesn't allow Service Providers to request a fixed number
> of records, are very much by design.  The minimum granularity and inherent
> limitations of the datestamp argument were also decisions made after some
> thought.  Given the intended mission of the OAI PMH, I believe the decisions
> were correct.  (Whether there's really a niche for what OAI-PMH is intended
> to be is of course open to debate.)
> OAI PMH was created initially to facilitate interchange of metadata between
> E-Print archives.  These archives shared several characteristics -- among
> them that the data contained in the archive changed
> relatively slowly (i.e., on average relatively few new records added,
> changed or deleted day to day) and that the repositories were built on
> limited resources and with limited capabilities (some didn't even support
> keyword search of full-text of documents held in the repository).
> Accordingly OAI PMH built in a lot of flexibility (and a certain amount of
> wiggle room) for implementers, particularly metadata providers.  Timestamps
> with granularity of only 1 day were allowed.  Flow control was implemented
> in the least prescriptive, most stateless way possible.
> Some metadata provider services have been built to take advantage of this
> flexibility.  For instance I have an experimental OAI provider service that
> has no database management software behind it at all.  Instead it relies on
> the implementation platform's file system.  Metadata is stored in XML files
> and dynamically transformed into the requested metadata schema using XSLT
> at request time.  The size of the record chunk returned for a ListRecords
> request varies according to the number of records in each file system
> directory at the time the request is received.  The order in which records
> are returned is determined by the implementation platform's file system and
> typically is not chronological, meaning it will change between requests as
> records are added, deleted, and updated.  This implementation would not be
> able to return a fixed number of records specified by the Service Provider
> without substantial changes to its basic design.
> The resumption token as used in this implementation includes the requested
> metadata prefix, the date range values of the original request, and a list
> of remaining directories to be exported.  No state information is ever
> maintained on the server side, and the number of records returned in
> response to a request with a resumption token isn't determined until the
> request is received and processed.  (Thus a later request with the same
> resumption token may return more or fewer records.)  Datestamps are
> maintained to the day only (no hours, minutes, or seconds).  Implementing
> locking or creating some sort of state maintenance mechanism would require
> substantial and fundamental changes to the design of this implementation.
> I believe the implementation conforms to the current protocol document, and
> I'm reasonably sure that with only minor changes it will conform to the 2.0
> spec.  I've been surprised at how hard it is to break, though I certainly
> don't expect it to be as reliable and robust as some other implementations
> I've seen.  It does what it was designed to do.
> However this implementation clearly does not support precise harvesting
> along the lines that have been discussed on this list over the course of the
> last week or two.  The resumptionToken is not deterministic, but only a
> somewhat imprecise method used to chunk a long response.  I would contend
> that given that the provider implementation described is intended only to
> handle a repository of at most a few tens of thousands of metadata records
> and in which additions, updates, and deletions occur at most weekly, and
> more often monthly, the imprecise harvesting does not lead to poor
> representation of the metadata stored in my repository, and therefore should
> not be of concern to Service Providers.  Of course that's debatable.
> That is the question before the OAI Community at this point in time.  Is
> there really a niche for a relatively simple protocol that allows in certain
> instances for less precise harvesting?  (For instance we've known from the
> start that some re-harvesting occurs because datestamps only have
> granularity of one day.)  Can services built on such a protocol be useful --
> at least for certain purposes?  Obviously not for a bank trying to do
> financial transactions, but perhaps in the DL world.  A number of us are
> trying to answer these kinds of questions by empirical means rather than
> speculation.
> Given that there can be circumstances when a metadata provider might want to
> avoid overhead of a transactional database system, I would very much oppose
> moving OAI-PMH in the direction of SQL style transactions and cursors.  I
> would also oppose, especially as a required functionality, upgrading flow
> control to allow SPs to specify numbers of records wanted, or to specify
> resuming from a particular record (which implicitly assumes an ordered,
> persistent response object).  These changes would require providers to
> maintain state and would effectively require them to provide transactional
> functionalities -- things many of the current providers aren't in a good
> position to do.  The benefits of such changes for the target audience don't
> seem worth it.  (Which comes back to the question raised earlier about whether a
> niche protocol aimed at a particular target audience can survive.  I think
> it can, but we'll have to see.)
> Tim Cole
> University of Illinois at Urbana-Champaign
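[A note for implementers following this thread: the fully stateless
resumptionToken Tim describes -- one that carries the metadata prefix, the
original date range, and the list of remaining directories inside the token
itself -- could be sketched roughly as below.  This is a speculative
illustration, not code from his implementation; the encoding and all names
are assumptions.]

```python
# Sketch of a stateless resumptionToken: the full harvest context travels in
# the token, so the server stores nothing between requests.  The payload is
# base64-encoded only to make it an opaque, URL-safe string.
import base64


def encode_token(metadata_prefix, date_from, date_until, remaining_dirs):
    """Pack the harvest context into an opaque token string."""
    payload = "|".join([metadata_prefix, date_from, date_until,
                        ",".join(remaining_dirs)])
    return base64.urlsafe_b64encode(payload.encode("utf-8")).decode("ascii")


def decode_token(token):
    """Recover the harvest context from a token issued by encode_token."""
    payload = base64.urlsafe_b64decode(token.encode("ascii")).decode("utf-8")
    metadata_prefix, date_from, date_until, dirs = payload.split("|")
    remaining_dirs = dirs.split(",") if dirs else []
    return metadata_prefix, date_from, date_until, remaining_dirs
```

On each request the server would decode the token, export records from the
next directory (however many happen to be there at that moment), and issue a
new token listing the directories still remaining -- which is why the chunk
size can differ between two requests carrying the same token.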