[OAI-implementers] protocol comments, OAI 2.0

Simeon Warner simeon@cs.cornell.edu
Wed, 30 Jan 2002 14:08:26 -0500 (EST)


Walter,

Thanks for your comments. 

On Fri, 25 Jan 2002, Walter Underwood wrote:
> Themes: use SOAP, make it stateless, make it simpler.
> 
> SOAP is being implemented everywhere. Even AppleScript can make SOAP
> calls. The last time I saw this many vendors use one standard was
> Ethernet. So replace the custom XML protocol with SOAP.

The technical committee agreed that the protocol should be more decoupled
from HTTP, but we didn't feel that SOAP is the correct option at the
moment. It is likely that some people will experiment with OAI v2 over
SOAP.
 
> A stateless protocol can be cached outside of the server,
> so that is a very valuable thing to do. The current spec
> has a defensive tone about load. That should not be necessary.
> A few modifications will make the protocols stateless and
> cache-friendly.
> 
> ListIdentifiers and ListRecords are stateful now. They should be
> replaced with a "paged list" model, where the client requests a
> starting element number and a number of results. This is an exact
> match for the normal web results interface. It is also the
> approach used in the LDAP virtual list control.

Load is a concern for some implementers. For example, arXiv (the
repository I work with) would not want to give clients the opportunity to
ask for all 185,000 metadata records in one response.
 
> ListIdentifiers/Records also needs to be very clear about the
> contents of successive calls for different parts of the
> list. Consider a rapidly changing repository, like a newswire.
> The results for the list may change between calls. The list
> is not part of some transaction, where the contents don't change
> for the duration of the session. If a client really needs a
> consistent list, it can ask for the whole thing. Each request
> for a portion of the list is independent and can be cached.

OAI is, implicitly, not aimed at rapidly changing repositories, and this
has influenced the design of v1.1 and continues to influence the design of
v2. A low barrier to adoption is considered very important, and the use
of an opaque resumptionToken gives implementers great flexibility.
For example, for arXiv the only state involved in a set of list requests
is in the resumptionToken (nothing is stored on the repository). In other
implementations the result set is cached. Allowing implementers this
flexibility reduces the barrier to adoption by repositories.
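
As a minimal sketch of the stateless approach (not arXiv's actual code),
the whole state of a list request can be packed into the token itself.
The function names, the JSON/base64 encoding and the page size below are
assumptions made for illustration; the harvester simply treats the token
as opaque and hands it back unchanged.

```python
import base64
import json

PAGE_SIZE = 500  # hypothetical per-response limit, not a protocol value


def encode_token(state):
    """Pack the request state into an opaque, URL-safe string."""
    raw = json.dumps(state, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_token(token):
    """Recover the request state from a token made by encode_token."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    return json.loads(raw.decode("utf-8"))


def list_identifiers(all_ids, token=None, page_size=PAGE_SIZE):
    """Return (page, next_token); nothing is stored between calls."""
    state = decode_token(token) if token else {"offset": 0}
    offset = state["offset"]
    page = all_ids[offset:offset + page_size]
    next_offset = offset + len(page)
    # No token on the last page tells the harvester the list is complete.
    if next_offset < len(all_ids):
        return page, encode_token({"offset": next_offset})
    return page, None
```

A real token would also have to carry the original request arguments
(date range, set) so that each follow-up request can be answered from
the token alone.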
 
> Datestamps are problematic in protocols, and should not be
> used. Computers insist that you choose some time for that
> day. Is it noon? One minute after midnight? How do you compare
> that to a time on the same day? So don't allow datestamps in
> protocols. Always use timestamps to the second, in UTC. Don't
> allow time zones or less precision.

There has been broad agreement to move to UTC for all date/timestamps.
However, there is support for allowing different granularities (with
well-defined precision extension semantics) to reflect the underlying
granularity of some repositories.
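
One possible reading of "precision extension" can be sketched as
follows. This is an assumption for illustration, not agreed protocol
text: a day-granularity stamp is extended to the first second of that
UTC day when used as a lower bound and to the last second when used as
an upper bound. The function name expand_utc and the two accepted
formats are likewise hypothetical.

```python
from datetime import datetime, timedelta, timezone


def expand_utc(stamp, end_of_range=False):
    """Expand 'YYYY-MM-DD' or 'YYYY-MM-DDThh:mm:ssZ' to a UTC datetime."""
    if len(stamp) == 10:  # day granularity
        day = datetime.strptime(stamp, "%Y-%m-%d").replace(tzinfo=timezone.utc)
        if end_of_range:
            # Assumed rule: an upper bound covers the whole day.
            return day + timedelta(days=1) - timedelta(seconds=1)
        # Assumed rule: a lower bound starts at midnight UTC.
        return day
    # Seconds granularity: already fully specified, always in UTC.
    full = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return full.replace(tzinfo=timezone.utc)
```

With a rule like this, a repository that only tracks days and a
harvester that works in seconds can still compare stamps unambiguously.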

> Of course, the metadata may contain date-only times.
> 
> Since deleted record items are not reliable, they are not all
> that useful. After the robot is burned the first time, the
> implementors will switch to polling the entire repository.
> They are sort of useful as hints, but I can see serious
> problems in some uses. In the newswire example, it is common
> to have a limited time right to the news articles, perhaps
> two weeks. So a reliable list of deleted items would grow
> without bound. Not good.

Clearly such a repository would maintain records of deleted items only for
a fixed time. That would still be useful provided the harvesting interval
is much smaller than the expiry time. Without deleted records, one is
forced to poll in all cases.
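
The condition can be written down as a small check on the harvester
side. This is only a sketch: the function names are hypothetical, and
the harvester is assumed to know the repository's retention period for
deleted records (two weeks in the newswire example).

```python
from datetime import datetime, timedelta, timezone


def can_harvest_incrementally(last_harvest, now, deleted_retention):
    """True if no deletion notice can have expired since the last harvest."""
    return (now - last_harvest) < deleted_retention


def harvest_plan(last_harvest, now, deleted_retention):
    """Pick 'incremental' when deletions are still visible, else 'full'."""
    if last_harvest is None:
        return "full"  # first contact: nothing to update from
    if can_harvest_incrementally(last_harvest, now, deleted_retention):
        return "incremental"
    # Some deletion notices may already be gone, so poll everything.
    return "full"
```

A harvester that visits daily against a two-week retention period stays
on the incremental path; one that has been away for three weeks falls
back to a full poll.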

> Robots can use something like the HTTP if-modified-since request.
> This would be a parameter for GetRecord. If the record has
> been modified since the timestamp, return it. Otherwise, return
> the info that it has not been modified. Implementors should try
> and make this request fast.

ListRecords provides a better way to say if-modified-since for any set or
the whole repository. I think that in many implementations the cost of
implementing a not-modified response to if-modified-since would be only
marginally less than returning the metadata.
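
To illustrate, a harvester can get the if-modified-since effect for a
set or a whole repository with a single ListRecords request carrying
the time of its last harvest in the "from" argument. The sketch below
only builds the request URL; the base URL is a placeholder and the
helper name is hypothetical.

```python
from urllib.parse import urlencode


def list_records_url(base_url, last_harvest, metadata_prefix="oai_dc",
                     set_spec=None):
    """Build a ListRecords request for records changed since last_harvest."""
    params = {
        "verb": "ListRecords",
        "from": last_harvest,  # acts like if-modified-since for the list
        "metadataPrefix": metadata_prefix,
    }
    if set_spec is not None:
        params["set"] = set_spec  # restrict the request to one set
    return base_url + "?" + urlencode(params)


# Placeholder repository address, for illustration only.
print(list_records_url("http://repository.example.org/oai",
                       "2002-01-25", set_spec="cs"))
```

Leaving out set_spec gives the same request for the whole repository.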
 
> I have not looked at the interaction between SOAP and HTTP caches.
> It is possible that they can be cached. If so, take some extra care.
> A GET can be satisfied by a cache, but a POST cannot. Properly
> setting the HTTP headers on responses means that a server can rely
> on an external HTTP cache rather than managing an internal cache.
> This is a big win.
> 
> The "Set" concept is optional and I don't see much motivation for it.
> I don't see a request which allows multiple sets, so there is no
> reason to have both sets and repositories. Except for listing them
> (ListSets), which is a limited sort of directory service.

It is not possible to get away from the concept of repositories (usually
different servers run by different people). Some people want to use sets
and they can; others can ignore them and there is no overhead in that
case. One historical example is that arXiv is partitioned into four sets
(cs, math, physics, nlin) and the NCSTRL digital library wanted to harvest
only the cs set. The Dienst equivalent of OAI-PMH's sets facilitated this.  
Note that the use of sets is not limited to subject partitioning.

> In SOAP-land, directory service is the job of UDDI, and it should
> not be re-implemented differently in a protocol. So remove Sets.
> A single SOAP server can still be registered for multiple repositories,
> or corpora, or collections, or whatever they are called.

Sets provide for a simple, arbitrarily defined partitioning of a
repository. There is interest in defining community-specific set
organizations to aid in aggregation within certain communities. This
doesn't seem to be the same as the purpose of UDDI (which could still
prove a sensible model for global registration of OAI services).

> I have a few minor nits, like there is no reason to outlaw XML
> features which are required in all parsers, like UTF-16 or decimal
> entities. But switching to SOAP makes lexical issues moot.
> 
> Overall, the protocol is good. Using URIs, Dublin Core, HTTP/XML,
> all those things are exactly right. The spec is clear and readable.
> And I'm delighted that it doesn't use RDF.

Feelings about RDF seem so frequently to be almost religious! 

Cheers,
Simeon.

[My comments are influenced by the discussions of the OAI technical
committee but I am speaking only for myself]

 
> wunder
> --
> Walter R. Underwood
> Senior Staff Engineer
> Inktomi Enterprise Search
> http://search.inktomi.com/