[OAI-implementers] custom error reporting, records in which sets.

Simeon Warner simeon@cs.cornell.edu
Tue, 23 Jul 2002 11:08:11 -0400 (EDT)


On Tue, 23 Jul 2002, Jozef Kruger wrote:
> Hello Tim and others,
> 
> > > [set membership in headers]
> > > So my question is, what is the precise use of this new "feature" in 
> > > OAI (of reporting all sets a record occurs in)?
> > 
> > This allows the set hierarchy to be more easily 
> > mirrored/federated. In OAI 1.x the harvester has to query 
> > each set to recreate the set hierarchy.
>
> Yes, which results in a minimum number of records that have to be
> fetched, when you would use the setSpec in the reply to GetRecord, you
> would have to fetch every single record, instead of just all the records
> that appear in a set. I don't see how this extra functionality could
> enhance the mirroring of sets. Besides, I would think that you would
> like to define sets without having to change the records that go into
> those sets. Presuming you don't want to change the records, you would
> always have the same problem as I do, no matter how you represent your
> sets.

The v2 specification permits complete harvesting of all set information
with just one ListRecords request (with no setSpec). Each record is
harvested precisely once, and its header contains all set information. This is
efficient.
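
As a rough sketch of what I mean (Python, with a made-up fragment of a
ListRecords response; the identifiers and set names are hypothetical and
the metadata parts are left out), the complete set membership can be
rebuilt from the headers alone:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Namespace of OAI-PMH v2 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Made-up fragment of a ListRecords response (request had no setSpec).
# Identifiers and set names are hypothetical; metadata is omitted.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2002-07-01</datestamp>
        <setSpec>cs</setSpec>
        <setSpec>math</setSpec>
      </header>
    </record>
    <record>
      <header>
        <identifier>oai:example.org:2</identifier>
        <datestamp>2002-07-02</datestamp>
        <setSpec>cs</setSpec>
      </header>
    </record>
    <record>
      <header>
        <identifier>oai:example.org:3</identifier>
        <datestamp>2002-07-03</datestamp>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>"""


def set_membership(response_xml):
    """Map each setSpec to the identifiers of the records listed in it."""
    root = ET.fromstring(response_xml)
    members = defaultdict(list)
    for header in root.iterfind(".//oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        # A header may carry zero, one or several setSpec elements.
        for set_spec in header.iterfind("oai:setSpec", OAI_NS):
            members[set_spec.text].append(identifier)
    return dict(members)


membership = set_membership(SAMPLE_RESPONSE)
```

Each record is read once, and a record in no set (the third one here)
simply contributes nothing to the map.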
 
A change in the set membership of a record is a change in the record. The
datestamp should be changed.
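
To illustrate why this matters for incremental harvesting, here is a
minimal sketch of a repository-side update, assuming a toy in-memory
record store (the Record class and both function names are hypothetical,
not taken from any real implementation):

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Record:
    # Hypothetical in-memory stand-in for a repository record.
    identifier: str
    datestamp: date
    sets: set = field(default_factory=set)


def change_sets(record, new_sets, today):
    """Replace the record's sets; bump the datestamp if they changed."""
    new_sets = set(new_sets)
    if new_sets != record.sets:
        record.sets = new_sets
        record.datestamp = today


def modified_since(records, from_date):
    """Identifiers a harvester would see with from=from_date (inclusive)."""
    return [r.identifier for r in records if r.datestamp >= from_date]


records = [
    Record("oai:example.org:1", date(2002, 7, 1), {"cs"}),
    Record("oai:example.org:2", date(2002, 7, 1), {"cs"}),
]
# Record 1 is added to a second set on 23 July.
change_sets(records[0], {"cs", "math"}, date(2002, 7, 23))
changed = modified_since(records, date(2002, 7, 20))
```

Without the datestamp change, a harvester asking only for records
modified since its last visit would never learn of the new membership.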
 
> I'd think that using the ListSets followed by a ListRecords on every set
> would be the nicest and most efficient way to (for example) mirror a
> site.

If each record were in an average of n sets then this would result in
harvesting each record (n+1) times (once for each set plus once with
no setSpec). That does not sound efficient to me.
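
The arithmetic can be checked with a small sketch (Python; the
identifiers and set names below are invented purely for illustration):

```python
def transfers_single_pass(record_sets):
    """One ListRecords with no setSpec: every record is sent once."""
    return len(record_sets)


def transfers_per_set(record_sets):
    """One ListRecords per set, plus one pass with no setSpec."""
    within_sets = sum(len(sets) for sets in record_sets.values())
    return within_sets + len(record_sets)


# Hypothetical repository: 4 records, 6 set memberships, so n = 1.5.
record_sets = {
    "oai:example.org:1": {"cs", "math"},
    "oai:example.org:2": {"cs"},
    "oai:example.org:3": {"cs", "physics"},
    "oai:example.org:4": {"math"},
}

single = transfers_single_pass(record_sets)
per_set = transfers_per_set(record_sets)
```

Here the single pass moves 4 records while the per-set approach moves
10, which is (n+1) times as many with n = 1.5.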

> Also, if a set represents a meaningful relation between the records in
> it, it would be logical to first query for the sets and then to get the
> records from a set you're interested in. The other way around doesn't
> make any sense, since you can't search for records on anything but their
> Identifier.

Bear in mind that sets are designed to permit selective harvesting and the
semantics of sets is not defined within the protocol (use of sets will
likely require direct communication between DP and SP, or community
agreement). The exception to this is mirroring where a mirror might
aim to preserve all set membership information without understanding it.

Unless a repository has some particular application that requires part of
its contents to be harvested then I think it should not, in general,
create an arbitrary collection of sets representing a subject
classification or such. The example I know best is arXiv.org which is
divided into different areas. Historically, NCSTRL harvested only the 'cs'
(Computer Science) portion of arXiv -- sets provide a way to support this
functionality.

Remember that sets are optional! If they are difficult to implement and
there is no well defined need then I suggest that repositories do not
implement them. For arXiv I think we have a need and I have gone to the
trouble of building and maintaining an extra index similar to the one Tim
described.

Cheers,
Simeon.


> > Thanks for your reaction, awaiting more :)
> 
> Cheers,
> Jozef Kruger