[OAI-implementers] Sets in OAI-PMH and DSpace

Tansley, Robert robert.tansley@hp.com
Tue, 21 Oct 2003 13:11:51 -0700


> I think sets are going to be absolutely vital -- it'll be interesting,
> however, to see how they develop.  I think that data providers have tended
> to create sets based on their internal structures and needs; it might be
> more productive to understand what service providers need/use sets for.

This is an excellent point.  It seems to me that for sets to be of any reasonable use, there must be some understanding between the harvester and the data provider about the meaning of the sets.  This is very much akin to the metadata interoperability problem.  Perhaps we could find some way of allowing data providers to expose their metadata in terms of standard vocabularies; e.g. a set hierarchy based on media type, or subject categorisation.  A quick and practical way to do this would be to check for the existence of certain top-level sets with specific IDs.  For example, if a data provider supported LCSH sets, it could have a top-level set with the setSpec 'LCSH', and harvesters could look for this and react according to whether or not it exists.  Or perhaps a future version of OAI-PMH could support 'set formats' in a similar manner to metadata formats.
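
A minimal sketch of the harvester-side check described above, in Python.  It assumes the harvester has already fetched a ListSets response as a string; the sample response, the set names and the function names are hypothetical, and treating the first colon-separated segment of a setSpec as its top-level set is an assumption about how a provider would lay out such a hierarchy.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical ListSets response, trimmed to the parts used here.
SAMPLE_LIST_SETS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>LCSH</setSpec><setName>Library of Congress Subject Headings</setName></set>
    <set><setSpec>LCSH:Metadata</setSpec><setName>Metadata</setName></set>
    <set><setSpec>community-7</setSpec><setName>Physics Department</setName></set>
  </ListSets>
</OAI-PMH>"""


def top_level_sets(list_sets_xml):
    """Return the top-level setSpec values found in a ListSets response."""
    root = ET.fromstring(list_sets_xml)
    found = set()
    for element in root.iterfind("./oai:ListSets/oai:set/oai:setSpec", OAI_NS):
        spec = (element.text or "").strip()
        if spec:
            # Hierarchical setSpecs are colon-separated; keep the first segment.
            found.add(spec.split(":", 1)[0])
    return found


def supports_vocabulary(list_sets_xml, vocabulary="LCSH"):
    """True if the provider exposes a top-level set with the agreed ID."""
    return vocabulary in top_level_sets(list_sets_xml)


if __name__ == "__main__":
    print(supports_vocabulary(SAMPLE_LIST_SETS, "LCSH"))
    print(supports_vocabulary(SAMPLE_LIST_SETS, "DDC"))
```

A harvester could run this once per repository and fall back to harvesting everything when the agreed top-level set is absent.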

This does all seem to be pushing work to the data provider; I could see that this may lead to sets being used effectively as dynamic queries to repositories.  For example, a data provider might respond to a harvest of an LCSH set using some index search.  It does seem that the smarts for deciding what gets harvested (i.e. what is interesting for the users of the OAI service doing the harvesting) should lie with the OAI service rather than the data provider.
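
To make the 'sets as dynamic queries' idea concrete, here is a rough Python sketch of a data provider resolving a harvested setSpec by searching subject headings rather than consulting a stored set membership.  Everything in it is hypothetical: the Record type, the in-memory list standing in for an index, the sample identifiers, and the assumption that a heading can be used directly as the second segment of a setSpec (real headings containing spaces would need some encoding, since setSpec allows only a restricted character set).

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical repository record with its subject headings."""
    identifier: str
    subjects: list = field(default_factory=list)


# Stand-in for a repository's search index.
RECORDS = [
    Record("oai:example.org:1", ["Metadata", "Libraries"]),
    Record("oai:example.org:2", ["Physics"]),
    Record("oai:example.org:3", []),
]


def records_in_set(set_spec, records, vocabulary="LCSH"):
    """Resolve a setSpec such as 'LCSH:Metadata' as a query over subjects."""
    prefix, _, term = set_spec.partition(":")
    if prefix != vocabulary:
        # Not a set from this vocabulary; nothing to answer here.
        return []
    if not term:
        # The bare top-level set matches any record with a subject heading.
        return [r for r in records if r.subjects]
    wanted = term.casefold()
    return [r for r in records
            if any(s.casefold() == wanted for s in r.subjects)]


if __name__ == "__main__":
    for record in records_in_set("LCSH:Metadata", RECORDS):
        print(record.identifier)
```

Note that membership here is computed at request time, so the set of records a harvester sees can change whenever the underlying subject metadata changes.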

Whether sets are perceived as vital, useful or useless, however, it does seem to me that there needs to be some way for a service provider to programmatically work out whether it understands (at least some of) the set structure of a data provider.  Otherwise sets will only ever be useful for the 'point-to-point' case.  It is a very similar problem to the 'metadata format' problem, except that at the moment we don't have a simple base 'set format' akin to Dublin Core.

Which leads me back to the original point of the thread: what should we do about sets in DSpace?  The above thinking leads me to believe that exposing sets in DSpace is of very limited use, since the structure is probably going to be unique to each instance of DSpace, and harvesters would require very specific knowledge of the individual DSpace instance to be able to make use of those sets.

Discuss ;-)

 Robert Tansley / Hewlett-Packard Laboratories / (+1) 617 551 7624