[OAI-implementers] returning *data* (as opposed to metadata)

Simeon Warner simeon@lanl.gov
Thu, 26 Jul 2001 11:28:45 -0600 (MDT)

On Thu, 26 Jul 2001, Ben Henley wrote:
> >I don't disagree that the protocol *could* be used to harvest data - I
> >just wonder if it *should* be used in that way.  Particularly at this
> >stage in the life of the protocol?
> >
> 	This was my concern,  - I can't see why returning full data would be
> bad, in fact it makes a lot of sense from what Michael and Donna have said,
> but it's obviously not quite the intention of the original protocol. Maybe
> all that means is that the OAI should redefine itself as promoting exchange
> of metadata *and* data ... But I wanted to discuss the implications before
> deciding unilaterally that I would start doing weird things with the
> protocol.

I think you should not hesitate to experiment with full data export in
the way you suggest -- so long as you understand that it is an experiment.
It seems from reactions on this list that people will experiment with
harvesting your XML full-content.

The metadata focus of OAI must remain for the moment, not for any pressing
technical reasons but simply because it is politically much easier to get a
large group of archives to export metadata than full-content.  Without
widespread support, OAI will go nowhere. I'm sure that there will, at some
stage, be an OAI-approved way to allow harvesting of full content but I
doubt that this will be compulsory (as oai_dc metadata is).

>   One reason might be that the data is available in multiple formats.

Yes, the whole issue of content-models (multiple formats, multiple parts,
etc) is avoided when we consider only metadata.

> 	Now we could provide multiple identifier URLs in the oai_dc record
> to allow harvesting that way, I suppose - or is this a valid thing to do? It
> seems to be allowed by the OAI Dublin Core schema:
> 	<element name="identifier"  minOccurs="0" maxOccurs="unbounded"
> type="string"/>
>  but I seem to remember getting the impression from somewhere that you
> should only have one identifier. Could someone clarify this?
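
To make the question concrete, here is a minimal Python sketch of what a
harvester sees when a record carries several identifier elements. The
record layout is simplified and the example.org URLs are invented for
illustration; they are assumptions, not taken from any real archive. The
point is only that the elements come back as an undifferentiated list,
with nothing to say which URL is an abstract page and which is a file.

```python
import xml.etree.ElementTree as ET

# Dublin Core element set namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical record: simplified layout and made-up URLs, for illustration only.
RECORD = """
<dc xmlns="http://purl.org/dc/elements/1.1/">
  <title>An example paper</title>
  <identifier>http://example.org/abs/0001</identifier>
  <identifier>http://example.org/ps/0001</identifier>
  <identifier>http://example.org/pdf/0001</identifier>
</dc>
"""


def identifiers(record_xml):
    """Return the text of every identifier element, in document order."""
    root = ET.fromstring(record_xml.strip())
    return [el.text for el in root.findall(f"{{{DC_NS}}}identifier")]


ids = identifiers(RECORD)
for url in ids:
    # Nothing in the record tells the harvester what each URL points to.
    print(url)
```

Any guess about format (for example from a "/pdf/" path component) would
be a private convention of the archive, not something the record states.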

It is my understanding that having multiple identifier elements is
fine. The trouble is that there is no well-defined way to say what
they mean. Note that by being an `OAI compliant archive', a data-provider
says that automated harvesting of metadata is permitted. It is not
implied that automated harvesting of full-content from URLs
specified in the identifier elements is permitted. It helps
if this is spelled out in the Identify response, e.g. in the e-prints
description section for arXiv:

     <text>Metadata harvesting permitted through OAI interface</text>
     <text>Full-content harvesting not permitted (except by special