[OAI-implementers] returning *data* (as opposed to metadata)

Francois Schiettecatte francois@fsconsult.com
Thu, 26 Jul 2001 13:21:11 -0400


My first post to this list :)

On 7/26/01 1:08 PM, "Ben Henley" <ben@biomedcentral.com> wrote:

>> Message: 4
>> Date: Wed, 25 Jul 2001 23:09:23 +0100 (BST)
>> From: Andy Powell <a.powell@ukoln.ac.uk>
>> To: herbert van de sompel <herbertv@cs.cornell.edu>
>> cc: oai-implementers@oaisrv.nsdl.cornell.edu
>> Subject: Re: [OAI-implementers] returning *data* (as opposed to metadata)
>> I don't disagree that the protocol *could* be used to harvest data - I
>> just wonder if it *should* be used in that way.  Particularly at this
>> stage in the life of the protocol?
> This was my concern,  - I can't see why returning full data would be
> bad, in fact it makes a lot of sense from what Michael and Donna have said,
> but it's obviously not quite the intention of the original protocol. Maybe
> all that means is that the OAI should redefine itself as promoting exchange
> of metadata *and* data ... But I wanted to discuss the implications before
> deciding unilaterally that I would start doing weird things with the
> protocol.

I have a full text search and retrieval background, so my comments should be
heard in that context. I think it would be a good idea to include full text
for indexing purposes: the more text, the better the precision and recall
when performing a full text search. Of course this should be balanced
against network bandwidth at the source. I think it should be made clear that
this text should be used for indexing only and not for display.
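
As a rough sketch of the indexing-only idea, assuming a harvester that has
already fetched records into plain dicts: the full text feeds the inverted
index, but only metadata is stored for display. The record keys
('identifier', 'title', 'fulltext') and the function names are hypothetical
and are not part of the OAI protocol.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def build_index(records):
    """Return (inverted_index, display_store) for harvested records.

    Each record is a dict with the hypothetical keys
    'identifier', 'title' and (optionally) 'fulltext'.
    """
    index = defaultdict(set)
    display = {}
    for rec in records:
        ident = rec["identifier"]
        # Only metadata is kept for display; the full text is never stored.
        display[ident] = {"title": rec["title"]}
        text = rec["title"] + " " + rec.get("fulltext", "")
        for token in TOKEN.findall(text.lower()):
            index[token].add(ident)
    return index, display


def search(index, display, query):
    """Return (identifier, title) pairs matching every query term."""
    tokens = TOKEN.findall(query.lower())
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return [(ident, display[ident]["title"]) for ident in sorted(hits)]
```

A search then matches on words that appear only in the full text, while the
result list shows nothing beyond the title and the identifier.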

>> Can someone clarify the differences/advantages of harvesting data directly
>> using OAI vs. harvesting metadata using OAI followed by harvesting data
>> using HTTP based on the URL in the metadata?
> One reason might be that the data is available in multiple formats.
> In our case, the URL used as an identifier is a link to an HTML article
> which is rendered from XML. This version looks a lot better to humans and
> the URL is, we think, the appropriate identifier for the article, but
> obviously the HTML wouldn't be so suitable for processing as the XML
> version. We also have PDFs.
> Now we could provide multiple identifier URLs in the oai_dc record
> to allow harvesting that way, I suppose - or is this a valid thing to do? It
> seems to be allowed by the OAI Dublin Core schema:
> <element name="identifier"  minOccurs="0" maxOccurs="unbounded"
> type="string"/>
> but I seem to remember getting the impression from somewhere that you
> should only have one identifier. Could someone clarify this?

I think you should only have one identifier for a document. Document format
and identifier are very separate things. Multiple identifiers for a document
are going to cause confusion, and separate identifiers for different formats
of a document are going to cause even more confusion.
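
To illustrate the confusion, here is a minimal sketch of what a harvester
sees when a Dublin Core record carries several identifier elements. The
sample record and its example.org URLs are made up for illustration, and the
code assumes the elements sit in the Dublin Core 1.1 namespace.

```python
import xml.etree.ElementTree as ET

# Dublin Core element set namespace (assumed for the sample record below).
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical record: one article exposed as HTML, XML and PDF.
SAMPLE = """<dc xmlns="http://purl.org/dc/elements/1.1/">
  <title>An example article</title>
  <identifier>http://www.example.org/articles/1</identifier>
  <identifier>http://www.example.org/articles/1.xml</identifier>
  <identifier>http://www.example.org/articles/1.pdf</identifier>
</dc>"""


def identifiers(record_xml):
    """Return every dc:identifier value in the record, in document order."""
    root = ET.fromstring(record_xml)
    tag = "{%s}identifier" % DC_NS
    return [el.text.strip() for el in root.iter(tag) if el.text]


ids = identifiers(SAMPLE)
# Nothing in the record says which of these names the document itself
# and which merely point at alternative formats.
print(len(ids), "identifiers found:", ids)
```

The harvester gets back a flat list of three strings and has to guess, for
instance from file extensions, which one identifies the document.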


Francois Schiettecatte                               FS Consulting, Inc.
Phone : (410) 625-2080              326 North Charles Street, Suite 300,
Fax   : (410) 625-2081                              Baltimore, MD, 21201
Email : francois@fsconsult.com           URL : http://www.fsconsult.com/