[OAI-implementers] returning *data* (as opposed to metadata)
Thu, 26 Jul 2001 13:21:11 -0400
My first post to this list :)
On 7/26/01 1:08 PM, "Ben Henley" <firstname.lastname@example.org> wrote:
>> Message: 4
>> Date: Wed, 25 Jul 2001 23:09:23 +0100 (BST)
>> From: Andy Powell <email@example.com>
>> To: herbert van de sompel <firstname.lastname@example.org>
>> cc: email@example.com
>> Subject: Re: [OAI-implementers] returning *data* (as opposed to metadata)
>> I don't disagree that the protocol *could* be used to harvest data - I
>> just wonder if it *should* be used in that way. Particularly at this
>> stage in the life of the protocol?
> This was my concern, - I can't see why returning full data would be
> bad, in fact it makes a lot of sense from what Michael and Donna have said,
> but it's obviously not quite the intention of the original protocol. Maybe
> all that means is that the OAI should redefine itself as promoting exchange
> of metadata *and* data ... But I wanted to discuss the implications before
> deciding unilaterally that I would start doing weird things with the
I have a full text search and retrieval background, so my comments should be
heard in that context. I think it would be a good idea to include full text
for indexing purposes: the more text, the better the precision and recall
when performing a full text search. Of course this should be balanced
against network bandwidth at the source. I think it should be made clear that
this text should be used for indexing only and not for display.
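
To make the "indexing only, not for display" idea concrete, here is a minimal
sketch of how a harvesting service might treat harvested full text. Everything
in it (the FullTextIndex class, its method names, the example.org URLs) is
hypothetical and only for illustration; it is not part of the OAI protocol or
of any particular search engine.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class FullTextIndex:
    """Hypothetical index: full text feeds the postings, never the display."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of record identifiers
        self.display = {}                 # identifier -> title (metadata only)

    def add(self, identifier, title, full_text):
        # Only the metadata is stored for display purposes.
        self.display[identifier] = title
        # The full text is tokenized into the postings and then discarded.
        for term in tokenize(title + " " + full_text):
            self.postings[term].add(identifier)

    def search(self, query):
        """Return (identifier, title) pairs matching every query term."""
        terms = tokenize(query)
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted((i, self.display[i]) for i in hits)


index = FullTextIndex()
index.add("http://example.org/a1", "Harvesting basics",
          "The protocol exposes metadata records to service providers.")
index.add("http://example.org/a2", "Search engines",
          "Precision and recall improve with more indexed text.")
print(index.search("precision recall"))
```

The second record is found through terms that appear only in its body, yet
the result carries just the identifier and title, so the user is still sent
back to the source for the document itself.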
>> Can someone clarify the differences/advantages of harvesting data directly
>> using OAI vs. harvesting metadata using OAI followed by harvesting data
>> using HTTP based on the URL in the metadata?
> One reason might be that the data is available in multiple formats.
> In our case, the URL used as an identifier is a link to an HTML article
> which is rendered from XML. This version looks a lot better to humans and
> the URL is, we think, the appropriate identifier for the article, but
> obviously the HTML wouldn't be so suitable for processing as the XML
> version. We also have PDFs.
> Now we could provide multiple identifier URLs in the oai_dc record
> to allow harvesting that way, I suppose - or is this a valid thing to do? It
> seems to be allowed by the OAI Dublin Core schema:
> <element name="identifier" minOccurs="0" maxOccurs="unbounded"
> but I seem to remember getting the impression from somewhere that you
> should only have one identifier. Could someone clarify this?
I think you should only have one identifier for a document. Document format
and identifier are very separate things. Multiple identifiers for a document
are going to cause confusion, and separate identifiers for different formats
of a document are going to cause even more confusion.
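
As a rough illustration of that confusion, here is a sketch of what a
harvester faces when a Dublin Core record carries one identifier per format.
The sample record, the example.org URLs and the helper names are made up for
this example, and the oai_dc wrapper namespace shown is the one used by later
versions of the protocol, so treat it as an assumption.

```python
import xml.etree.ElementTree as ET

# Dublin Core element namespace, in ElementTree's {uri}tag notation.
DC = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical record with one identifier per available format.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example article</dc:title>
  <dc:identifier>http://example.org/articles/1.html</dc:identifier>
  <dc:identifier>http://example.org/articles/1.xml</dc:identifier>
  <dc:identifier>http://example.org/articles/1.pdf</dc:identifier>
</oai_dc:dc>"""


def identifiers(record_xml):
    """Return every dc:identifier value in the record, in document order."""
    root = ET.fromstring(record_xml)
    return [el.text.strip() for el in root.iter(DC + "identifier") if el.text]


def guess_by_extension(urls):
    """Group URLs by file extension; a fragile guess, not a real format test."""
    groups = {}
    for url in urls:
        last_segment = url.rsplit("/", 1)[-1]
        ext = last_segment.rsplit(".", 1)[-1].lower() if "." in last_segment else ""
        groups.setdefault(ext, []).append(url)
    return groups


ids = identifiers(SAMPLE)
print(len(ids), "identifiers for one document")
print(guess_by_extension(ids))
```

Nothing in the record says which of the three URLs identifies the document
and which merely locates a rendition of it, so the harvester is left guessing
from file extensions.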
Francois Schiettecatte FS Consulting, Inc.
Phone : (410) 625-2080 326 North Charles Street, Suite 300,
Fax : (410) 625-2081 Baltimore, MD, 21201
Email : firstname.lastname@example.org URL : http://www.fsconsult.com/