[OAI-implementers] Re: Reconsidering mandatory DC in OAI-PMH

Andy Powell a.powell@ukoln.ac.uk
Mon, 11 Aug 2003 11:23:56 +0100 (GMT Daylight Time)


On Mon, 11 Aug 2003, Thomas Krichel wrote:

>   Andy Powell writes
>
> > Err... services based on minimal metadata
>
>   These are not likely to be used or useful. I rather have
>   Google index my web pages that render the full metadata
>   for humans. BTW, Google have taken the business
>   of resource discovery for them. We better get used to it.

Sure... I don't disagree.  But there's some functionality that Google
doesn't currently offer, e.g. an author search.  I cannot do a Google
search for everything authored by you, for example.  I also can't do a
Google search for resources of a particular type - e.g. 'research papers'.
Similarly, it is not easy to search Google for resources about a
particular geographic area - try searching Google for resources about
'Bath', for example.

An OAI-based service provider could offer such a service based on minimal
metadata (dc:creator, dc:title, dc:identifier, dc:description,
dc:coverage, dc:type) plus an index of the full-text (where it is
available).  (It'd be even better if Google did this themselves!)
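
As a rough illustration of the kind of indexing meant here, the sketch
below pulls the six elements listed above out of a single oai_dc record.
The sample record, its values and the function name index_fields are
made up for illustration; only the two namespace URIs and the element
names are real.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set used inside oai_dc records.
DC_NS = "http://purl.org/dc/elements/1.1/"

# The minimal set of elements mentioned in the text above.
FIELDS = ("creator", "title", "identifier", "description", "coverage", "type")

# Hypothetical record, invented purely for this example.
SAMPLE_RECORD = """\
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example eprint</dc:title>
  <dc:creator>Example, A.</dc:creator>
  <dc:type>research paper</dc:type>
  <dc:coverage>Bath</dc:coverage>
  <dc:identifier>http://archive.example.org/123/</dc:identifier>
</oai_dc:dc>
"""


def index_fields(record_xml):
    """Return {element name: [values]} for the minimal DC elements."""
    root = ET.fromstring(record_xml)
    index = {}
    for name in FIELDS:
        # DC elements are repeatable, so every field maps to a list.
        index[name] = [
            el.text.strip()
            for el in root.iter(f"{{{DC_NS}}}{name}")
            if el.text and el.text.strip()
        ]
    return index


fields = index_fields(SAMPLE_RECORD)
```

An author search is then a lookup on fields["creator"], a type search on
fields["type"], and so on.  Elements the provider left out (here
dc:description) simply come back as empty lists.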

Now, to come back to the original point... there will be some classes of
resource for which oai_dc is not appropriate (e.g. people and
organisations) in which case one can only return a very minimal oai_dc
record (e.g. only dc:identifier).  In these cases, the service provider
will have to 'know' to ask for an alternative format if it wants to offer
a 'rich' service.  But this is also the case if oai_dc is made optional.
So I don't understand why making oai_dc optional buys us anything.

In short, making oai_dc optional loses us a lot of low-barrier
interoperability in the general case (where DC is appropriate) and doesn't
gain us anything in those more limited cases where DC is inappropriate.
So why do it?
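
To make the argument concrete, here is a minimal sketch of how a service
provider might pick a metadataPrefix.  It assumes the provider has
already obtained the repository's supported prefixes (e.g. from a
ListMetadataFormats response).  The prefix 'person_xml', the base URL
and both function names are hypothetical; 'verb' and 'metadataPrefix'
are the real OAI-PMH request arguments.

```python
from urllib.parse import urlencode


def choose_prefix(supported, preferred):
    """Pick the first preferred rich format the repository supports.

    Falls back to oai_dc when none of the preferred formats is offered,
    and returns None if even oai_dc is missing.
    """
    for prefix in preferred:
        if prefix in supported:
            return prefix
    return "oai_dc" if "oai_dc" in supported else None


def list_records_url(base_url, prefix):
    """Build a ListRecords request URL for the chosen metadata format."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": prefix})
    return base_url + "?" + query


# 'person_xml' is an invented prefix standing in for some richer format.
rich = choose_prefix({"oai_dc", "person_xml"}, ["person_xml"])
basic = choose_prefix({"oai_dc"}, ["person_xml"])
url = list_records_url("http://archive.example.org/oai", basic)
```

Note that the list of preferred formats has to be supplied by the
service provider in either case.  The only thing that changes if oai_dc
becomes optional is that the fallback branch can return None.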

> > plus an index of full-text (where it is available for harvesting at
> > the URI provided in dc:identifier).
>
>   dc:identifier is the identifier of the item, not of the
>   full text, is it not? Say I have an academic paper,
>   dc:identifier should be the id of the paper, not the
>   url of the full-text?

dc:identifier is the identifier of the 'resource'.  If by 'item' you mean
OAI item then the OAI identifier (the identifier in the record header) is
the identifier of the item.

The identifier of the paper may or may not be resolvable to the full-text.
I would hope that in general it is.  I think it would be very unhelpful if
it became the norm for providers to put an identifier into dc:identifier
that didn't resolve to the full-text.
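
The distinction between the two identifiers can be seen in a single
harvested record.  The sketch below is illustrative only: the record and
its values are invented, and the function name identifiers is
hypothetical.  The namespace URIs are the real OAI-PMH 2.0 and Dublin
Core ones.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical record, invented for this example.
SAMPLE = """\
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:archive.example.org:123</identifier>
    <datestamp>2003-08-11</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:identifier>http://archive.example.org/123/</dc:identifier>
    </oai_dc:dc>
  </metadata>
</record>
"""


def identifiers(record_xml):
    """Return (item identifier, list of resource identifiers)."""
    root = ET.fromstring(record_xml)
    # The header identifier names the OAI item.
    item_id = root.findtext("oai:header/oai:identifier", namespaces=NS)
    # dc:identifier names the resource itself and may repeat.
    resource_ids = [el.text for el in root.iterfind(".//dc:identifier", NS)]
    return item_id, resource_ids


item_id, resource_ids = identifiers(SAMPLE)
```

Only the values in resource_ids are candidates for fetching the
full-text; item_id is what one would pass back to the repository in a
GetRecord request.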

There is a slight gotcha here, however.  In

Using simple Dublin Core to describe eprints
http://www.rdn.ac.uk/projects/eprints-uk/docs/simpledc-guidelines/

we recommend that dc:identifier be

 A URI or bibliographic citation for the eprint, typically the URI of the
 'jump-off page' for the eprint, as served by the archive.

There is therefore a slight hiccup for any service provider that wishes to
harvest the full-text as well as the metadata - because their robot may
have to negotiate a jump-off page.  This is slightly unfortunate, but it
represents current practice by both eprints.org and DSpace (as far as I
know).
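
For what it's worth, negotiating a jump-off page could be sketched
roughly as follows.  This is a heuristic only, and it rests on an
assumption that is not guaranteed by any archive software: that the
jump-off page links to the full-text with a recognisable file suffix.
The suffix list, the sample page and the names LinkCollector and
fulltext_candidates are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed suffixes for full-text files; a real robot would need more care.
FULLTEXT_SUFFIXES = (".pdf", ".ps", ".ps.gz")


class LinkCollector(HTMLParser):
    """Collect the href of every <a> element on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fulltext_candidates(page_url, html):
    """Return absolute URLs on the jump-off page that look like full-text."""
    parser = LinkCollector()
    parser.feed(html)
    found = []
    for href in parser.hrefs:
        # Resolve relative links against the jump-off page's own URI.
        absolute = urljoin(page_url, href)
        if urlparse(absolute).path.lower().endswith(FULLTEXT_SUFFIXES):
            found.append(absolute)
    return found


# Hypothetical jump-off page, invented for this example.
page = (
    '<html><body><a href="/about.html">About</a> '
    '<a href="paper.pdf">Full text</a></body></html>'
)
candidates = fulltext_candidates("http://archive.example.org/123/", page)
```

Here page_url would be the value taken from dc:identifier.  A page with
no matching links yields an empty list, in which case the robot has
nothing to fetch beyond the metadata.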

Andy
--
Distributed Systems, UKOLN, University of Bath, Bath, BA2 7AY, UK
http://www.ukoln.ac.uk/ukoln/staff/a.powell       +44 1225 383933
Resource Discovery Network http://www.rdn.ac.uk/