[OAI-implementers] Re: [Dspace-tech] Google Scholar and OAI (fwd)

Thu Feb 3 04:59:10 EST 2005

I'm taking the liberty of copying a message to the dspace-tech list here,
since I propose the use of the HTML <link> tag and I know that my previous
suggestions about using this tag on this list have only received a
lukewarm response! :-)

These suggestions stem from the need for Google robots to auto-discover
the OAI BASEURL for any given repository.

So I'd be interested in thoughts on my suggestions below - particularly on
whether we need something like an 'application/oai+xml' MIME type
registration (to say that the XML conforms to one of the responses in the
OAI-PMH).

Andy
--
Distributed Systems, UKOLN, University of Bath, Bath, BA2 7AY, UK
http://www.ukoln.ac.uk/ukoln/staff/a.powell/      +44 1225 383933
Resource Discovery Network http://www.rdn.ac.uk/

---------- Forwarded message ----------
Date: Thu, 3 Feb 2005 09:35:41 +0000 (GMT Standard Time)
From: Andy Powell <a.powell at ukoln.ac.uk>
To: MacKenzie Smith <kenzie at mit.edu>
Cc: dspace-tech at lists.sourceforge.net, dspace-general at mit.edu,
     Peter Brantley <peter.brantley at ucop.edu>
Subject: Re: [Dspace-tech] Google Scholar and OAI

On Wed, 2 Feb 2005, MacKenzie Smith wrote:

> SO, they are interested in evaluating using OAI for this purpose (hooray!)
> but alas, many of you have changed the default OAI baseurl so they can't
> find your OAI server. I know that's true, because for the pilot project we
> did I found 4 different baseurl patterns for 17 DSpace sites... I suggested
> using a registry like the DSpace wiki or OCLC's for this, but they claim
> this will not scale to the level of the gazillions of repositories that
> they hope will exist in the future. They want an approach like robots.txt
> -- predictable place, same for every repository. I think that sounds
> reasonable... don't you?

No, not really!  Clearly the 'well known location' approach (like
robots.txt) works well in many cases, but it is not without problems -
particularly for those people who want to run repositories on servers
where they don't have access to create a file (or whatever) at the chosen
location.

More importantly, I think that any agreements that you (the DSpace
community) reach with Google need to scale not just to all DSpace users,
but to those who choose to offer their repositories using other software
(like eprints.org) and even to those who choose to base their repositories
on more mainstream technologies like content management systems.

So, I don't think that a 'DSpace well known location' is the right
approach.  I offer two alternatives, neither of which is particularly well
thought through - but these ideas might be a useful basis for further
refinement?

1) Embed a <link rel="oaipmh" ...> tag into the <head> section of the
repository home page, linking to the BASEURL of the OAI server for that
repository (note, there is no requirement that the BASEURL is on the same
Web server as the repository itself - this is a completely open linking
mechanism).  E.g.

<link rel="oaipmh" type="application/xml"  title="OAI BASEURL"
href="http://etheses.nottingham.ac.uk/perl/oai2" />

2) Embed a <link rel="meta" ...> tag into the <head> section of each
eprint's 'jump-off' page (sorry, my terminology is probably wrong in the
context of DSpace here), linking to an OAI GetRecord request for the
metadata about the current eprint.  E.g.

<link rel="meta" type="application/xml" title="OAI/DC Metadata"
href="http://etheses.nottingham.ac.uk/perl/oai2?verb=GetRecord&amp;metadataPrefix=oai_dc&amp;identifier=oai%3Aetheses.nottingham.ac.uk.OAI2%3A1"
/>
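
To make the two suggestions a bit more concrete, here is a rough sketch of
how a robot might pick either kind of link out of a page.  It is an
illustration only, not a description of what Google or any other crawler
actually does; the LinkCollector and find_links names are invented for the
example, and it uses nothing beyond the Python standard library.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <link> tag, grouped by rel value.

    Hypothetical helper for illustration; not part of any OAI toolkit.
    """

    def __init__(self):
        super().__init__()
        self.links = {}  # e.g. {"oaipmh": [...], "meta": [...]}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<link ... />) are also routed here by default.
        if tag != "link":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel may hold several space-separated tokens; index under each.
        for rel in (attr.get("rel") or "").lower().split():
            self.links.setdefault(rel, []).append(href)


def find_links(html_text):
    """Return a dict mapping each rel value to the list of hrefs found."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

A robot would then look at find_links(page).get("oaipmh") on a repository
home page (option 1) or find_links(page).get("meta") on a jump-off page
(option 2).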

Option 1 has the advantage of being OAI-specific (but not
DSpace-specific), i.e. the Google robot (or any other robot for that
matter) would know that it would have to deal with the OAI protocol at the
end of a rel="oaipmh" link.
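
For instance, once a robot has a BASEURL from a rel="oaipmh" link, it can
form OAI-PMH requests simply by appending a verb and its arguments as a
query string.  The sketch below assumes that much and nothing more; the
oai_request_url name is made up for the example, and it only builds the
URL (it does not fetch anything).

```python
from urllib.parse import urlencode


def oai_request_url(baseurl, verb, **arguments):
    """Build an OAI-PMH request URL from a BASEURL, a verb and arguments.

    Hypothetical helper for illustration only.
    """
    params = {"verb": verb}
    params.update(arguments)
    return baseurl + "?" + urlencode(params)
```

So oai_request_url(baseurl, "Identify") would give a URL the robot could
fetch to confirm that there really is an OAI server at the other end.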

Option 2 has the advantage of being in line with existing conventions
(e.g. DCMI recommendations) for linking from an HTML page to some metadata
about that page.  However, it would probably benefit from a more specific
MIME type, e.g. application/oai+xml, which doesn't exist yet AFAIK.

Given that the 'jump-off' page is generated dynamically by the system (I
assume) there is no significant overhead with option 2 (in terms of having
to manually embed lots of link tags).

Clearly, both approaches could be used in tandem.  However, given minimal
knowledge about the workings of the OAI protocol, it would be possible for
the Google robots to work out the BASEURL of a repository having seen the
first of the GetRecord URLs embedded in option 2.  Therefore, option 1
might be unnecessary if option 2 is being used.
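
A minimal sketch of that last step, assuming only that the BASEURL is the
GetRecord URL with its query string removed (the function name is invented
for the example):

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit


def baseurl_from_getrecord(url):
    """Return the BASEURL implied by an OAI GetRecord URL, or None.

    Hypothetical helper: it checks that the query carries verb=GetRecord
    and, if so, returns the URL with the query and fragment stripped.
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    if query.get("verb") != ["GetRecord"]:
        return None  # not a GetRecord request, so make no guess
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```

Applied to the GetRecord URL in the option 2 example, this should give
back the same BASEURL that the option 1 example links to directly.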

Thoughts?

Andy