[OAI-implementers] OAI-PMH baseURL discovery

Michael Nelson mln at cs.odu.edu
Sun Feb 13 15:07:22 EST 2005


(this is in response to Andy's mesg:
http://www.openarchives.org/pipermail/oai-implementers/2005-February/001407.html)

Drawing from our experience with mod_oai, we see at least 4 possible
ways for robots to "automatically" discover OAI-PMH baseURLs:

1.  develop a separate file, oaipmh.txt, similar in spirit to robots.txt

2.  add to the existing robots.txt file

3.  use HTML link or META tags for robots

4.  use the <friends> component in the Identify response.

We do not prefer #1 - a separate file for robots to check seems unlikely
to encourage widespread adoption.

We prefer #2 because it injects OAI-PMH into the regular web
mechanics where it belongs.  Robots already look for this file -
why not put OAI-PMH statements where they expect to find guidance?
Similarly, a robots.txt file is easy to install and edit (certainly
easier than installing most repository software packages), so there
will be no additional burden on a repository administrator.

#3 can be used in some cases, but it makes an assumption that every
repository we would like a robot to find has an HTML presence.  #2 and #3
can be used separately since they address separate use cases.

#4 is important and needs to be reinforced as a way of repositories
"pointing" to each other.  You can't bootstrap baseURL discovery via
<friends>, but once a robot knows about a single baseURL, it should be
able to assemble a list of cooperating repositories.  No new functionality
is needed for <friends>, but the robot scenario increases the importance
of its use.
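
To make the <friends> scenario concrete, here is a rough sketch (in
Python) of how a robot that already knows one baseURL might walk the
<friends> containers. It assumes the friends namespace from the OAI
implementation guidelines; the names friends_from_identify, discover and
fetch_identify are made up for this illustration, and fetch_identify
stands in for whatever issues the "?verb=Identify" request.

```python
import xml.etree.ElementTree as ET
from collections import deque

# Namespace of the <friends> description container (per the OAI guidelines).
FRIENDS_NS = "http://www.openarchives.org/OAI/2.0/friends/"


def friends_from_identify(xml_text):
    """Return the baseURLs listed inside <friends> in an Identify response."""
    root = ET.fromstring(xml_text)
    # Only <baseURL> elements in the friends namespace are matched, so the
    # repository's own <baseURL> (OAI-PMH namespace) is not picked up.
    return [
        el.text.strip()
        for el in root.iter("{%s}baseURL" % FRIENDS_NS)
        if el.text and el.text.strip()
    ]


def discover(seed, fetch_identify):
    """Breadth-first walk over <friends> links, starting from one baseURL.

    fetch_identify is a hypothetical callable: baseURL -> Identify XML text.
    """
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        base = queue.popleft()
        order.append(base)
        try:
            friends = friends_from_identify(fetch_identify(base))
        except Exception:
            # Unreachable or malformed repository: keep it in the list,
            # but it contributes no further friends.
            continue
        for friend in friends:
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    return order
```

The seed still has to come from somewhere else (robots.txt, HTML tags,
etc.); the walk only expands a list, it does not bootstrap one.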

robots.txt
----------

The "problem" with robots.txt is that the syntax is very simple and is
focused on telling robots what they can't do and not on what they should
do.  So in addition to having a line such as:

OAIPMHbaseURL: http://cs1.ist.psu.edu/cgi-bin/oai.cgi

we would like to expand the syntax of the "Disallow:" line to include
alternatives:

Disallow: /citations/ OAIPMHbaseURL: http://cs1.ist.psu.edu/cgi-bin/oai.cgi

Here the OAIPMHbaseURL value gives the alternate access for how to get
at the information prohibited in the Disallow.  Depending on how robust
robots are with respect to extended syntax, we could repeat the line
in case the extended line is not understood:

Disallow: /citations/
Disallow: /citations/  OAIPMHbaseURL: http://cs1.ist.psu.edu/cgi-bin/oai.cgi
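
As a sketch of what a robot would have to do with this, the following
Python reads both forms of the proposed syntax. None of this is an
existing robots.txt standard: the OAIPMHbaseURL keyword is only the
proposal above, parse_oai_hints is a made-up name, and the code assumes
the directive and its URL sit on one physical line.

```python
import re

# Proposed (non-standard) keyword followed by a URL without whitespace.
_OAI = re.compile(r"OAIPMHbaseURL:\s*(\S+)", re.IGNORECASE)


def parse_oai_hints(robots_txt):
    """Return (site_wide, alternates) found in a robots.txt body.

    site_wide  -- baseURLs from standalone "OAIPMHbaseURL:" lines
    alternates -- {disallowed path: baseURL} from extended Disallow lines
    """
    site_wide = []
    alternates = {}
    for raw in robots_txt.splitlines():
        # Drop robots.txt comments; a simplification that would also cut
        # a URL containing '#'.
        line = raw.split("#", 1)[0].strip()
        match = _OAI.search(line)
        if not match:
            continue  # ordinary line; a plain robots.txt parser handles it
        url = match.group(1)
        head = line[:match.start()].strip()
        if head.lower().startswith("disallow:"):
            path = head[len("disallow:"):].strip()
            alternates[path] = url
        elif not head:
            site_wide.append(url)
    return site_wide, alternates
```

A robot that does not know the extension would presumably either skip
the extended line or misread the path, which is why repeating the plain
"Disallow:" line first seems prudent.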

HTML Tags for Robots
--------------------

It would be useful to tie an existing HTML page back to the original
OAI-PMH repository from which it came, such as:

http://uk.arxiv.org/abs/astro-ph/0502028

having something like:

<META NAME="ROBOTS" OAIPMHbaseURL="http://www.arxiv.org/oai2">

It would also be useful to tie the HTML representation back to
the structured metadata from which it came:

<META NAME="ROBOTS"
OAIPMHrecord="http://www.arxiv.org/oai2?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:arXiv.org:astro-ph/0502028">

<META NAME="ROBOTS"
OAIPMHrecord="http://www.arxiv.org/oai2?verb=GetRecord&metadataPrefix=oai_marc&identifier=oai:arXiv.org:astro-ph/0502028">

This is similar to the inverse of a DC.Identifier field -- instead of
mapping from structured to un/semi-structured, it maps from
un/semi-structured to structured.
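
For illustration, a robot could pick these attributes out of a page
roughly as follows (Python, standard library only). The OAIPMHbaseURL
and OAIPMHrecord attributes are just the proposal above, not part of any
HTML specification, and the names OAIHintParser and extract_oai_hints
are invented for this sketch.

```python
from html.parser import HTMLParser


class OAIHintParser(HTMLParser):
    """Collect the proposed OAI-PMH attributes from <META NAME="ROBOTS">."""

    def __init__(self):
        super().__init__()
        self.base_urls = []
        self.record_urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        if attributes.get("oaipmhbaseurl"):
            self.base_urls.append(attributes["oaipmhbaseurl"])
        if attributes.get("oaipmhrecord"):
            self.record_urls.append(attributes["oaipmhrecord"])


def extract_oai_hints(html_text):
    """Return (base_urls, record_urls) found in an HTML document."""
    parser = OAIHintParser()
    parser.feed(html_text)
    parser.close()
    return parser.base_urls, parser.record_urls
```

The record URLs are returned as written in the page; the robot can then
issue the GetRecord request directly instead of scraping the HTML.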

comments welcome,

Michael Nelson & Herbert Van de Sompel

----
Michael L. Nelson mln at cs.odu.edu http://www.cs.odu.edu/~mln/
Dept of Computer Science, Old Dominion University, Norfolk VA 23529
+1 757 683 6393 +1 757 683 4900 (f)


