[OAI-implementers] repository auto-discovery

Michael Nelson mln at cs.odu.edu
Sun Nov 19 11:10:06 EST 2006


>well, i'm aiming at something much lower: namely, how to get the baseUrl
>of an OAI PMH data provider? and it seems particularly embarrassing, that
>i have no standard way to advertise my own service to people (including
>robots) surfing my own pages.

Herbert and I talked about this some time ago and had a preference for
adding to robots.txt to inform crawlers about baseURLs.  At the time, few
outside of the DL community were supporting OAI-PMH, but perhaps it is
time to revisit this.  Here is the proposal; the syntax could be tweaked
w/ robots.txt "Allow:", HTML <link> etc., but this should give the idea:

===

OAI-PMH baseURL discovery

Drawing from our experience with mod_oai, we see at least 3 possible
ways for robots to "automatically" discover OAI-PMH baseURLs:

1.  develop a separate file, oaipmh.txt, similar in spirit to robots.txt

2.  add to the existing robots.txt file

3.  use HTML link or META tags for robots

We do not prefer #1 - a separate file for robots to check seems unlikely
to encourage widespread adoption.

We prefer #2 because it injects OAI-PMH into the regular web
mechanics where it belongs.  Robots already look for this file -
why not put OAI-PMH statements where they expect to find guidance?

#3 can be used in some cases, but it makes an assumption that every
repository we would like a robot to find has an HTML presence.  #2 and #3
can be used separately since they address separate use cases.

robots.txt
----------

The "problem" with robots.txt is that the syntax is very simple and is
focused on telling robots what they can't do and not on what they should
do.  So in addition to having a line such as:

OAIPMHbaseURL=http://cs1.ist.psu.edu/cgi-bin/oai.cgi

we would like to extend the syntax of the "Disallow:" directive to include
alternatives:

Disallow: /citations/   http://cs1.ist.psu.edu/cgi-bin/oai.cgi

where the 2nd field on the line is the alternate access point for getting
at the information prohibited by the Disallow.  Depending on how robust
robots are with respect to extended syntax, we could repeat the line
in its plain form in case the extended line is not understood:

Disallow: /citations/
Disallow: /citations/   http://cs1.ist.psu.edu/cgi-bin/oai.cgi
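As a rough illustration of how a robot might read both forms, here is a
minimal Python sketch.  The function name parse_oai_hints is hypothetical,
and the syntax it accepts is only the one proposed above; existing
robots.txt parsers are not known to support it.

```python
def parse_oai_hints(robots_txt):
    """Collect OAI-PMH hints from robots.txt text written in the proposed syntax.

    Returns (base_urls, alternates): base_urls lists every OAIPMHbaseURL= value,
    and alternates maps a disallowed path to its alternate OAI-PMH baseURL.
    """
    base_urls = []
    alternates = {}
    for raw in robots_txt.splitlines():
        # robots.txt comments start with '#'; drop them and surrounding spaces.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        lowered = line.lower()
        if lowered.startswith("oaipmhbaseurl="):
            base_urls.append(line.split("=", 1)[1].strip())
        elif lowered.startswith("disallow:"):
            fields = line.split(":", 1)[1].split()
            # A plain Disallow has one field; the extended form has two.
            if len(fields) == 2:
                alternates[fields[0]] = fields[1]
    return base_urls, alternates


sample = """User-agent: *
OAIPMHbaseURL=http://cs1.ist.psu.edu/cgi-bin/oai.cgi
Disallow: /citations/
Disallow: /citations/   http://cs1.ist.psu.edu/cgi-bin/oai.cgi
"""
bases, alts = parse_oai_hints(sample)
```

The plain "Disallow:" line is simply skipped by this sketch, so repeating
it for older robots does no harm to a robot that understands the extension.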

HTML Tags for Robots
--------------------

It would be useful to tie an existing HTML page back to the original
OAI-PMH repository from which it came, such as:

http://uk.arxiv.org/abs/astro-ph/0502028


having something like:

<META NAME="ROBOTS" OAIPMHbaseURL="http://www.arxiv.org/oai2">
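A robot could pick this attribute up with an ordinary HTML parser.  The
following is a sketch only, using Python's standard html.parser module;
the class name BaseURLFinder is hypothetical and the OAIPMHbaseURL
attribute is the proposed one, not part of any HTML standard.

```python
from html.parser import HTMLParser


class BaseURLFinder(HTMLParser):
    """Collect OAIPMHbaseURL values from <META NAME="ROBOTS" ...> tags."""

    def __init__(self):
        super().__init__()
        self.base_urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        base_url = attr.get("oaipmhbaseurl")
        if base_url:
            self.base_urls.append(base_url)


page = '<html><head><META NAME="ROBOTS" OAIPMHbaseURL="http://www.arxiv.org/oai2"></head></html>'
finder = BaseURLFinder()
finder.feed(page)
```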

It would also be useful to tie the HTML representation back to
the structured metadata from which it came:

<META NAME="ROBOTS"
OAIPMHrecord="http://www.arxiv.org/oai2?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:arXiv.org:astro-ph/0502028">

<META NAME="ROBOTS"
OAIPMHrecord="http://www.arxiv.org/oai2?verb=GetRecord&metadataPrefix=oai_marc&identifier=oai:arXiv.org:astro-ph/0502028">

This is similar to the inverse of a DC.Identifier field -- instead of
mapping from structured to un/semi-structured, it maps from
un/semi-structured to structured.
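To illustrate the mapping back to structured metadata, a robot that has
found such a record URL could split it into the baseURL and the OAI-PMH
arguments.  This is a sketch under the assumption that the URL has the
GetRecord form shown above; split_record_url is a hypothetical helper name.

```python
from urllib.parse import urlsplit, parse_qs


def split_record_url(record_url):
    """Split a GetRecord URL into (baseURL, {argument: value})."""
    parts = urlsplit(record_url)
    # parse_qs returns a list per key; each OAI-PMH argument appears once here.
    args = {key: values[0] for key, values in parse_qs(parts.query).items()}
    base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"
    return base_url, args


base_url, args = split_record_url(
    "http://www.arxiv.org/oai2?verb=GetRecord"
    "&metadataPrefix=oai_dc&identifier=oai:arXiv.org:astro-ph/0502028"
)
```

With the baseURL and identifier recovered, the robot can request the same
record in another metadata format by changing only metadataPrefix.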




----
Michael L. Nelson mln at cs.odu.edu http://www.cs.odu.edu/~mln/
Dept of Computer Science, Old Dominion University, Norfolk VA 23529
+1 757 683 6393 +1 757 683 4900 (f)


