[OAI-implementers] repository auto-discovery

John S. Erickson john.erickson at hp.com
Sun Nov 19 16:00:27 EST 2006


Michael says, "...we see at least 3 possible ways for robots to 
"automatically" discover OAI-PMH baseURLs..."

I don't understand why you don't include baseURLs listed in the
<friends> elements of an OAI-PMH Identify response as another "way."
Granted, this is more of a p2p approach, but it technically *could be*
a way for a robot to discover baseURLs.

This is how we're accomplishing "peer federation" in our pf-dspace
project, in which DSpace instances tell their peers the baseURLs of
the DSpace instances they know about by publishing lists of <friends>
via OAI-PMH.
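
To make the <friends> route concrete, here is a minimal sketch (not the
pf-dspace code) of how a robot might pull peer baseURLs out of an Identify
response it has already fetched. The function name and the example.org
hosts are made up for illustration; the namespace is the one used by the
OAI-PMH "friends" description container.

```python
import xml.etree.ElementTree as ET

# Namespace of the optional "friends" description container.
FRIENDS_NS = "http://www.openarchives.org/OAI/2.0/friends/"


def friend_base_urls(identify_xml):
    """Return the baseURLs listed in <friends> of an Identify response."""
    root = ET.fromstring(identify_xml)
    urls = []
    # Only <baseURL> elements in the friends namespace are peers; the
    # repository's own <baseURL> lives in the OAI-PMH namespace.
    for element in root.iter("{%s}baseURL" % FRIENDS_NS):
        url = (element.text or "").strip()
        if url and url not in urls:
            urls.append(url)
    return urls


# Hypothetical Identify response, trimmed to the relevant parts.
SAMPLE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <Identify>
    <repositoryName>Example repository</repositoryName>
    <baseURL>http://repo.example.org/oai</baseURL>
    <description>
      <friends xmlns="http://www.openarchives.org/OAI/2.0/friends/">
        <baseURL>http://peer-one.example.org/oai</baseURL>
        <baseURL>http://peer-two.example.org/oai</baseURL>
      </friends>
    </description>
  </Identify>
</OAI-PMH>
"""

if __name__ == "__main__":
    for url in friend_base_urls(SAMPLE):
        print(url)
```

A robot could feed each returned baseURL back into the same routine to walk
the peer graph, keeping a set of visited URLs so that it does not loop.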

Michael Nelson wrote:
>> well, i'm aiming at something much lower: namely, how to get the baseUrl
>> of an OAI PMH data provider? and it seems particularly embarrassing, that
>> i have no standard way to advertise my own service to people (including
>> robots) surfing my own pages.
> 
> Herbert and I talked about this some time ago and had a preference for
> adding to robots.txt to inform crawlers about baseURLs.  At the time, few
> outside of the DL community were supporting OAI-PMH, but perhaps it is
> time to revisit this.  Here is the proposal; the syntax could be tweaked
> w/ robots.txt "Allow:", HTML <link> etc., but this should give the idea:
> 
> ===
> 
> OAI-PMH baseURL discovery
> 
> Drawing from our experience with mod_oai, we see at least 3 possible
> ways for robots to "automatically" discover OAI-PMH baseURLs:
> 
> 1.  develop a separate file, oaipmh.txt, similar in spirit to robots.txt
> 
> 2.  add to the existing robots.txt file
> 
> 3.  use HTML link or META tags for robots
> 
> We do not prefer #1 - a separate file for robots to check seems unlikely
> to encourage widespread adoption.
> 
> We prefer #2 because it injects OAI-PMH into the regular web
> mechanics where it belongs.  Robots already look for this file -
> why not put OAI-PMH statements where they expect to find guidance?
> 
> #3 can be used in some cases, but it makes an assumption that every
> repository we would like a robot to find has an HTML presence.  #2 and #3
> can be used separately since they address separate use cases.
> 
> robots.txt
> ----------
> 
> The "problem" with robots.txt is that the syntax is very simple and is
> focused on telling robots what they can't do and not on what they should
> do.  So in addition to having a line such as:
> 
> OAIPMHbaseURL=http://cs1.ist.psu.edu/cgi-bin/oai.cgi
> 
> We would like to expand the syntax of the "Disallow:" tag to include
> alternatives:
> 
> Disallow: /citations/   http://cs1.ist.psu.edu/cgi-bin/oai.cgi
> 
> Where the 2nd field on the line is the alternate access for how to get
> at the information prohibited in the Disallow.  Depending on how robust
> robots are with respect to extended syntax, we could repeat the line
> in case the extended line is not understood:
> 
> Disallow: /citations/
> Disallow: /citations/   http://cs1.ist.psu.edu/cgi-bin/oai.cgi
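
As a rough sketch of what a robot would have to do with the syntax proposed
above: the code below reads both the OAIPMHbaseURL= line and the two-field
Disallow: line. Neither is part of any robots.txt standard; they are only
the proposal quoted here, and the function name is hypothetical.

```python
def parse_oai_hints(robots_txt):
    """Extract proposed OAI-PMH hints from the text of a robots.txt file.

    Returns (base_urls, alternates), where alternates maps a disallowed
    path to the baseURL offered as its alternate access point.
    """
    base_urls = []
    alternates = {}
    for raw_line in robots_txt.splitlines():
        # Drop comments and surrounding whitespace, as for plain robots.txt.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        if line.startswith("OAIPMHbaseURL="):
            url = line[len("OAIPMHbaseURL="):].strip()
            if url:
                base_urls.append(url)
            continue
        field, separator, value = line.partition(":")
        if separator and field.strip().lower() == "disallow":
            parts = value.split()
            # A plain "Disallow: /path/" has one part; the extended form
            # carries a second, whitespace-separated baseURL.
            if len(parts) == 2:
                alternates[parts[0]] = parts[1]
    return base_urls, alternates


SAMPLE = """\
User-agent: *
OAIPMHbaseURL=http://cs1.ist.psu.edu/cgi-bin/oai.cgi
Disallow: /citations/
Disallow: /citations/   http://cs1.ist.psu.edu/cgi-bin/oai.cgi
"""

if __name__ == "__main__":
    print(parse_oai_hints(SAMPLE))
```

The repeated Disallow line is harmless to this parser: the plain form is
skipped and only the extended form contributes an alternate.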
> 
> HTML Tags for Robots
> --------------------
> 
> It would be useful to tie an existing HTML page back to the original
> OAI-PMH repository from which it came, such as:
> 
> http://uk.arxiv.org/abs/astro-ph/0502028
> 
> 
> having something like:
> 
> <META NAME="ROBOTS" OAIPMHbaseURL="http://www.arxiv.org/oai2">
> 
> It would also be useful to tie the HTML representation back to
> the structured metadata from which it came:
> 
> <META NAME="ROBOTS"
> OAIPMHrecord="http://www.arxiv.org/oai2?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:arXiv.org:astro-ph/0502028">
> 
> <META NAME="ROBOTS"
> OAIPMHrecord="http://www.arxiv.org/oai2?verb=GetRecord&metadataPrefix=oai_marc&identifier=oai:arXiv.org:astro-ph/0502028">
> 
> This is similar to the inverse of a DC.Identifier field -- instead of
> mapping from structured to un/semi-structured, it maps from
> un/semi-structured to structured.
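
For the HTML side, a robot could pick up the proposed attributes with an
ordinary HTML parser. This is only a sketch under the assumption that pages
carry the OAIPMHbaseURL and OAIPMHrecord attributes exactly as proposed
above; the class name is hypothetical, and the sample page escapes "&" as
"&amp;" inside the attribute value, as HTML expects.

```python
from html.parser import HTMLParser


class OAIMetaParser(HTMLParser):
    """Collect the proposed OAI-PMH attributes from robots META tags."""

    def __init__(self):
        super().__init__()
        self.base_urls = []
        self.record_urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        if attributes.get("oaipmhbaseurl"):
            self.base_urls.append(attributes["oaipmhbaseurl"])
        if attributes.get("oaipmhrecord"):
            self.record_urls.append(attributes["oaipmhrecord"])


SAMPLE = """<html><head>
<META NAME="ROBOTS" OAIPMHbaseURL="http://www.arxiv.org/oai2">
<META NAME="ROBOTS"
 OAIPMHrecord="http://www.arxiv.org/oai2?verb=GetRecord&amp;metadataPrefix=oai_dc&amp;identifier=oai:arXiv.org:astro-ph/0502028">
</head><body></body></html>"""

if __name__ == "__main__":
    parser = OAIMetaParser()
    parser.feed(SAMPLE)
    print(parser.base_urls)
    print(parser.record_urls)
```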
> 
> 
> 
> 
> ----
> Michael L. Nelson mln at cs.odu.edu http://www.cs.odu.edu/~mln/
> Dept of Computer Science, Old Dominion University, Norfolk VA 23529
> +1 757 683 6393 +1 757 683 4900 (f)
> 
