[OAI-implementers] OAI-PMH baseURL discovery

David Valentine valentine at library.ucsb.edu
Sun Feb 13 16:35:19 EST 2005


On Feb 13, 2005, at 1:03 PM, Andy Powell wrote:

> I agree with your conclusion that it is sensible to adopt both
> approaches #2 and #3.
>
> I'm not sure what mechanisms are available for updating the REP though?
> The REP pages are at
>

The REP may not need to change; the effort has just been restarted, and
it is driven by the community.

Using the last RFC draft would not break anything, and it would allow
the information to be specified easily:
  http://www.robotstxt.org/wc/norobots-rfc.html

Add a standard harvester name, and use the proposed Allow extension to
specify the path.

     User-agent: OAIPMHbaseURL
     Allow: /path_to_oai

The rules should be written so that if your harvester's user agent is
disallowed, you honor that request.
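A minimal sketch of how a harvester might read the proposed entry
(Python; the "OAIPMHbaseURL" agent name and the Allow extension are the
proposal above, not an adopted standard):

```python
# Sketch: find the proposed OAIPMHbaseURL entry in a robots.txt body.
# The agent name "OAIPMHbaseURL" and the Allow extension are the
# proposal in this thread, not part of the robots.txt standard.

def find_oai_paths(robots_txt):
    """Return the Allow paths listed under User-agent: OAIPMHbaseURL."""
    paths = []
    current_agent = None
    for line in robots_txt.splitlines():
        line = line.split('#', 1)[0].strip()  # drop comments
        if not line or ':' not in line:
            continue
        field, value = [p.strip() for p in line.split(':', 1)]
        field = field.lower()
        if field == 'user-agent':
            current_agent = value
        elif field == 'allow' and current_agent == 'OAIPMHbaseURL':
            paths.append(value)
    return paths

example = """\
User-agent: *
Disallow: /private/

User-agent: OAIPMHbaseURL
Allow: /path_to_oai
"""
print(find_oai_paths(example))  # ['/path_to_oai']
```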


> I disagree with your implied preference for using <meta> rather than
> <link>.  In this case, we clearly want to provide a link to another
> resource - therefore the semantics of the <link> tag are much more
> appropriate than the semantics of the <meta> tag.
>
> I also suspect that your suggested use of
>
> OAIPMHbaseURL="..."
>
> and
>
> OAIPMHrecord="..."
>
> breaks the (X)HTML specs (though I haven't checked)?
>
> Andy.
>
> On Sun, 13 Feb 2005, Michael Nelson wrote:
>
>>
>> (this is in response to Andy's message:
>> http://www.openarchives.org/pipermail/oai-implementers/2005-February/001407.html)
>>
>> Drawing from our experience with mod_oai, we see at least 4 possible
>> ways for robots to "automatically" discover OAI-PMH baseURLs:
>>
>> 1.  develop a separate file, oaipmh.txt, similar in spirit to
>> robots.txt
>>
>> 2.  add to the existing robots.txt file
>>
>> 3.  use HTML link or META tags for robots
>>
>> 4.  use the <friends> component in the Identify response.
>>
>> We do not prefer #1 - a separate file for robots to check seems  
>> unlikely
>> to encourage widespread adoption.
>>
>> We prefer #2 because it injects OAI-PMH into the regular web
>> mechanics where it belongs.  Robots already look for this file -
>> why not put OAI-PMH statements where they expect to find guidance?
>> Similarly, a robots.txt file is easy to install and edit (certainly
>> easier than installing most repository software packages), so there
>> will be no additional burden on a repository administrator.
>>
>> #3 can be used in some cases, but it assumes that every repository
>> we would like a robot to find has an HTML presence.  #2 and #3 can
>> be used independently, since they address separate use cases.
>>
>> #4 is important and needs to be reinforced as a way of repositories
>> "pointing" to each other.  You can't bootstrap baseURL discovery via
>> <friends>, but once a robot knows about a single baseURL, it should be
>> able to assemble a list of cooperating repositories.  No new
>> functionality is needed for <friends>, but the robot scenario
>> increases the importance of its use.
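The <friends> walk could be sketched as follows (Python; the XML is an
illustrative fragment using the real OAI-PMH friends namespace, and the
repository URLs in it are made up):

```python
# Sketch: collect repository baseURLs from the <friends> container in an
# OAI-PMH Identify response, as a robot might do once it knows one
# baseURL.  The fragment below is illustrative, not a real response.
import xml.etree.ElementTree as ET

FRIENDS_NS = 'http://www.openarchives.org/OAI/2.0/friends/'

identify_fragment = """\
<description xmlns:f="http://www.openarchives.org/OAI/2.0/friends/">
  <f:friends>
    <f:baseURL>http://repo-a.example.org/oai</f:baseURL>
    <f:baseURL>http://repo-b.example.org/cgi-bin/oai.cgi</f:baseURL>
  </f:friends>
</description>
"""

def friend_base_urls(xml_text):
    """Return every <baseURL> found under the friends namespace."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter('{%s}baseURL' % FRIENDS_NS)]

print(friend_base_urls(identify_fragment))
# ['http://repo-a.example.org/oai', 'http://repo-b.example.org/cgi-bin/oai.cgi']
```

A robot would fetch each discovered baseURL's own Identify response and
repeat, accumulating the list of cooperating repositories.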
>>
>> robots.txt
>> ----------
>>
>> The "problem" with robots.txt is that the syntax is very simple and is
>> focused on telling robots what they can't do and not on what they  
>> should
>> do.  So in addition to having a line such as:
>>
>> OAIPMHbaseURL: http://cs1.ist.psu.edu/cgi-bin/oai.cgi
>>
>> We would like to expand the syntax of the "Disallow:" tag to include
>> alternatives:
>>
>> Disallow: /citations/ OAIPMHbaseURL: http://cs1.ist.psu.edu/cgi-bin/oai.cgi
>>
>> where the second clause gives alternate access to the information
>> prohibited by the Disallow.  Depending on how robust robots are with
>> respect to extended syntax, we could repeat the line in case the
>> extended form is not understood:
>>
>> Disallow: /citations/
>> Disallow: /citations/ OAIPMHbaseURL: http://cs1.ist.psu.edu/cgi-bin/oai.cgi
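A tolerant parser for the extended syntax might look like this (Python
sketch; the pairing logic is one possible reading of the proposal, not
specified behavior):

```python
# Sketch: tolerate the proposed extended "Disallow:" syntax, pairing a
# disallowed path with the alternate OAI-PMH baseURL when one is given.
# The extension is the proposal above, not part of the robots.txt
# standard; plain Disallow lines parse unchanged.

def parse_disallow(line):
    """Return (path, base_url_or_None) for a Disallow line."""
    value = line.split(':', 1)[1].strip()
    if 'OAIPMHbaseURL:' in value:
        path, base_url = value.split('OAIPMHbaseURL:', 1)
        return path.strip(), base_url.strip()
    return value, None

print(parse_disallow('Disallow: /citations/'))
# ('/citations/', None)
print(parse_disallow('Disallow: /citations/ OAIPMHbaseURL: '
                     'http://cs1.ist.psu.edu/cgi-bin/oai.cgi'))
# ('/citations/', 'http://cs1.ist.psu.edu/cgi-bin/oai.cgi')
```

Because the plain line is repeated first, a robot that chokes on the
extended form still sees a well-formed Disallow rule.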
>>
>> HTML Tags for Robots
>> --------------------
>>
>> It would be useful to tie an existing HTML page back to the original
>> OAI-PMH repository from which it came, such as:
>>
>> http://uk.arxiv.org/abs/astro-ph/0502028
>>
>> having something like:
>>
>> <META NAME="ROBOTS" OAIPMHbaseURL="http://www.arxiv.org/oai2">
>>
>> It would also be useful to tie the HTML representation back to
>> the structured metadata from which it came:
>>
>> <META NAME="ROBOTS"
>> OAIPMHrecord="http://www.arxiv.org/oai2?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:arXiv.org:astro-ph/0502028">
>>
>> <META NAME="ROBOTS"
>> OAIPMHrecord="http://www.arxiv.org/oai2?verb=GetRecord&metadataPrefix=oai_marc&identifier=oai:arXiv.org:astro-ph/0502028">
>>
>> This is similar to the inverse of a DC.Identifier field -- instead
>> of mapping from structured to un/semi-structured, it maps from
>> un/semi-structured to structured.
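A robot could extract the proposed attributes with a lenient HTML
parser; for example (Python sketch; the OAIPMHbaseURL/OAIPMHrecord
attribute names are the proposal in this message, not part of any
(X)HTML specification):

```python
# Sketch: pull the proposed OAIPMHbaseURL / OAIPMHrecord attributes out
# of META tags.  Attribute names arrive lowercased from HTMLParser.
from html.parser import HTMLParser

class OAIMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'meta':
            return
        attrs = dict(attrs)
        for key in ('oaipmhbaseurl', 'oaipmhrecord'):
            if key in attrs:
                self.found.setdefault(key, []).append(attrs[key])

page = ('<html><head>'
        '<META NAME="ROBOTS" OAIPMHbaseURL="http://www.arxiv.org/oai2">'
        '</head></html>')
parser = OAIMetaParser()
parser.feed(page)
print(parser.found)  # {'oaipmhbaseurl': ['http://www.arxiv.org/oai2']}
```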
>>
>> comments welcome,
>>
>> Michael Nelson & Herbert Van de Sompel
>>
>> ----
>> Michael L. Nelson mln at cs.odu.edu http://www.cs.odu.edu/~mln/
>> Dept of Computer Science, Old Dominion University, Norfolk VA 23529
>> +1 757 683 6393 +1 757 683 4900 (f)
>>
>> _______________________________________________
>> OAI-implementers mailing list
>> List information, archives, preferences and to unsubscribe:
>> http://www.openarchives.org/mailman/listinfo/oai-implementers
>>
>>
>
> Andy
> --
> Distributed Systems, UKOLN, University of Bath, Bath, BA2 7AY, UK
> http://www.ukoln.ac.uk/ukoln/staff/a.powell/      +44 1225 383933
> Resource Discovery Network http://www.rdn.ac.uk/
>
>



