[OAI-implementers] RSS Feed From the UIUC OAI Registry

Thomas G. Habing thabing@uiuc.edu
Fri, 21 Nov 2003 10:53:31 -0600

Michael Nelson wrote:

> On Wed, 5 Nov 2003, Thomas G. Habing wrote:
>>Last night I also ran my gOAIglePop script.  This script programmatically 
>>does some Google searches, looking for OAI repositories.  If it finds a URL 
>>which appears to be an OAI repository it issues an Identify request.  If it 
>>gets a valid response, it has found a repository.  The results of the script 
>>run can be found at http://gita.grainger.uiuc.edu/registry/gOAIgle.xml The 
>>latest run found three previously unknown repositories (at least to the 
>>registry).  If anyone is interested, the best Google query I've found for 
>>finding OAI repositories is 'allinurl:verb=Identify'.  Type this into the 
>>Google query textbox and press Search.
> also very cool...  suggestion: perhaps do some normalization of the URLs?
> or at least normalize based on the Identify response?  for example, you
> found at least one of my repos twice:
> <baseURL>http://naca.larc.nasa.gov/oai2.0/</baseURL> 
> <baseURL>http://naca.larc.nasa.gov/oai2.0/index.cgi</baseURL> 
> which are the same repositories and give the same responses in Identify.

This is something I've been struggling with.  I've actually done a fair 
amount of manual cleanup in my registry to get rid of duplicate repositories 
that appear with slightly different baseURLs.  Discovering these can 
actually be kind of tricky because of domain name aliases, redirects, and 
other reasons.  In some cases I've found three or more different baseURLs 
for the same repository.

Duplicates seem to arise for various reasons:

   Domain name aliases

   URLs that sometimes use the numeric IP address and sometimes the domain

   URLs that sometimes explicitly include the port # 80 and sometimes not

   URLs that sometimes explicitly include the script name and other times
   rely on the default, as in the examples above

   HTTP redirects

   Probably other reasons...
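
As a rough illustration, the purely syntactic cases in this list (host name 
case, an explicit port 80, and a trailing default script name) could be 
collapsed with a small normalization function.  This is only a sketch: the 
function name and the DEFAULT_SCRIPTS list are hypothetical, and dropping a 
script name assumes the server really does serve that script by default, 
which would need to be confirmed with an actual request.  Domain aliases, 
numeric IP addresses, and redirects are not handled here.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of default script names; an assumption, not taken
# from the registry.  Stripping one is only safe if the server serves
# it for the bare directory URL.
DEFAULT_SCRIPTS = ("index.cgi", "index.php", "index.html")

# Ports that are implied by the scheme and can be dropped.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_base_url(url):
    """Return a syntactically normalized form of an OAI baseURL."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # hostname is already lowercased

    # Keep the port only when it is not the default for the scheme.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"

    # Drop a trailing default script name, keeping the directory slash.
    path = parts.path
    for name in DEFAULT_SCRIPTS:
        if path.endswith("/" + name):
            path = path[: -len(name)]
            break
    if not path:
        path = "/"

    # Query and fragment are discarded; a baseURL should not carry them.
    return urlunsplit((scheme, host, path, "", ""))
```

With this sketch, both http://naca.larc.nasa.gov/oai2.0/ and 
http://naca.larc.nasa.gov/oai2.0/index.cgi normalize to 
http://naca.larc.nasa.gov/oai2.0/, so the pair would be flagged as a likely 
duplicate.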

The rules that I've used for resolving duplicates include:

If the baseURL returned by the Identify response is the same regardless of 
the URL originally requested, and that baseURL actually works (on rare 
occasions it hasn't), I use that baseURL.

For many repositories, it seems that the baseURL reported in the Identify 
response simply reflects the URL originally used for the request.  In these 
cases, if I've discovered multiple URLs for the same repository, I will use 
the shortest baseURL.
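
The two rules above could be sketched roughly as follows.  The function and 
parameter names are hypothetical, and the fallback for cases the rules do 
not cover (for example, a consistently reported baseURL that does not work) 
is an assumption: it simply picks the shortest requested URL.

```python
def choose_base_url(reported, works=lambda url: True):
    """Pick one baseURL for a repository known under several URLs.

    reported maps each URL that was requested to the baseURL found in
    the Identify response for that request.  works is a caller-supplied
    check that a URL actually answers OAI requests (hypothetical hook;
    by default every URL is assumed to work).
    """
    if not reported:
        raise ValueError("no URLs given")

    # Rule 1: the repository reports the same baseURL no matter which
    # URL was requested, and that baseURL works.
    distinct = set(reported.values())
    if len(distinct) == 1:
        candidate = next(iter(distinct))
        if works(candidate):
            return candidate

    # Rule 2 (and assumed fallback): the reported baseURL just mirrors
    # the request, so take the shortest requested URL.  Ties are broken
    # alphabetically to keep the result deterministic.
    return min(reported, key=lambda url: (len(url), url))
```

Passing a real check for works (one that issues an Identify request) is left 
out so that the selection logic stays separate from any network access.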

Anyway, now that I have a good-sized registry built up, I am being more 
careful when adding new repositories, to prevent duplicates.  I am also 
working on ideas to better automate the discovery of possible duplicates, 
such as URL normalization, domain name lookups, or Identify response 
comparisons.
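
One possible shape for the Identify response comparison is to reduce each 
response to a fingerprint that ignores the fields expected to differ between 
requests (baseURL, request, responseDate) and then group URLs that share a 
fingerprint.  This is a sketch under assumptions: it handles only OAI-PMH 
2.0 responses, the choice of fields is a guess at what identifies a 
repository, and matching fingerprints only suggest a duplicate that still 
needs manual review.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# OAI-PMH 2.0 namespace; earlier protocol versions use other namespaces
# and are not handled by this sketch.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Identify fields assumed to be stable across different URLs for the
# same repository.  baseURL is deliberately left out.
FIELDS = ("repositoryName", "adminEmail", "earliestDatestamp",
          "protocolVersion", "granularity", "deletedRecord")


def identify_fingerprint(xml_text):
    """Return a hashable summary of an Identify response, or None."""
    root = ET.fromstring(xml_text)
    identify = root.find(OAI_NS + "Identify")
    if identify is None:
        return None  # not a (2.0) Identify response
    return tuple(
        (field, tuple(sorted((el.text or "").strip()
                             for el in identify.findall(OAI_NS + field))))
        for field in FIELDS
    )


def group_possible_duplicates(responses):
    """Group URLs whose Identify responses share a fingerprint.

    responses maps each requested URL to the Identify XML it returned.
    Only groups with more than one URL are returned.
    """
    groups = defaultdict(list)
    for url, xml_text in responses.items():
        fingerprint = identify_fingerprint(xml_text)
        if fingerprint is not None:
            groups[fingerprint].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]
```

Fetching the responses is kept out of the sketch; the caller would collect 
the XML for each candidate URL first and pass the mapping in.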

If anyone has any ideas, please share them.


Thomas Habing
Research Programmer, Digital Library Projects
University of Illinois at Urbana-Champaign
155 Grainger Engineering Library Information Center, MC-274
thabing@uiuc.edu, (217) 244-4425