[OAI-general] OAI and web crawlers

Eric Hellman eric@openly.com
Tue, 24 Jul 2001 11:10:32 -0400

Well, you can pay Inktomi about 25 cents per record to crawl your 
site. In the current economic environment, you won't find much 
interest from the mainstream search engines unless it adds to their 
bottom line.

Google is probably the most sophisticated crawler. Google ignores all 
metadata that you provide it, on the reasonable assumption that all 
webmasters stuff keywords to try to rig search results. Google will 
follow a link into a DL; having a fixed, unique URL for each item 
will maximize traffic from Google.

Google has roots in the Stanford DL program, so you could approach 
them from that direction.

You can see the results of our "overtures to the webcrawling 
community". Search for "the origins of chinese communism" at Google.

Northern Light is probably a good first target for outreach because 
they build specialty search engines.

Danny Sullivan's Search Engine Watch at 
http://www.searchenginewatch.com/ is a good place to learn more. 
Getting Danny interested in OAI would be a good way to reach out to 
this community.


At 3:34 PM -0400 7/23/01, Michael L. Nelson wrote:
>I've just added:
>	# please use our Open Archives Initiative (OAI) interface instead!
>	# http://naca.larc.nasa.gov/oai/
>	# see http://www.openarchives.org/ for more info
>to my robots.txt file for my two DLs (LTRS & NACATRS).  I doubt these
>messages will be read by humans, but stranger things have happened.
>If your DL is like mine, at any given time webcrawlers from Inktomi,
>Google, etc. are meandering about.  I don't discourage this behavior
>(cf. arXiv), mostly because it's never been too much of a problem,
>but it would seem that these crawlers would benefit from using the OAI
>interface where possible.  OAI is doing quite well within the publishing /
>library community, but has anyone made any overtures to the webcrawling
>community?  Any ideas on how to do so?  I would expect the potential for
>reduced network traffic and increased indexed content could cause them to
>modify their robots to understand OAI...
Eric Hellman
Openly Informatics, Inc.
1cate: 1-Click Access To Everything