[OAI] Re: [Ref-Links] economic effects of link-based search engines on e-journals

Sun, 1 Oct 2000 17:10:20 +0100 (BST)

On Sun, 1 Oct 2000, Eric Hellman wrote:

> >it would be more useful and relevant
> >for researchers if a special, google-style search engine were devised
> >that searched only the refereed research literature on keywords, and
> >then returned results on the basis of citation-link-frequency (i.e.,
> >the most cited papers on that keyword first).
> 
> I observe that google, AS IT EXISTS TODAY, works quite well in 
> returning useful and relevant results in areas (such as nitride 
> semiconductor research) where the content is available for spidering. 
> The assertion that a special purpose engine would be MORE useful is a 
> marketing claim made by Northern Light which I have not tested.

Eric is a technical expert here and I am not. So there may be something
I am not seeing or understanding, but it seems to me that the idea that
google itself, searching web-wide, is any sort of a solution for
researchers who want to search all and only the refereed journal
literature, is erroneous.

There is a huge difference (a world of difference in fact) between
either (1a) a consumer searching the whole web for products, or (1b) a
student or layman searching the whole web for information, on the one
hand, ranked on google's well-linkedness parameter, and (2) a
researcher, searching only that portion of the web that is tagged
"refereed," and ranked on citation-linkedness.

It seems to me that to have the latter, there has to be a reliable way
to (i) "sector" the web into just the refereed portion, and (ii) ensure
that the contents of that sector are fully citation-interlinked.

Nothing like this will fall out, as a side-effect, on the basis of the
larger, web-wide, link-ranking principle. What is needed is a reliable,
universal way of tagging all and only the items in this sector as
"refereed," and interlinking them by citation, and then harvesting
those, and only those. The Open Archives Initiative (OAI) seems to
have provided the meta-data tagging protocol. The OAI-compliant
software at www.eprints.org allows institutions worldwide to create
these interoperable archives, which authors can then fill. And the
opcit.eprints.org citation-linking software, currently adapted
specifically for the Los Alamos Physics Archive, can be adapted as an
open archives service applied to all the harvested eprint archives.
Dedicated search engines can then operate on that corpus alone, instead
of the whole web.
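
To make the idea concrete, here is a minimal Python sketch of searching
only the "refereed" sector and ranking the hits by citation count.
Everything in it is an illustrative assumption: the Paper record, its
refereed flag and cites list, the toy corpus, and the
citation_ranked_search function are hypothetical, and none of them is
part of the OAI protocol or of the eprints.org or opcit.eprints.org
software. Matching on title words merely stands in for a real keyword
index.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    paper_id: str
    title: str
    refereed: bool                              # hypothetical "refereed" tag
    cites: list = field(default_factory=list)   # ids of the papers it cites


def citation_ranked_search(papers, keyword):
    """Return (paper_id, citation_count) pairs, most-cited first."""
    # (i) "Sector" the corpus: keep only items tagged as refereed.
    sector = {p.paper_id: p for p in papers if p.refereed}

    # (ii) Count citations coming from inside the refereed sector only.
    counts = {pid: 0 for pid in sector}
    for p in sector.values():
        for cited in set(p.cites):
            if cited in counts and cited != p.paper_id:
                counts[cited] += 1

    # (iii) Keyword match on title words, then rank by citation count.
    kw = keyword.lower()
    hits = [p for p in sector.values() if kw in p.title.lower().split()]
    hits.sort(key=lambda p: (-counts[p.paper_id], p.paper_id))
    return [(p.paper_id, counts[p.paper_id]) for p in hits]


# Toy corpus (invented for illustration only).
CORPUS = [
    Paper("A", "Growth of InN films", True),
    Paper("B", "Band gap of InN", True, ["A"]),
    Paper("C", "InN and GaN alloys", True, ["A", "B"]),
    Paper("D", "Harbour Inn bed and breakfast", False, ["A"]),
    Paper("E", "Nitride lasers", True, ["A"]),
]

results = citation_ranked_search(CORPUS, "InN")
print(results)
```

In this toy corpus the unrefereed "Inn" page matches the keyword but is
never returned, and its link to paper A does not count toward A's
ranking, because both the search and the citation count are confined to
the refereed sector.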

> The interesting thing to me is that by virtue of its 
> interlinked-ness, scholarly literature tends to rank high in google 
> even without prefiltering. In some cases, interference is a problem. 
> For example, if you try to look for InN (indium nitride), you get a 
> lot of hotels and Bed-and-Breakfasts.

Here the fact that I am not technical does not disqualify me: I can say
with absolute certainty that google as a way of retrieving (what there
is of) the refereed literature on the web, and that literature only, is
completely hopeless. What a user should be able to do (with the
restricted sector and searcher I am describing) is precisely the same
thing he does if he executes, say, a search in Medline, or Inspec, or
Web of Science: He should be able to retrieve all and only the refereed
literature (but citation-ranked and full-text). No wading through
"Bed-and-Breakfasts" and thousands of other irrelevant items.

> Google is uncanny. For example, it knows to classify "Harnad" in the 
> category "Logic and Ontology:Natural Kinds".

It's interesting that it got that, based on linkedness, but that is in
fact far from being the best or most useful first-cut classification of
my work. It is no doubt an artifact of the linkedness-ranking. If you
did it in the refereed sector, using citation linking, you would get
much more accurate and useful categories.

> >For this, the refereed (and pre-refereeing) literature needs to be:
> >
> >(1) identifiable by agreed upon meta-data tagging:
> >     http://www.openarchives.org
> 
> Good, but not strictly essential. It is a matter of current 
> controversy in the search engine community as to whether metadata is 
> useful at all in open, automated environments. Of course meta tagging 
> is very useful for other applications.

The OAI protocol, and registration as an OA data-provider, as I
understand it, make it possible to selectively harvest the contents of
those archives, and those archives alone. (Web-wide, the meta-data would
be buried in a lot else, and probably not even unique.)

    http://www.openarchives.org/sfc/sfc_archives.htm

> >(2) online (preferably full-text and free):
> >     http://www.eprints.org
> 
> Necessary, but not sufficient. Content must also be available to 
> robots. The Los Alamos Archive is a prominent example of a site where 
> robots are unwelcome.

I agree completely. Not being a technical person, I cannot say how, but
I have a gut feeling that there will be a way to allow the registered,
OAI-compliant eprint archives to be automatically harvested. (In fact, I
bet that such an automatic harvester will be among the first registered
OA service-providers -- and searchers and citation-linkers will not be
far behind).

    http://www.openarchives.org/sfc/sfc_services.htm

> >and
> >
> >(3) fully citation-linked:
> >     http://opcit.eprints.org
> >
> 
> Again, necessary, but not sufficient. The links must be 
> robot-friendly. Feel free to contact me if you want details; this is 
> a technical subject.

I agree that they must be robot-friendly. Once we at Southampton have
released the final version of the (free) OAI-compliant
Eprint-archive-creating software, we will be working on providing
citation-linking services and perhaps harvesters. We will certainly
make sure that, at least for authorized daily harvesting
service-providers (whose selection can then be searched by real people),
the OAI-compliant Eprint Archives permit automatic harvesting. CogPrints
has no robot restriction -- although, admittedly, at only 1/130th of the
size of Los Alamos, it has not yet reached the size where it might need
one. CogPrints, however, like Los Alamos, is a CENTRALIZED Eprint
Archive. Once there are distributed, institutional Eprint Archives, each
holding only their own researchers' refereed papers, the harvesting and
robot problem might not come up.
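
One way an archive might admit an authorized harvester while keeping
other robots out is a robots.txt policy. The sketch below only
illustrates that idea and is not a description of how the Eprint
software does or will do it. The host archive.example and the robot
names "oai-harvester" and "some-other-robot" are hypothetical. The
policy is checked with Python's standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: one named harvester may fetch everything,
# every other robot is excluded from the whole site.
ROBOTS_TXT = """\
User-agent: oai-harvester
Disallow:

User-agent: *
Disallow: /
"""

policy = RobotFileParser()
policy.parse(ROBOTS_TXT.splitlines())   # parse in memory, no network access

record_url = "http://archive.example/records/1"   # hypothetical record URL
harvester_allowed = policy.can_fetch("oai-harvester", record_url)
other_allowed = policy.can_fetch("some-other-robot", record_url)
print(harvester_allowed, other_allowed)
```

A policy of this kind relies on robots identifying themselves honestly
and respecting the file, so it suits cooperative, registered
service-providers rather than acting as an access control.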

--------------------------------------------------------------------
Stevan Harnad                     harnad@cogsci.soton.ac.uk
Professor of Cognitive Science    harnad@princeton.edu
Department of Electronics and     phone: +44 23-80 592-582
             Computer Science     fax:   +44 23-80 592-865
University of Southampton         http://www.cogsci.soton.ac.uk/~harnad/
Highfield, Southampton            http://www.princeton.edu/~harnad/
SO17 1BJ UNITED KINGDOM           

NOTE: A complete archive of the ongoing discussion of providing free
access to the refereed journal literature online is available at the
American Scientist September Forum (98 & 99 & 00):

    http://amsci-forum.amsci.org/archives/september98-forum.html

You may join the list at the site above.

Discussion can be posted to:

    september98-forum@amsci-forum.amsci.org