[OAI-general] Re: proposed collaboration: google + open citation linking

Stevan Harnad harnad@coglit.ecs.soton.ac.uk
Wed, 6 Jun 2001 19:32:59 +0100 (BST)

On Wed, 6 Jun 2001, Terry Winograd wrote:

> tw> Users don't really care what format for meta-data a site is
> tw> compliant with, but want some validation that the materials
> tw> there are "peer reviewed".

Correct. And one of the eprints.org meta-data tags is "refereed" and
another is "journal-name". It is possible to restrict search to only
those papers that have the refereed tag ON.
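
A minimal sketch of what restricting a search to refereed papers could look like. Only the tag names "refereed" and "journal-name" come from the description above; the record layout, the sample records and the helper name refereed_only are assumptions made for illustration, not the actual eprints.org data model or API.

```python
# Hypothetical records: only the tag names "refereed" and "journal-name"
# are taken from the message; everything else is illustrative.
records = [
    {"title": "Paper A", "refereed": True, "journal-name": "Journal X"},
    {"title": "Paper B", "refereed": False, "journal-name": None},
    {"title": "Paper C", "refereed": True, "journal-name": "Journal Y"},
]


def refereed_only(items):
    """Keep only the records whose 'refereed' tag is ON (True)."""
    return [r for r in items if r.get("refereed") is True]


titles = [r["title"] for r in refereed_only(records)]
print(titles)
```

Since the tag is self-certified by the author, a filter like this only selects on the tag; it does not verify it.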

But I think you missed my point about content. The biggest and oldest
of the Open Archives, the Physics Archive (arXiv.org) has 150,000
papers (tiny by google standards but huge, among scholarly/scientific
disciplines, relative to what is NOT thus freely available online).

Many people have said across the years that the Physics ArXiv will
need some sort of "certifier" (and conceivably some day it will).

But to date, it is totally unnecessary. Those preprints and postprints
are all SELF-CERTIFIED as such by their authors, and with virtually no
exceptions what they say is true. Why? Because this is an esoteric
literature, written BY researchers, FOR researchers -- for their peers
(the same peers who do the peer review). And there is simply no
motivation to cheat. (What would be the motivation?)

And I should add that although there is no certification, there IS
screening, to exclude the obvious frauds, like porn, weight-loss, or
just plain quackery. 

CogPrints (cogprints.soton.ac.uk), the Cognitive Sciences Archive, also
has this screening, and it too has virtually nothing in it but
bona fide preprints and postprints.

Some day some more rigorous screening may be needed, but for now, the
only real quality controller that we need is peer-review itself, and
that is already taking place (as most of the 150,000 preprints in the
Physics Archive turn into refereed, published postprints).

The right way to think of the literature in this special set of
interoperable archives (particularly with the refereed tag ON) is
as simply a free online version of what you would find behind the
firewalls of the publishers of these journals, for a fee. 

And my proposal is that google might create a sector, or a sub-engine,
that searches all and only this interoperable set of archives (with
refereed tag ON if you like).

> sh>	But for now, OAI's main objective is to get the preprints and
> sh>	postprints of refereed research up there, archived in
> sh>	OAI-compliant archives.
> tw> It seems that from the point of view of a Google user, it doesn't
> tw> matter whether they are compliant to your format or not, since
> tw> Google will do its ordinary search.

I think you are not contemplating the situation I have in mind:

Here is a current exact comparison. This is a google search on
"superstring" (which, by the way, is a term of art: I could have
picked much more awkward examples):


It returns 21,800 hits, the first 10 of which are not even
refereed journal articles (let alone all and only refereed
journal articles). I have no idea what most of those hits are,
but picking out the refereed articles from among them would be 
a long and tedious exercise -- and with a more awkward search term,
could become like finding a needle in a haystack. And the google-style
link-frequency metric is no help here.

But if you instead do the search with cite-base, which is restricted
to the OAI eprint Archives:


you get 1250 hits, every single one of them a preprint or postprint,
exactly as if you had done it in an indexing/abstracting database
devoted exclusively to the refereed physics literature. And you can have
the hits ordered by author citation-impact, paper citation-impact, author
download-impact or paper download-impact.
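
The four orderings just mentioned can be sketched as a sort over a result list. This is only an illustration under stated assumptions: the field names, the sample numbers and the function order_hits are hypothetical, and nothing here describes how cite-base actually computes or stores its impact measures.

```python
from operator import itemgetter

# The four orderings named in the message, as hypothetical field names.
ORDERINGS = {
    "author_citation_impact",
    "paper_citation_impact",
    "author_download_impact",
    "paper_download_impact",
}

# Made-up hits with made-up scores, purely for illustration.
hits = [
    {"title": "A", "author_citation_impact": 200, "paper_citation_impact": 12,
     "author_download_impact": 5000, "paper_download_impact": 340},
    {"title": "B", "author_citation_impact": 150, "paper_citation_impact": 45,
     "author_download_impact": 7000, "paper_download_impact": 120},
    {"title": "C", "author_citation_impact": 900, "paper_citation_impact": 3,
     "author_download_impact": 1000, "paper_download_impact": 900},
]


def order_hits(results, by):
    """Return the hits sorted by the chosen impact measure, highest first."""
    if by not in ORDERINGS:
        raise ValueError(f"unknown ordering: {by}")
    return sorted(results, key=itemgetter(by), reverse=True)


by_citations = [h["title"] for h in order_hits(hits, "paper_citation_impact")]
by_downloads = [h["title"] for h in order_hits(hits, "paper_download_impact")]
print(by_citations, by_downloads)
```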

What made me think of collaborating with google is that although you
might say: "Sounds like you've already got what you want there, with
cite-base!" the fact is that we still have the problem of CONTENT that
I mentioned.

Physics is doing relatively well, but even there, at the present linear
growth rate, it won't be till 2011 that that entire year's Physics
papers are all free online. And the other disciplines are much further
behind.

Google, with its wide usage for other purposes, if it implemented a
scientific/scholarly sub-engine along the lines we discussed, could
help attract all that content into the archives much, much faster (while
establishing itself as the search engine for the refereed

> sh>  But what OAI will (I hope) never get involved in is the refereeing
> sh>  itself. That is not part of the function of providing
> sh>  interoperability. It is for the peer-reviewed journals and
> sh>  conferences, etc., to do the peer-reviewing and "certification" in
> sh>  that sense. OAI just provides the tagging scheme.
> tw> But it is the filtering that matters to users, not the tagging.

The filtering is done by peer review. Then the authors self-certify
their own papers as peer-reviewed, and the archive screens for
relevance and quackery. At this time, that is definitely all that is
needed. If a day comes when there is so much content up there that it
is drawing in junk too, it will not be difficult to make the screening
more rigorous, both at the archive and the archive-registry level.

Let me add that the adopters of the eprints.org archive-creating
software are universities and research institutions. And they are
doing it to increase the visibility, accessibility and impact of
their refereed research. It is in their interests to screen out the
junk too.

But right now the main objective is to draw the (pre/postprint) content
into those archives!

> sh>  I agree completely. First, a "scholarly/scientific google" would
> sh>  be the way to go (let's not leave out the nonscientific fields of
> sh>  research!),
> tw> I agree
> sh>  and even there, it would not consist only of the pre- and
> sh>  post-refereeing journal articles (but also books, etc., which may be
> sh>  neither full-text nor free nor OAI-compliant).
> tw> What is the difference between "pre-refereeing" and "anything you
> tw> want to stick on the web and claim it is being sent somewhere"?  I
> tw> was assuming here that the point of having a special category was
> tw> to have the effect of refereeing so you could trust the material
> tw> more. I can't trust something because someone just says it is
> tw> "pre-referee" and puts it on some site that uses some format.

Your expectation would be reasonable, if the actual evidence were not
exactly the contrary.

We don't need archives that "referee" preprints (those would simply
be online journals!). We need archives that make the refereed papers
available to everyone for free. As a bonus, there is also the
pre-refereeing stage of the papers -- but as always, the "caveat
emptor" rule prevails with unrefereed material.

But surely, before worrying that there might be a tremendous incentive
out there for researchers to falsely label their unrefereed preprints
as "refereed postprints" (with what journal name/date, by the way?),
you should ask yourself what is to be gained by that, especially when
balanced against what there is to lose as soon as the obvious fraud is

Best wishes,

Stevan Harnad                     harnad@cogsci.soton.ac.uk
Professor of Cognitive Science    harnad@princeton.edu
Department of Electronics and     phone: +44 23-80 592-582
             Computer Science     fax:   +44 23-80 592-865
University of Southampton         http://www.cogsci.soton.ac.uk/~harnad/
Highfield, Southampton            http://www.princeton.edu/~harnad/