[OAI-general] Re: Interoperability - subject classification/terminology

Thu, 27 Mar 2003 15:17:42 +0000 (GMT)

I suppose it was useful, overall, that Chris Gutteridge (unintentionally,
as it turned out) branched his original posting, which began this line
of discussion, to a number of other lists, apart from the OAI-general
list for which it was intended. I have since gradually phased out the
other lists from the discussion, but the value of the exercise was,
I think, in illustrating both the overlap and the distinctness between
the two "OA" movements: (1) the OAI (Open Archives Initiative), with its
technical mandate being to provide digital interoperability, and its
target being the entire digital library, and (2) the BOAI (Budapest Open
Access Initiative), with its activist mandate being to provide open,
free full-text access, and its target being the peer-reviewed research
literature only (and particularly its authors and their institutions).

What is relevant, even central, to the one can therefore be not only
irrelevant but downright misleading to the other. I reply below
accordingly:

On Thu, 27 Mar 2003, Hussein Suleman wrote:

> well, sure, i agree in principle ... if arXiv and similar projects agree 
> to bunch all physics into a single category and use google for 
> searching, with no browsing capabilities, it wouldnt be a problem at all.

The Physics ArXiv is a growing centralized subset of the Physics
literature, with its own native search capability, including a taxonomy.
It is the only such archive -- in size, scale and use -- and started
well before the OAI and BOAI. The OAI (also partly inspired by ArXiv)
has now introduced the possibility of distributed archiving, across
disciplines, integrated through interoperability. As such, it has
augmented or even replaced the notion of (1) centralized, discipline-based
archiving, with native search capabilities, by the notion of (2)
distributed, institution-based archiving, with separate OAI search services.

How (and whether) to preserve ArXiv's native search and taxonomy
functions is a technical question I leave to the experts. (One naive
thought is that the taxonomic descriptors applying to a paper are rather
like keywords, so a flat string of them could be preserved as a free-text
keyword-field, which would then be searchable in the usual boolean way;
there are probably tricks for preserving their hierarchical structure too,
if need be.)
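
To make that naive thought concrete, here is a minimal sketch in Python,
not a description of how ArXiv or any OAI service actually works. It
assumes descriptors are written as slash-separated paths; the function
names, record identifiers and category labels are all hypothetical. Each
descriptor is flattened into keyword tokens, and every ancestor prefix is
kept as a token too, which is one possible trick for preserving the
hierarchy inside a flat field.

```python
def flatten_descriptors(paths):
    """Turn hierarchical descriptors into a flat set of keyword tokens.

    A path such as "physics/hep-th" contributes itself plus every
    ancestor prefix ("physics"), so the hierarchy survives flattening.
    """
    keywords = set()
    for path in paths:
        parts = path.lower().split("/")
        for depth in range(1, len(parts) + 1):
            keywords.add("/".join(parts[:depth]))
    return keywords


def matches(keywords, all_of=(), any_of=(), none_of=()):
    """Boolean match: every all_of term, at least one any_of term
    (if any are given), and no none_of term."""
    if not all(term in keywords for term in all_of):
        return False
    if any_of and not any(term in keywords for term in any_of):
        return False
    return not any(term in keywords for term in none_of)


# Hypothetical records; identifiers and labels are made up for illustration.
records = {
    "paper-1": flatten_descriptors(["physics/hep-th", "math/algebraic-geometry"]),
    "paper-2": flatten_descriptors(["physics/cond-mat"]),
}

# "physics AND NOT physics/cond-mat"
hits = sorted(
    record_id
    for record_id, keywords in records.items()
    if matches(keywords, all_of=["physics"], none_of=["physics/cond-mat"])
)
```

Because the ancestor "physics" is stored as its own token, a query on the
broad category still finds papers tagged only with a subcategory.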

But the point is that there are no "similar projects" -- at least not
among the preprint/postprint corpora covered by BOAI. There might be
among the broader kinds of digital collections covered by OAI, but that
is another matter, and it has to be kept distinct from the BOAI, whose
concern is with getting full-text open access to the entire
preprint/postprint literature, across all disciplines and institutions,
and as soon as possible, and hence with a minimum of obstacles (of which
the design and application of discipline-specific taxonomies, by way
of a prerequisite or constraint, would indeed be one).

To set one's intuitions, it is best to imagine searching the ISI
(Institute for Scientific Information) database, which is
multidisciplinary and covers the metadata, abstracts, and references for
about 7500 journals. Imagine this augmented to all disciplines,
all journals (about 20,000) and full-text. ISI has some very general
discipline classifiers, but that's all. And that's all that's needed to
confer a wealth of searching/navigation power, especially once augmented
by Google-style full-text boolean search. No doubt such a corpus could
and would be augmented by further metaclassification schemes, but those
will be derived algorithmically, a posteriori, from the corpus itself,
rather than as a human pre-tagging, pre-classification process, applied
to each article as it is entered.

(Alerting, for example, would be a customized boolean rule, and probably
agent-based, applied across archives, rather than being a local-archive,
taxonomy-based function.)
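
As a rough illustration of alerting as a boolean rule applied across
archives, here is a small Python sketch. It assumes each archive exposes
records with an identifier, a title and an abstract; the archive names,
record fields and function names are hypothetical, and no real harvesting
protocol is invoked.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def make_rule(all_of=(), none_of=()):
    """Build a boolean alert rule: all required words, no excluded words."""
    required = {term.lower() for term in all_of}
    excluded = {term.lower() for term in none_of}

    def rule(record):
        words = tokenize(record["title"] + " " + record["abstract"])
        return required <= words and not (excluded & words)

    return rule


def new_alerts(rule, archives, seen):
    """Apply one rule across all archives, reporting each match only once."""
    alerts = []
    for archive_name, archive_records in archives.items():
        for record in archive_records:
            key = (archive_name, record["id"])
            if key not in seen and rule(record):
                seen.add(key)
                alerts.append(key)
    return alerts


# Hypothetical archives and records, made up for illustration.
archives = {
    "archive-a": [{"id": "1", "title": "Agent-based auctions",
                   "abstract": "We study bidding strategies."}],
    "archive-b": [{"id": "7", "title": "Auctions in theory",
                   "abstract": "A survey of classical results."}],
}

seen = set()
rule = make_rule(all_of=["agent", "auctions"])
first_run = new_alerts(rule, archives, seen)
second_run = new_alerts(rule, archives, seen)
```

The rule belongs to the user rather than to any one archive, and the set
of already-seen records is what turns a repeated search into an alert.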

> similarly, if we grouped together computer science, electrical 
> engineering and information systems, that would be ok for gross-level 
> interoperability ... once again, assuming searching is the only service 
> required. frankly, i think this is a little simplistic and assumes 
> digital libraries are no more than submission+search systems.

Digital libraries are no doubt more than that. But for the special
subset of the digital corpus that is the sole focus of the open-access
movement (the peer-reviewed research literature) and its main users
(researchers), searching is indeed the only service required. (This
of course includes scientometric as well as agent-based search.)
http://www.ecs.soton.ac.uk/~harnad/Temp/Ariadne-RAE.htm

> [aside: why does eprints support browsing by categories ?]

Good question! My answer would be that it is merely to support local
functionality. Whereas no one else on the planet may wish to search
only the Southampton ECS department's archive for work on agent-based
auctions, someone here at Southampton might. (But even this could be
done by a suitably constructed boolean search, shrewdly using the OAI
tags as well as the full-text, via a cross-archive search engine.) I
would be inclined to agree that, as institutional archives proliferate
and grow, their local search and taxonomic functions will fade out,
eclipsed by the powerful OAI cross-archive search capabilities.
http://www.ecs.soton.ac.uk/~lac/archpol.html

> besides, who decides what constitutes a discipline anyway ? has anyone 
> ever been able to decide if computer science is engineering or science ?

If you think classifying disciplines is arbitrary, consider the
fuzziness of the rest of the taxonomic tree!

> i think we have more questions than answers here and it isnt as simple 
> as you point out or we wouldnt even be discussing this :)

I am sure there are taxonomic complexities crucial for the digital library
in general that exceed both my imagination and my technical expertise,
but I doubt they pertain to the peer-reviewed research literature (20K
journals) in particular -- and the immediate and urgent need of its
would-be users, which is for complete, full-text, open access online.

Stevan Harnad