[OAI-general] Re: Cliff Lynch on Institutional Archives

Stevan Harnad harnad@ecs.soton.ac.uk
Sun, 16 Mar 2003 14:15:56 +0000 (GMT)

On Sat, 15 Mar 2003, Thomas Krichel wrote:

>   Stevan Harnad writes:
>sh> There is no need -- in the age of OAI-interoperability -- for
>sh> institutional archives to "feed" central disciplinary archives:
>   I do not share what I see as a  blind faith in interoperability
>   through a technical protocol. 

I am quite happy to defer to the technical OAI experts on this one, but let
us put the question precisely: 

Thomas Krichel suggests that institutional (OAI) data-archives
(full-texts) should "feed" disciplinary (OAI) data-archives,
because OAI-interoperability is somehow not enough. I suggest that
OAI-interoperability (if I understand it correctly) should be enough. No
harm in redundant archiving, of course, for backup and security, but not
necessary for the usage and functionality itself. In fact, if I understand
correctly the intent of the OAI distinction between OAI data-providers
and OAI service-providers, it is not the full-texts of data-archives
that need to be "fed" to
(i.e., harvested by) the OAI service providers, but only their metadata.

Hence my conclusion that distributed, interoperable OAI institutional
archives are enough (and the fastest route to open-access). No need
to harvest their contents into central OAI discipline-based archives
(except perhaps for redundancy, as backup). Their OAI interoperability
should be enough so that the OAI service-providers can (among other things)
do the "virtual aggregation" by discipline (or any other computable
criterion) by harvesting the metadata alone, without the need to harvest
full-text data-contents too.
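
As a rough illustration of metadata-only harvesting, here is a minimal
Python sketch. It builds an OAI-PMH ListRecords request and pulls the
Dublin Core fields out of a response. The base URL, the sample record and
the function names are hypothetical, and the response is an inline string,
so nothing is fetched over the network:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces used by OAI-PMH 2.0 responses carrying unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url):
    """Build a ListRecords request asking for Dublin Core metadata only."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
    return base_url + "?" + query


def parse_records(xml_text):
    """Return one dict of metadata per harvested record (no full text involved)."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "oai_id": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "subjects": [s.text for s in rec.iterfind(".//dc:subject", NS)],
            # Assumption: dc:identifier holds the URL of the full text,
            # which stays in the institutional archive.
            "pointer": rec.findtext(".//dc:identifier", default="", namespaces=NS),
        })
    return records


# Hypothetical response from an institutional archive (example.edu is a placeholder).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:archive.example.edu:42</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample paper</dc:title>
          <dc:subject>Economics</dc:subject>
          <dc:identifier>http://archive.example.edu/42/</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("http://archive.example.edu/oai"))
    for record in parse_records(SAMPLE_RESPONSE):
        print(record["title"], "->", record["pointer"])
```

The service-provider ends up holding only titles, subjects and pointers;
the paper itself is never copied.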

It should be noted, though, that Thomas Krichel's excellent RePEc
archive and service in Economics -- http://repec.org/ -- goes
well beyond the confines of OAI-harvesting! RePEc harvests non-OAI
content too, along lines similar to the way ResearchIndex/citeseer --
http://citeseer.nj.nec.com/cs -- harvests non-OAI content in computer
science. What I said about there being no need to "feed" institutional OAI
archive content into disciplinary OAI archives certainly does not apply
to *non-OAI* content, which would otherwise be scattered willy-nilly
all over the net and not integrated in any way. Here RePEc's and
ResearchIndex's harvesting is invaluable, especially as RePEc already
does (and ResearchIndex has announced that it plans to) make all its
harvested content OAI-compliant!

To summarize: The goal is to get all research papers, pre- and
post-peer-review, openly accessible (and OAI-interoperable) as soon as
possible. (These are BOAI Strategies 1 [self-archiving] and 2
[open-access journals]: http://www.soros.org/openaccess/read.shtml
). In principle this can be done by (1) self-archiving them in central
OAI disciplinary archives like the Physics arXiv (the biggest and
first of its kind) -- http://arxiv.org/show_monthly_submissions
-- by (2) self-archiving them in distributed institutional OAI
Archives -- http://www.ecs.soton.ac.uk/~harnad/Temp/tim.ppt -- by (3)
self-archiving them on arbitrary Web and FTP sites (and hoping they
will be found or harvested by services like RePEc or ResearchIndex)
or by (4) publishing them in open-access journals (BOAI Strategy 2:
http://www.soros.org/openaccess/journals.shtml ).

My point was only that because researchers and their institutions
(*not* their disciplines) have shared interests vested in maximizing
their joint research impact and its rewards, institution-based
self-archiving (2) is a more promising way to go -- in the age of
OAI-interoperability -- than discipline-based self-archiving (1), even
though the latter began earlier. It is also obvious that both (1) and
(2) are preferable to arbitrary Web and FTP self-archiving (3), which
began even earlier (although harvesting arbitrary Website and FTP contents
into OAI-compliant Archives is still a welcome makeshift strategy
until the practice of OAI self-archiving is up to speed). Creating new
open-access journals and converting the established (20,000) toll-access
journals to open-access is desirable too, but it is obviously a much
slower and more complicated path to open access than self-archiving,
so it should be pursued in parallel.

My conclusion in favor of institutional self-archiving is based on the
evidence and on logic, and it represents a change of thinking,
for I had originally advocated (3) Web/FTP self-archiving --
http://www.arl.org/scomm/subversive/toc.html -- then switched allegiance
to central self-archiving (1), even creating a discipline-based archive:
http://cogprints.ecs.soton.ac.uk/ But with the advent of OAI in 1999,
plus a little reflection, it became apparent that
institutional self-archiving (2) was the fastest, most direct, and most
natural road to open access: http://www.eprints.org/ 
And since then its accumulating momentum seems to be confirming that this
is indeed so: http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/2212.html

>   The primary sense of belonging
>   of a scholar in her research activities is with the disciplinary
>   community of which she thinks herself a part... It certainly
>   is not with the institution. 

That may or may not be the case, but in any case it is irrelevant to
the question of which is the more promising route to open-access. Our
primary sense of belonging may be with our family, our community,
our creed, our tribe, or even our species. But our rewards (research
grant funding and overheads, salaries, postdocs and students attracted
to our research, prizes and honors) are intertwined and shared with our
institutions (our employers) and not our disciplines (which are often
in fact the locus of competition for those same rewards!)

>   Therefore, if you want to fill
>   institutional archives---which I agree is the best long-run way
>   to enhance access and preservation to scholarly research--- [the]
>   institutional archive has to be accompanied by a discipline-based
>   aggregation process. 

But the question is whether this "aggregation" needs to be the "feeding"
of institutional OAI archive contents into disciplinary OAI archives, or
merely the "feeding" of OAI metadata into OAI services.

>    The RePEc project has produced such an aggregator
>   for economics for a while now. I am sure that other, similar
>   projects will follow the same aims, but, with the benefit of
>   hindsight, offer superior service. The lack of such services
>   in many disciplines,  or the lack of interoperability between
>   disciplinary and  institutional archives, are major obstacle to
>   the filling  the institutional archives.  There are no
>   inherent contradictions between institution-based archives
>   and disciplinary aggregators,

There is no contradiction. In fact, I suspect this will prove to be a
non-issue, once we confirm that (a) we agree on the need for
OAI-compliance and (b) "aggregation" amounts to metadata-harvesting and
OAI service-provision when the full-texts in the institutional
archive are OAI-compliant (and calls for full-text harvesting only
if/when they are not). Content "aggregation," in other words, is a
paper-based notion. In the online era, it merely means digital sorting
of the pointers to the content.
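
To make "digital sorting of the pointers" concrete, here is a small Python
sketch, assuming metadata has already been harvested from two institutional
archives. The archive names, URLs and record layout are all hypothetical;
the only point is that the discipline view is an index of pointers, not a
second copy of the papers:

```python
from collections import defaultdict


def aggregate_by_subject(harvested):
    """Group pointers to full texts by subject across several archives.

    `harvested` maps an archive name to a list of metadata records, each a
    dict with "subjects" (list of str) and "pointer" (URL of the full text).
    """
    index = defaultdict(list)
    for archive, records in harvested.items():
        for record in records:
            for subject in record["subjects"]:
                # Only the pointer is stored; the paper stays where it is.
                index[subject.strip().lower()].append((archive, record["pointer"]))
    return dict(index)


# Hypothetical metadata harvested from two institutional archives.
harvested = {
    "university-a": [
        {"subjects": ["Economics"], "pointer": "http://a.example.edu/eprints/1/"},
        {"subjects": ["Physics"], "pointer": "http://a.example.edu/eprints/2/"},
    ],
    "university-b": [
        {"subjects": ["economics", "Statistics"], "pointer": "http://b.example.edu/eprints/7/"},
    ],
}

index = aggregate_by_subject(harvested)
for archive, pointer in index["economics"]:
    print(archive, pointer)
```

Any other computable criterion (author, year, institution) could serve as
the grouping key in the same way.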

>   In the paper that Stevan refers to, Cliff Lynch writes,
>   at http://www.arl.org/newsltr/226/ir.html
>cl> But consider the plight of a faculty member seeking only broader
>cl> dissemination and availability of his or her traditional journal
>cl> articles, book chapters, or perhaps even monographs through use of
>cl> the network, working in parallel with the traditional scholarly
>cl> publishing system.
>   I am afraid, there more and more such faculty members. Much
>   of the research papers found over the Internet are deposited
>   in the way. This trend is growing not declining.

You mean self-archiving in arbitrary non-OAI author websites? There is
another reason why institutional OAI archives and official institutional
self-archiving policies (and assistance) are so important. In reality,
it is far easier to deposit and maintain one's papers in institutional
OAI archives like Eprints than to set up and maintain one's own website.
All that is needed is a clear official institutional policy, plus
some startup help in launching it. (No such thing is possible at a
"discipline" level.)

>cl> Such a faculty member faces several time-consuming problems. He or
>cl> she must exercise stewardship over the actual content and its
>cl> metadata: migrating the content to new formats as they evolve over
>cl> time, creating metadata describing the content, and ensuring the
>cl> metadata is available in the appropriate schemas and formats and
>cl> through appropriate protocol interfaces such as open archives
>cl> metadata harvesting.
>   Sure, but academics do not like their work-, and certainly
>   not their publishing-habits, [to] be interfered with by external
>   forces. Organizing academics is like herding cats!

I am sure academics didn't like to be herded into publishing with the
threat of perishing either. Nor did they like switching from paper to
word-processors. Their early counterparts probably clung to the oral
tradition, resisting writing too; and monks did not like being herded from
their peaceful manuscript-illumination chambers to the clamour of
printing presses. But where there is a causal contingency -- as there is
between (a) the research impact and its rewards, which academics like as
much as anyone else, and (b) the accessibility of their research -- academics
are surely no less responsive than Prof. Skinner's pigeons and rats to
those causal contingencies, and to which buttons they will have to press
in order to maximize their rewards!

Besides, it is not *publishing* habits that need to be changed, but
*archiving* habits, which are an online supplement, not a substitute,
for existing (and unchanged) publishing habits.

>cl> Faculty are typically best at creating new
>cl> knowledge, not maintaining the record of this process of
>cl> creation. Worse still, this faculty member must not only manage
>cl> content but must manage a dissemination system such as a personal Web
>cl> site, playing the role of system administrator (or the manager of
>cl> someone serving as a system administrator).
>   There are lot of ways in which to maintain a web site or to get
>   access to a maintained one. It is a customary activity these days and
>   no longer requires much technical expertise. A primitive integration
>   of the contents can be done by Google, it requires  no metadata.
>   Academics don't care  about long-run preservation, so that problem
>   remains unsolved. In the meantime, the academic who uploads papers to a web
>   site takes steps to resolve the most pressing problem, access.

Agreed. And uploading it into a departmental OAI Eprints Archive is
by far the simplest and most effective way to do all of that. All it
needs is a policy to mandate it.

>cl> Over the past few years, this has ceased to be a reasonable activity
>cl> for most amateurs; software complexity, security risks, backup
>cl> requirements, and other problems have generally relegated effective
>cl> operation of Web sites to professionals who can exploit economies of
>cl> scale, and who can begin each day with a review of recently issued
>cl> security patches.
>   These are technical concerns. When you operate a linux box
>   on the web you simply fire up a script that will download
>   the latest version. That is easy enough. Most departments
>   have separate web operations. Arguing for one institutional
>   archive for digital contents is akin to calling for a single web
>   site for an institution. The diseconomies of scale of central
>   administration impose other types of costs that the ones that it was to
>   reduce. The secret is to find a middle way.

I couldn't quite follow all of this. The bottom line is this: The free
Eprints.org software (for example) can be installed within a few days. It
can then be replicated to handle all the departmental or research group
archives a university wants, with minimal maintenance time or costs. The
rest is just down to self-archiving, which takes a few minutes for the
first paper, and even less time for subsequent papers (as the repeating
metadata -- author, institution, etc. -- can be "cloned" into each new
deposit template). An institution may wish to impose an institutional
"look" on all of its separate eprints archives; but apart from that,
they can be as autonomous and as distributed and as many as desired:
OAI-interoperability works locally just as well as it does globally.
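
The "cloning" of repeating metadata can be pictured with a short Python
sketch. This is only an illustration of the idea, not the way the Eprints
software implements it; the field names and the helper function are
assumptions made for the example:

```python
import copy

# Fields assumed to repeat from one deposit to the next (hypothetical names).
REPEATING_FIELDS = ("authors", "institution", "department")


def clone_deposit(previous, **new_fields):
    """Start a new deposit template from the repeating fields of an earlier one."""
    template = {
        key: copy.deepcopy(previous[key])
        for key in REPEATING_FIELDS
        if key in previous
    }
    # Per-paper fields (title, year, ...) are supplied fresh for each deposit.
    template.update(new_fields)
    return template


first = {
    "authors": ["A. Author"],
    "institution": "Example University",
    "department": "Cognitive Science",
    "title": "First paper",
    "year": 2002,
}

second = clone_deposit(first, title="Second paper", year=2003)
print(second)
```

Only the title and year had to be entered for the second deposit; the
rest was carried over from the first.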

>cl> Today, our faculty time is being wasted, and expended ineffectively,
>cl> on system administration activities and content curation. And,
>cl> because system administration is ineffective, it places our
>cl> institutions at risk: because faculty are generally not capable of
>cl> responding to the endless series of security exposures and patches,
>cl> our university networks are riddled with vulnerable faculty machines
>cl> intended to serve as points of distribution for scholarly works.
>   This is the fight many faculty face every day, where they
>   want to innovate scholarly communication, but someone
>   in the IT department does not give the necessary permission
>   for network access...

I don't think I need to get into this. It's not specific to
self-archiving, and a tempest in a teapot as far as that is concerned. An
efficient system can and will be worked out once there is an effective
institutional self-archiving policy. There are already plenty of excellent
examples, such as CalTech.

Stevan Harnad