[OAI-general] Piece in Nature re OAI

Declan Butler d.butler@nature-france.com
Wed, 5 Sep 2001 22:40:42 +0200


Dear all
I thought this piece, which will be published in tomorrow's edition of
Nature, may be of interest to you. Forgive me if it skimps on some of the
more detailed technical issues of OAI, but I hope it gets across the main
points on some of these issues to a wider audience. I'd be interested to
hear any of your comments/criticisms.
It is also freely available at
http://www.nature.com/nature/debates/e-access/
Declan Butler
European correspondent
Nature
d.butler@nature-france.com
06 September 2001

Nature 413, 1-3 (2001) © Macmillan Publishers Ltd.


The future of the electronic scientific literature



The Internet's transformation of scientific communication has only begun,
but already much of its promise is within reach. The vision below may change
in its detail, but experimentation and lack of dogmatism are undoubtedly the
way forward.


"The Internet is easier to invent than to predict" is a maxim that time has
proven to be a truism. Much the same might be said of scientific publishing
on the Internet, the history of which is littered with failed predictions.
Technological advance itself will, of course, bring dramatic changes — and
it is a safe bet that bright software minds will punctually overturn any
vision. But it is becoming clear that developing common standards will be
critical in determining both the speed and extent of progress towards a
scientific web.

'Standards' for managing electronic content are hardly a riveting topic for
researchers. But they are key to a host of issues that affect scientists,
such as searching, data mining, functionality and the creation of stable,
long-term archives of research results. Moreover, just as the Internet and
web owe their success to agreed network protocols on which others were able
to build, common standards in science will provide a foundation for a
diversity of publishing models and experiments and be a better alternative
to 'one-size-fits-all' solutions.

This explains why the Open Archives Initiative (OAI), one of many
alternatives now being offered to scientists to disseminate their work, has
now broadened its focus from e-prints to promoting common web standards for
digital content.

The reason is that some of the most promising emerging technologies will
only realize their full promise if they are adopted in a consensual fashion
by entire communities. At the level of the online scientific 'paper', one
major change, for example, is a shift in format to make papers more
computer-readable. Searches will become much more powerful; tables and
figures will cease to be flat, lifeless objects, and instead will be able to
be queried and manipulated by users, using suites of online visualization
and data-analysis tools.

This is being made possible by Extensible Mark-up Language (XML), which
allows a document to be tagged with machine-readable 'metadata', in effect
converting it into a sort of mini-database. Most web pages today are coded
in HTML. But this contains information only about a page's appearance.
HTML marks up title and author information simply as headings, for example:
<H1>The future of the electronic scientific literature</H1>
<H3>by John Smith</H3>
whereas XML specifies these in a way that computers can understand:
<articletitle>The future of the electronic scientific literature</articletitle>
<author><firstname>John</firstname><lastname>Smith</lastname></author>
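
To make the contrast concrete, here is a minimal sketch of how a program might read such a tagged record. It uses Python's standard xml.etree.ElementTree module; the element names follow the example above, while the enclosing <paper> element and the describe function are assumptions added only so that the fragment is well-formed and runnable.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: element names follow the article's example,
# wrapped in an assumed <paper> root so the fragment is well-formed.
RECORD = """
<paper>
  <articletitle>The future of the electronic scientific literature</articletitle>
  <author><firstname>John</firstname><lastname>Smith</lastname></author>
</paper>
"""


def describe(xml_text):
    """Return (title, [author names]) read from the tagged record."""
    root = ET.fromstring(xml_text)
    title = root.findtext("articletitle", default="").strip()
    authors = [
        f"{a.findtext('firstname', '').strip()} {a.findtext('lastname', '').strip()}"
        for a in root.iter("author")
    ]
    return title, authors


title, authors = describe(RECORD)
print(title)
print(authors)
```

With the HTML version, a program would see only two headings and would have to guess which one is the title and which the author.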

The possibilities for tagging are endless. But a major need now is for
stakeholders to agree on common metadata standards for the basic structure
of scientific papers. This would allow more specific queries to be made
across large swathes of the literature. Indeed, what is above all hampering
the usefulness of today's online journals, e-print archives and scientific
digital libraries is the lack of means to federate these resources through
unified interfaces.

The OAI has agreed metadata standards to facilitate improved searching
across participating archives, which can therefore be queried by users as if
they were one seamless site. The OAI is attractive compared with centralized
archives in that it allows any group to create an archive while, by agreeing
common standards, they become part of a greater whole. The idea is catching
on: it is supported by the Digital Library Federation (DLF), a consortium of
US libraries and agencies, including the Online Computer Library Center.
CrossRef, a collaboration of 78 learned society and commercial publishers,
in which Nature's publishers are taking a leading role, is also actively
developing common metadata standards that would allow better cross-searching
of the 3 million articles they hold.
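
The idea of querying separate archives "as if they were one seamless site" can be sketched as follows. This is not the OAI protocol itself, only an illustration of what agreed field names make possible; the archive names, the records and the Record fields (archive, title, creator, year) are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    archive: str  # which archive holds the paper
    title: str
    creator: str
    year: int


# Hypothetical archives and records, invented purely for illustration.
ARCHIVES = {
    "physics-eprints": [
        Record("physics-eprints", "Lattice models of protein folding", "A. Author", 2000),
    ],
    "biology-eprints": [
        Record("biology-eprints", "Protein folding in the cell", "B. Author", 2001),
        Record("biology-eprints", "DNA repair pathways", "C. Author", 1999),
    ],
}


def federated_search(archives, term):
    """Search every archive's titles as if they were one collection."""
    needle = term.lower()
    hits = [
        rec
        for records in archives.values()
        for rec in records
        if needle in rec.title.lower()
    ]
    # Oldest first, then alphabetical, regardless of source archive.
    return sorted(hits, key=lambda rec: (rec.year, rec.title))


for rec in federated_search(ARCHIVES, "protein folding"):
    print(rec.year, rec.archive, rec.title)
```

Because every archive describes its papers with the same fields, the search function needs no archive-specific code, which is the practical benefit of a lowest-common-denominator standard.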

Minimal options
As metadata are expensive to create — it is estimated that tagging papers
with even minimal metadata can add as much as 40% to costs — OAI is
developing its core metadata as a lowest common denominator to avoid putting
an excessive burden on those who wish to take part. But even these skimpy
metadata already allow one to improve retrieval. This strategy is sensible
as it acknowledges the fact that the value and nature of scientific
information are heterogeneous.

Minimal metadata will suffice for much of the literature. But there will
increasingly be sophisticated and novel forms of publications built around
highly organized communities working off large, shared data sets. These hubs
will stand out by their large investment in rich metadata and sophisticated
databases. The future electronic landscape should see such high added-value
hubs evolving as overlays to vast but largely automated literature archives
and databases.

In such an early stage of development, it is essential to avoid dogmatic
solutions. Not all papers will warrant the costs of marking up with
metadata, nor will much of the grey literature, such as conference
proceedings or the large internal documentation of government agencies. Many
high-cost, low-circulation print journals could be replaced by digital
libraries. Overheads would be kept low, and the economics argues that the
cheapest means of handling the bulk of the literature may be automated
digital libraries. Tags automatically generated from machine analysis of the
text, for example, might minimize the quantity of manual metadata needed.

Or take ResearchIndex, software produced by the computer company NEC, which
builds digital libraries with little human intervention. It gathers
scientific papers from around the web and, using simple rules based on
document formatting, can extract the title, abstract, author and references.
It interprets the latter, and can conduct automatic citation analyses for
all the papers indexed. Such digital libraries will also provide new tools,
for example to generate new metrics based on user behaviour, which will
complement and even surpass citation rankings and impact factors.
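
As a rough illustration of formatting-based extraction followed by citation counting, consider the sketch below. It is not ResearchIndex's actual method, which the article does not describe in detail; the two rules used (the first non-empty line is the title, everything after a line reading "References" is a reference) and the toy papers are assumptions chosen for brevity.

```python
from collections import Counter


def parse_paper(text):
    """Apply two simple formatting rules to a plain-text paper.

    Assumed rules: the first non-empty line is the title, and every
    non-empty line after a line reading 'References' is one reference.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    title = lines[0] if lines else ""
    refs = []
    if "References" in lines:
        refs = lines[lines.index("References") + 1:]
    return title, refs


def citation_counts(papers):
    """Count how often each reference string is cited across all papers."""
    counts = Counter()
    for text in papers:
        _, refs = parse_paper(text)
        counts.update(refs)
    return counts


# Toy documents, invented for the example.
PAPER_A = "Paper A\nSome body text.\nReferences\nPaper C\nPaper B\n"
PAPER_B = "Paper B\nMore body text.\nReferences\nPaper C\n"

print(citation_counts([PAPER_A, PAPER_B]).most_common())
```

A real system would also have to recognize that differently formatted reference strings point to the same paper, which this sketch ignores.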

At the other end of the spectrum, specialized communities organized around
shared data sets will produce highly sophisticated electronic
'publications', making it much more arduous for authors to submit
information because of the amount and detail they will be required to enter
in machine-readable form. Take the Alliance for Cellular Signaling (AfCS), a
10-year, multimillion-dollar, multidisciplinary project run by a consortium
of 20 US institutions. It is taking a systems view of proteins involved in
signalling, and integrating large amounts of data into models that will
piece together how cellular signalling functions as a whole in the cell.
Here, authors would be required to input information, for example, on the
protocols, tissues, cell types, specific concentration factors used and the
experimental outcomes. Inputs would be chosen from menus of strictly defined
terms and ranges, corresponding to predefined knowledge representations and
vocabularies for cell signalling.
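
Submission against menus of defined terms and ranges might look something like the following sketch. The field names, permitted terms and numeric range are hypothetical; the actual AfCS vocabularies are not given in the article.

```python
# Hypothetical menus and ranges, invented for illustration only.
ALLOWED_TERMS = {
    "cell_type": {"B cell", "cardiac myocyte"},
    "tissue": {"spleen", "heart"},
}
ALLOWED_RANGES = {
    "concentration_nM": (0.0, 1000.0),
}


def validate(entry):
    """Return a list of problems; an empty list means the entry is acceptable."""
    errors = []
    for field, terms in ALLOWED_TERMS.items():
        if entry.get(field) not in terms:
            errors.append(f"{field}: {entry.get(field)!r} is not a permitted term")
    for field, (low, high) in ALLOWED_RANGES.items():
        value = entry.get(field)
        if not isinstance(value, (int, float)) or not low <= value <= high:
            errors.append(f"{field}: {value!r} is outside {low}-{high}")
    return errors


good = {"cell_type": "B cell", "tissue": "spleen", "concentration_nM": 50}
bad = {"cell_type": "neuron", "tissue": "spleen", "concentration_nM": 5000}
print(validate(good))
print(validate(bad))
```

Rejecting free-text entries at submission time is what makes the pooled data uniformly queryable later, at the cost of more work for the author.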

The idea is that, rather than simply producing their own data, communities
instead create a vast, shared pool of well-structured information, and
benefit by being able to make much more powerful queries, simulations and
data mining. A series of 'molecule pages' would also pull together virtually
all published data and literature about individual molecules in relation to
signalling.

Indeed, the high-throughput nature of much of modern research means that,
increasingly, important results can be fully expressed only in electronic
rather than print format. Systems biology in particular is driving research
that seeks to describe the function of whole pathways and networks of genes
and proteins, and to cover scales ranging from atoms and molecules to
organisms. Increasingly, the literature and biological databases will
converge to create new forms of publications. Other disciplines stand to
benefit, too.

Helping machines make sense of science on the web
Many communities, including the AfCS, are building ontologies to underpin
such schemes. Ontologies mean different things to different people, but they
are in effect representations that attempt to hard-code human knowledge
about a topic and the intrinsic relationships in ways that computers can
use. The microarray community has been very active in this area. The
Microarray Gene Expression Database group has coordinated global standards;
as a result, users will be able to query vast shared data sets to find all
experiments that use a specified type of biological material, test the
effects of a specified treatment or measure the expression of a specified
gene, and much more.

One major problem is that genes and proteins often have different names in
different organisms, and these often say little about what they do. To get
round this problem, the Gene Ontology (GO) Consortium is creating tree-like
ontologies of the 'molecular function', 'biological process' and 'cellular
component' of gene products. All genes involved in 'DNA repair', for
example, would be mapped to the corresponding GO term, irrespective of their
name or source organism. A microarray gene-expression analysis that
previously yielded only names of expressed genes would in addition carry
mapped GO terms that might reveal, say, that half the genes are involved in
'protein folding'. GO terms can also help to federate disparate databases.
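
The microarray example can be sketched in a few lines. The gene names and their annotations below are invented; only the term labels 'protein folding' and 'DNA repair' come from the text, and the term_fractions function is a hypothetical helper, not part of any GO tool.

```python
from collections import Counter

# Hypothetical gene-to-term annotations, invented for illustration.
GENE_TO_GO = {
    "geneA": {"protein folding"},
    "geneB": {"protein folding", "DNA repair"},
    "geneC": {"DNA repair"},
    "geneD": set(),  # a gene with no annotation yet
}


def term_fractions(expressed, gene_to_go):
    """For each GO term, return the fraction of expressed genes mapped to it."""
    counts = Counter()
    for gene in expressed:
        counts.update(gene_to_go.get(gene, ()))
    total = len(expressed)
    return {term: n / total for term, n in counts.items()} if total else {}


fractions = term_fractions(["geneA", "geneB", "geneC", "geneD"], GENE_TO_GO)
print(fractions)
```

The list of gene names alone says little, whereas the mapped terms immediately show which processes dominate the result.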

Ontologies can also be used to tag literature automatically, and will be
particularly useful for grey literature and archival material for which
manual tagging was not justified. Papers tagged automatically with concepts
can be matched, grouped into topic maps and mined. By breaking down
terminological barriers between disciplines, this should also enhance
interdisciplinary understanding and even serendipity. Nature is actively
investigating such possibilities.

The GO ontologies are still very incomplete, however, and the internal
relationships need to be enriched. Moreover, caution is required against
prematurely pigeon-holing gene functions, given the uncertainty of most
annotations. Ontologies are also the focus of intensive research in
computing science, and biology is not yet up to speed on this. Efforts such
as GO and the Bio-Ontologies Consortium deserve support. Indeed, given the
shortcomings of existing ontologies and controlled vocabularies, there may
be a case for creating a more organized international effort to ensure
economy of effort, interoperability and sharing of expertise.

The advent of structured papers that are increasingly held in literature
databases blurs further the distinction between the scientific paper and
entries in biological databases. Already, entries in the biological
databases are often hyperlinked to relevant articles in the literature and
vice versa, and CrossRef is developing standards for such linking. As text
becomes more structured, it will be possible to increase the sophistication
of linking, data manipulation and retrieval.

Biological databases and journals have evolved relatively independently of
one another. Database annotations lack the prestige of published papers;
indeed, their value is largely ignored by citation metrics, and their upkeep
is often regarded as a thankless task. Database curation has consequently
lacked the quality control typical of good journals. The convergence between
databases and the literature means that database annotators and curators
will increasingly perform the functions of journal editors and reviewers,
while publishers will develop sophisticated database platforms and tools.

New ways in
Database- and metadata-driven systems will drive interfaces to publications
from simple keyword search models to ones that reflect the structure of
biological information. Visualization tools of chromosomal location,
biochemical pathways and structural interactions may become the obvious
portals to the wider literature, given that there are far fewer protein
structures or gene sequences than there are articles about them. As Mark
Gerstein, a bioinformaticist at Yale University, points out: "One might 'fly
through' a large three-dimensional molecular structure, such as the
ribosome, where various surface patches would be linked to publications
describing associated chemical binding studies."

Future electronic literature will therefore be much more heterogeneous than
the current journal system, and dogmatic solutions should be resisted. It is
significant and sensible that both CrossRef and OAI have
made key strategic choices favouring openness and adaptability. They seek to
federate distributed actors rather than to create centralized structures.
They also make their work independent of the type of content, which makes it
flexible enough to incorporate and link seamlessly not just papers but news,
books and other media.

Crucially, both OAI and CrossRef have also decided to build systems
independent of the economic mechanisms surrounding that content. Many
publishers, in particular some learned societies, may be willing to make
their content free, perhaps after a certain delay. Others are exploring
business models where authors or sponsors pay, which would allow free access
to articles on publication. The open technological frameworks also mean that
particular communities, such as scientists with specific metadata needs for
their discipline, are free to build in more complex data structures; the
higher overheads incurred may require charging for added-value services.

Neutrality
The OAI and CrossRef strategies therefore differ fundamentally from more
centralized systems proposed by PubMed Central (PMC), operated by the US
National Library of Medicine, and E-Biosci, being developed by the European
Molecular Biology Organization.

But PMC and E-Biosci highlight the urgent need to index the full text of
papers and their metadata and not just abstracts, as is the practice of
PubMed and other aggregators. Services that require publishers to deposit
full text only for indexing and improving search are useful.

Unfortunately, PMC, unlike E-Biosci, confounds this primarily technological
issue with an economic one, by requiring that all text be made available
free after, at most, one year. It is regrettable that PMC has not, in the
first instance, made full-text indexing a goal in its own right, as this
alone would be an immediate boon to researchers. It would also probably have
been more successful in attracting publishers.

The reality is that all of those involved in scientific publishing are in a
period of intense experimentation, the outcome of which is difficult to
predict. Making progress will require novel forms of collaboration between
publishers, databases, digital libraries and other stakeholders. It would be
unwise to put all of one's eggs in the basket of any one economic or
technological 'solution'. Diversity is the best bet.

This Opinion article has been inspired by many of the contributions to
Nature's web forum on "Future e-access to the primary literature". The
current table of contents of the forum can be found at the following
address: http://www.nature.com/nature/debates/e-access/




--------------------------------------------------------------------------------
Nature © Macmillan Publishers Ltd 2001 Registered No. 785998 England.

------=_NextPart_000_0070_01C1365B.CB266720
Content-Type: text/html;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=3DContent-Type content=3D"text/html; =
charset=3Diso-8859-1">
<META content=3D"MSHTML 5.50.4522.1800" name=3DGENERATOR></HEAD>
<BODY>
<DIV><FONT face=3DArial size=3D2><SPAN class=3D510303620-05092001>Dear=20
all</SPAN></FONT></DIV>
<DIV><SPAN class=3D510303620-05092001><FONT face=3DArial size=3D2>I =
thought this=20
piece, which will be published in tomorrow's edition of Nature may be of =

interest to you. Forgive me if it skimps on some of the more detailed =
technical=20
issues of OAI, but I hope it gets across the main points on some of =
these issues=20
to a wider audience. I'd be interested to hear any of your =
comments/criticisms.=20
</FONT>
<DIV><FONT face=3DArial size=3D2>Also available on free-access =
on&nbsp;</FONT><A=20
target=3D_new =
href=3D"http://www.nature.com/nature/debates/e-access/"><FONT=20
face=3DArial =
size=3D2>http://www.nature.com/nature/debates/e-access/</FONT></A><FONT=20
face=3DArial><FONT size=3D2>&nbsp;<SPAN=20
class=3D510303620-05092001></SPAN></FONT><BR><FONT size=3D2>Declan=20
Butler</FONT></FONT></DIV>
<DIV><SPAN class=3D510303620-05092001></SPAN><FONT face=3DArial =
size=3D2>E<SPAN=20
class=3D510303620-05092001>uropean correspondent</SPAN></FONT></DIV>
<DIV><SPAN class=3D510303620-05092001></SPAN><SPAN=20
class=3D510303620-05092001></SPAN><FONT face=3DArial><FONT =
size=3D2>N<SPAN=20
class=3D510303620-05092001>ature</SPAN><BR><A=20
href=3D"mailto:d.butler@nature-france.com">d.butler@nature-france.com</A>=
<SPAN=20
class=3D510303620-05092001> </SPAN></FONT></FONT></DIV></SPAN></DIV>
<DIV><FONT face=3DArial size=3D2><SPAN class=3D510303620-05092001>
<TABLE cellSpacing=3D0 cellPadding=3D0 width=3D"100%" border=3D0>
  <TBODY>
  <TR>
    <TD align=3Dright><IMG height=3D12 alt=3Dnature=20
      src=3D"http://www.nature.com/nature/images/s_nature.gif" =
width=3D60 border=3D0=20
      valign=3D"bottom"> <FONT face=3D"helvetica, arial, sans serif" =
size=3D2>06=20
      September 2001</FONT> </TD></TR>
  <TR>
    <TD><IMG height=3D28 alt=3DOpinion=20
      src=3D"http://www.nature.com/nature/images/hdr_opinion.gif" =
width=3D312=20
      border=3D0></TD></TR>
  <TR>
    <TD><IMG height=3D5=20
      src=3D"http://www.nature.com/nature/images/titlebar_red.gif" =
width=3D430=20
      border=3D0></TD></TR>
  <TR>
    <TD><FONT face=3D"times, times new roman, serif" =
size=3D1><I>Nature</I>=20
      <B>413</B>, 1 - 3 (2001) =A9 Macmillan Publishers Ltd.</FONT> =
</TD></TR>
  <TR>
    <TD>
      <P><IMG height=3D12 =
src=3D"http://www.nature.com/nature/images/spacer.gif"=20
      width=3D20 < p> </P></TD></TR></TBODY></TABLE><!-- header ends =
here --><!-- text starts here --><FONT=20
face=3D"times, times new roman, serif" size=3D5><B>The future of the =
electronic=20
scientific literature</B></FONT>=20
<P></P><BR>
<P><FONT face=3D"times, times new roman, serif" size=3D3><B>The =
Internet's=20
transformation of scientific communication has only begun, but already =
much of=20
its promise is within reach. The vision below may change in its detail, =
but=20
experimentation and lack of dogmatism are undoubtedly the way=20
forward.</B></FONT>=20
<P><FONT face=3D"times, times new roman, serif" size=3D3>
<P>"The Internet is easier to invent than to predict" is a maxim that =
time has=20
proven to be a truism. Much the same might be said of scientific =
publishing on=20
the Internet, the history of which is littered with failed predictions.=20
Technological advance itself will, of course, bring dramatic changes =
&#8212; and it is=20
a safe bet that bright software minds will punctually overturn any =
vision. But=20
it is becoming clear that developing common standards will be critical =
in=20
determining both the speed and extent of progress towards a scientific =
web.</P>
<P>'Standards' for managing electronic content are hardly a riveting =
topic for=20
researchers. But they are key to a host of issues that affect =
scientists, such=20
as searching, data mining, functionality and the creation of stable, =
long-term=20
archives of research results. Moreover, just as the Internet and web owe =
their=20
success to agreed network protocols on which others were able to build, =
common=20
standards in science will provide a foundation for a diversity of =
publishing=20
models and experiments and be a better alternative to =
'one-size-fits-all'=20
solutions.</P>
<P>This explains why the Open Archives Initiative (OAI), one of many=20
alternatives now being offered to scientists to disseminate their work, =
has now=20
broadened its focus from e-prints to promoting common web standards for =
digital=20
content.</P>
<P>The reason is that some of the most promising emerging technologies =
will only=20
realize their full promise if they are adopted in a consensual fashion =
by entire=20
communities. At the level of the online scientific 'paper', one major =
change,=20
for example, is a shift in format to make papers more computer-readable. =

Searches will become much more powerful; tables and figures will cease =
to be=20
flat, lifeless objects, and instead will be able to be queried and =
manipulated=20
by users, using suites of online visualization and data-analysis =
tools.</P>
<P>This is being made possible by <A target=3D_new=20
href=3D"http://www.w3.org/XML/">Extensible Mark-up Language</A> (XML), =
which=20
allows a document to be tagged with machine-readable 'metadata', in =
effect=20
converting it into a sort of mini-database. Most web pages today are =
coded in <A=20
target=3D_new href=3D"http://www.w3.org/MarkUp/">HTML</A>. But this =
contains=20
information only about a page's appearance. Whereas HTML specifies title =
and=20
author information, for example as simple headings, such as: =
<BR>&lt;H1&gt; The=20
future of the electronic scientific literature &lt;/H1&gt; =
<BR>&lt;H3&gt;by John=20
Smith&lt;/H3&gt; <BR>XML specifies these in a way that computers can =
understand:=20
<BR>&lt;articletitle&gt; The future of the electronic scientific =
literature=20
&lt;/articletitle&gt; =
&lt;author&gt;&lt;firstname&gt;John&lt;/firstname&gt;=20
&lt;lastname&gt; Smith&lt;/lastname&gt;.</P>
<P>The possibilities for tagging are endless. But a major need now is =
for=20
stakeholders to agree on common metadata standards for the basic =
structure of=20
scientific papers. This would allow more specific queries to be made =
across=20
large swathes of the literature. Indeed, what is above all hampering the =

usefulness of today's online journals, e-print archives and scientific =
digital=20
libraries is the lack of means to federate these resources through =
unified=20
interfaces.</P>
<P>The OAI has agreed metadata standards to facilitate improved =
searching across=20
participating archives, which can therefore be queried by users as if =
they were=20
one seamless site. The OAI is attractive compared with centralized =
archives in=20
that it allows any group to create an archive while, by agreeing common=20
standards, they become part of a greater whole. The idea is catching on: =
it is=20
supported by the <A target=3D_new=20
href=3D"http://www.diglib.org/dlfhomepage.htm">Digital Library =
Federation</A>=20
(DLF), a consortium of US libraries and agencies, including the <A =
target=3D_new=20
href=3D"http://www.oclc.org/home/">Online Computer Library Center</A>. =
<A=20
target=3D_new href=3D"http://www.crossref.org/">CrossRef</A>, a =
collaboration of <A=20
target=3D_new href=3D"http://www.crossref.org/members.htm">78 learned =
society and=20
commercial publishers</A>, in which <I>Nature</I>'s publishers are =
taking a=20
leading role, is also actively developing common metadata standards that =
would=20
allow better cross-searching of the 3 million articles they hold.</P>
<P><B>Minimal options</B><BR>As metadata are expensive to create &#8212; =
it is=20
estimated that tagging papers with even minimal metadata can add as much =
as 40%=20
to costs &#8212; OAI is developing its core metadata as a lowest common =
denominator to=20
avoid putting an excessive burden on those who wish to take part. But =
even these=20
skimpy metadata already allow one to improve retrieval. This strategy is =

sensible as it acknowledges the fact that the value and nature of =
scientific=20
information are heterogeneous.</P>
<P>Minimal metadata will suffice for much of the literature. But there =
will=20
increasingly be sophisticated and novel forms of publications built =
around=20
highly organized communities working off large, shared data sets. These =
hubs=20
will stand out by their large investment in rich metadata and =
sophisticated=20
databases. The future electronic landscape should see such high =
added-value hubs=20
evolving as overlays to vast but largely automated literature archives =
and=20
databases.</P>
<P>In such an early stage of development, it is essential to avoid =
dogmatic=20
solutions. Not all papers will warrant the costs of marking up with =
metadata,=20
nor will much of the grey literature, such as conference proceedings or =
the=20
large internal documentation of government agencies. Many high-cost,=20
low-circulation print journals could be replaced by digital libraries. =
Overheads=20
would be kept low, and the economics argues that the cheapest means of =
handling=20
the bulk of the literature may be automated digital libraries. Tags=20
automatically generated from machine analysis of the text, for example, =
might=20
minimize the quantity of manual metadata needed.</P>
<P>Or take <A target=3D_new=20
href=3D"http://citeseer.nj.nec.com/cs">ResearchIndex</A>, software =
produced by the=20
computer company <A target=3D_new =
href=3D"http://www.neci.nec.com/">NEC</A>, which=20
builds digital libraries with little human intervention. It gathers =
scientific=20
papers from around the web and, using simple rules based on document =
formatting,=20
can extract the title, abstract, author and references. It interprets =
the=20
latter, and can conduct automatic citation analyses for all the papers =
indexed.=20
Such digital libraries will also provide new tools, for example to =
generate new=20
metrics based on user behaviour, which will complement and even surpass =
citation=20
rankings and impact factors.</P>
<P>At the other end of the spectrum, specialized communities organized =
around=20
shared data sets will produce highly sophisticated electronic =
'publications',=20
making it much more arduous for authors to submit information because of =
the=20
amount and detail they will be required to enter in machine-readable =
form. Take=20
the <A target=3D_new href=3D"http://cellularsignaling.org/">Alliance for =
Cellular=20
Signaling</A> (AfCS), a 10-year, multimillion-dollar, multidisciplinary =
project=20
run by a consortium of 20 US institutions. It is taking a systems view =
of=20
proteins involved in signalling, and integrating large amounts of data =
into=20
models that will piece together how cellular signalling functions as a =
whole in=20
the cell. Here, authors would be required to input information, for =
example, on=20
the protocols, tissues, cell types, specific concentration factors used =
and the=20
experimental outcomes. Inputs would be chosen from menus of strictly =
defined=20
terms and ranges, corresponding to predefined knowledge representations =
and=20
vocabularies for cell signalling.</P>
<P>The idea is that, rather than simply producing their own data, =
communities=20
instead create a vast, shared pool of well-structured information, and =
benefit=20
by being able to make much more powerful queries, simulations and data =
mining. A=20
series of '<A target=3D_new=20
href=3D"http://www.cellularsignaling.org/mini_molecule_pages/">molecule =
pages</A>'=20
would also pull together virtually all published data and literature =
about=20
individual molecules in relation to signalling.</P>
<P>Indeed, the high-throughput nature of much of modern research means =
that,=20
increasingly, important results can be fully expressed only in =
electronic rather=20
than print format. Systems biology in particular is driving research =
that seeks=20
to describe the function of whole pathways and networks of genes and =
proteins,=20
and to cover scales ranging from atoms and molecules to organisms. =
Increasingly,=20
the literature and biological databases will converge to create new =
forms of=20
publications. Other disciplines stand to benefit, too.</P>
<P><B>Helping machines make sense of science on the web</B><BR>Many =
communities,=20
including the AfCS, are building ontologies to underpin such schemes. =
Ontologies=20
mean different things to different people, but they are in effect=20
representations that attempt to hard-code human knowledge about a topic =
and the=20
intrinsic relationships in ways that computers can use. The microarray =
community=20
has been very active in this area. The <A target=3D_new=20
href=3D"http://www.mged.org/">Microarray Gene Expression Database =
group</A> has=20
coordinated global standards; as a result, users will be able to query =
vast=20
shared data sets to find all experiments that use a specified type of =
biological=20
material, test the effects of a specified treatment or measure the =
expression of=20
a specified gene, and much more.</P>
<P>One major problem is that genes and proteins often have different =
names in=20
different organisms, and these often say little about what they do. To =
get round=20
this problem, the <A target=3D_new =
href=3D"http://www.geneontology.org/">Gene=20
Ontology (GO) Consortium</A> is creating <A target=3D_new=20
href=3D"http://www.informatics.jax.org/go/">tree-like ontologies</A> of =
the=20
'molecular function', 'biological process' and 'cellular component' of =
gene=20
products. All genes involved in 'DNA repair', for example, would be =
mapped to=20
the corresponding GO term, irrespective of their name or source =
organism. A=20
microarray gene-expression analysis that previously yielded only names =
of=20
expressed genes would in addition carry mapped GO terms that might =
reveal, say,=20
that half the genes are involved in 'protein folding'. GO terms can also =
help to=20
federate disparate databases.</P>
<P>Ontologies can also be used to tag literature automatically, and will =
be=20
particularly useful for grey literature and archival material for which =
manual=20
tagging was not justified. Papers tagged automatically with concepts can =
be=20
matched, grouped into topic maps and mined. By breaking down =
terminological=20
barriers between disciplines, this should also enhance interdisciplinary =

understanding and even serendipity. <I>Nature</I> is actively =
investigating such=20
possibilities.</P>
<P>The GO ontologies are still very incomplete, however, and the =
internal=20
relationships need to be enriched. Moreover, caution is required against =

prematurely pigeon-holing gene functions, given the uncertainty of most=20
annotations. Ontologies are also the focus of intensive research in =
computing=20
science, and biology is not yet up to speed on this. Efforts such as GO =
and the=20
<A target=3D_new=20
href=3D"http://smi-web.stanford.edu/projects/bio-ontology/">Bio-Ontologie=
s=20
Consortium</A> deserve support. Indeed, given the shortcomings of =
existing=20
ontologies and controlled vocabularies, there may be a case for creating =
a more=20
organized international effort to ensure economy of effort, =
interoperability and=20
sharing of expertise.</P>
<P>The advent of structured papers that are increasingly held in =
literature=20
databases blurs further the distinction between the scientific paper and =
entries=20
in biological databases. Already, entries in the biological databases =
are often=20
hyperlinked to relevant articles in the literature and vice versa, and =
CrossRef=20
is developing standards for such linking. As text becomes more =
structured, it=20
will be possible to increase the sophistication of linking, data=20
manipulation and retrieval.</P>
<P>Biological databases and journals have evolved relatively =
independently of=20
one another. Database annotations lack the prestige of published papers; =
indeed,=20
their value is largely ignored by citation metrics, and their upkeep is =
often=20
regarded as a thankless task. Database curation has consequently lacked =
the=20
quality control typical of good journals. The convergence between =
databases and=20
the literature means that database annotators and curators will =
increasingly=20
perform the functions of journal editors and reviewers, while publishers =
will=20
develop sophisticated database platforms and tools.</P>
<P><B>New ways in</B><BR>Database- and metadata-driven systems will =
drive=20
interfaces to publications from simple keyword search models to ones =
that=20
reflect the structure of biological information. Visualization tools of=20
chromosomal location, biochemical pathways and structural interactions =
may=20
become the obvious portals to the wider literature, given that there are =
far=20
fewer protein structures or gene sequences than there are articles about =
them.=20
As <A target=3D_new href=3D"http://bioinfo.mbb.yale.edu/">Mark =
Gerstein</A>, a=20
bioinformaticist at Yale University, points out: "One might 'fly =
through' a=20
large three-dimensional molecular structure, such as the <A =
target=3D_new=20
href=3D"http://smi-web.stanford.edu/projects/helix/riboweb.html">ribosome=
</A>,=20
where various surface patches would be linked to publications describing =

associated chemical binding studies."</P>
<P>Future electronic literature will therefore be much more =
heterogeneous than=20
the current journal system, and dogmatic solutions should be =
resisted.=20
It is significant and sensible that both CrossRef and OAI have made key=20
strategic choices favouring openness and adaptability. They seek to =
federate=20
distributed actors rather than to create centralized structures. They =
also make=20
their work independent of the type of content, which makes it flexible =
enough to=20
incorporate and link seamlessly not just papers but news, books and =
other=20
media.</P>
<P>Crucially, both OAI and CrossRef have also decided to build systems=20
independent of the economic mechanisms surrounding that content. Many=20
publishers, in particular some learned societies, may be willing to make =
their=20
content free, perhaps after a certain delay. Others are exploring =
business=20
models where authors or sponsors pay, which would allow free access to =
articles=20
on publication. The open technological frameworks also mean that =
particular=20
communities, such as scientists with specific metadata needs for their=20
discipline, are free to build in more complex data structures; the =
higher=20
overheads incurred may require charging for added-value services.</P>
<P><B>Neutrality</B><BR>The OAI and CrossRef strategies therefore differ =

fundamentally from more centralized systems such as <A target=3D_new =

href=3D"http://www.pubmedcentral.nih.gov/">PubMed Central</A> (PMC), =
operated by=20
the <A target=3D_new href=3D"http://www.nlm.nih.gov/">US National =
Library of=20
Medicine</A>, and <A target=3D_new=20
href=3D"http://www.embo.org/E_Pub_pages.html">E-Biosci</A>, being =
developed by the=20
<A target=3D_new href=3D"http://www.embo.org/index.html">European =
Molecular Biology=20
Organization</A>.</P>
<P>But PMC and E-Biosci highlight the urgent need to index the full text =
of=20
papers and their metadata and not just abstracts, as is the practice of =
<A=20
target=3D_new=20
href=3D"http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=3DPubMed">PubMed=
</A> and=20
other aggregators. Services that require publishers to deposit full text =
only=20
for indexing and improving search are useful.</P>
<P>Unfortunately, PMC, unlike E-Biosci, confounds this primarily =
technological=20
issue with an economic one, by requiring that all text be made available =
free=20
after, at most, one year. It is regrettable that PMC has not in the =
first=20
instance sought full-text indexing as a goal in itself, as this alone =
would be=20
an immediate boon to researchers. It would also probably have been more=20
successful in attracting publishers.</P>
<P>The reality is that all of those involved in scientific publishing =
are in a=20
period of intense experimentation, the outcome of which is difficult to =
predict.=20
Getting there will require novel forms of collaboration between =
publishers,=20
databases, digital libraries and other stakeholders. It would be unwise =
to put=20
all of one's eggs in the basket of any one economic or technological =
'solution'.=20
Diversity is the best bet.</P>
<P><I>This Opinion article has been inspired by many of the =
contributions to=20
</I>Nature<I>'s web forum on "Future e-access to the primary =
literature". The=20
current table of contents of the forum can be found at the following=20
address:</I> <A target=3D_new=20
href=3D"http://www.nature.com/nature/debates/e-access/">http://www.nature=
.com/nature/debates/e-access/</A></P></FONT>
<BR>
<HR align=3Dleft width=3D480>
<FONT face=3D"helvetica, arial, sans serif"=20
size=3D2>Nature =A9 Macmillan Publishers Ltd 2001 Registered No. 785998=20
England.</FONT> </SPAN></FONT></DIV></BODY></HTML>

------=_NextPart_000_0070_01C1365B.CB266720--