[OAI-implementers] Automatically gathering the full-text of eprints

Andy Powell a.powell@ukoln.ac.uk
Wed, 17 Mar 2004 13:28:29 +0000 (GMT Standard Time)

The JISC-funded ePrints UK project has a requirement to automatically
harvest both metadata and full-text from the eprint archives within UK
academia (and potentially elsewhere).  This is so that we can pass both
metadata and full-text to the various 'enhancement' Web services offered
by our partners.


In order for our harvesting robot to be able to do this, it must be able
to reliably (and automatically) determine the correct URL(s) for the
various full-text manifestation(s) (HTML, PDF, RTF, etc.) of each eprint.

Our "Using simple Dublin Core to describe eprints" guidelines are intended
to encourage greater consistency in the metadata that is exposed by eprint
archives using the 'oai_dc' format within the OAI Protocol for Metadata
Harvesting (OAI-PMH).  Somewhat perversely, because we stick rigidly to
the semantics of the DC element set, our guidelines make it quite
difficult to determine the URL of each available manifestation.  (This is
largely a consequence of the 'simple' nature of 'simple DC'!)  In
general, the URL in the <dc:identifier> element of the oai_dc record is
the URL of a jump-off page, rather than a direct link to the full-text.
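
As a rough illustration of the harvester's starting point, the sketch
below pulls the URL-valued <dc:identifier> elements out of an oai_dc
record.  The sample record, the example.org URL and the function name are
made up for illustration; only the two namespace URIs are the standard
oai_dc and Dublin Core ones.  Note that what comes back is typically the
jump-off page, not the full-text itself.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for oai_dc records and the DC element set.
NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical record, invented for this example.
SAMPLE_RECORD = """\
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example eprint</dc:title>
  <dc:identifier>http://eprints.example.org/archive/00000042/</dc:identifier>
  <dc:identifier>ISSN 0000-0000</dc:identifier>
</oai_dc:dc>
"""


def identifier_urls(oai_dc_xml):
    """Return the dc:identifier values that look like HTTP(S) URLs."""
    root = ET.fromstring(oai_dc_xml)
    urls = []
    for element in root.iterfind(".//dc:identifier", NS):
        text = (element.text or "").strip()
        # Identifiers may also be ISSNs, DOIs, etc.; keep only web URLs.
        if text.startswith(("http://", "https://")):
            urls.append(text)
    return urls


if __name__ == "__main__":
    for url in identifier_urls(SAMPLE_RECORD):
        print(url)
```

Nothing in the record says whether such a URL points at a PDF, an HTML
version or a jump-off page, which is exactly the ambiguity described
above.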

We would like to propose a way of unambiguously embedding the
URL for each manifestation of an eprint into the (X)HTML jump-off page for
that eprint.  Since the jump-off page is generated automatically by the
eprint archive software, doing this shouldn't be too difficult (in fact,
we would hope that archive software, such as eprints.org, will be
configured to do this out of the box).
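
To give a flavour of what a harvester could then do, here is a minimal
sketch that reads manifestation URLs out of a jump-off page.  It ASSUMES,
purely for illustration, that each manifestation is exposed as a
<link rel="alternate" type="..." href="..."> element in the page head;
the markup actually proposed may well differ, and the class name,
function name, sample page and example.org URLs are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ManifestationLinkParser(HTMLParser):
    """Collect (MIME type, absolute URL) pairs from <link> elements.

    Assumption: manifestations are marked with rel="alternate".
    This is an illustrative convention, not a confirmed one.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.manifestations = []

    def handle_starttag(self, tag, attrs):
        # Also called for XHTML-style self-closing tags (<link ... />).
        if tag != "link":
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "alternate" in rel_values and attr.get("href") and attr.get("type"):
            # Resolve relative hrefs against the jump-off page URL.
            url = urljoin(self.base_url, attr["href"])
            self.manifestations.append((attr["type"], url))


def find_manifestations(page_html, base_url):
    """Return a list of (mime_type, url) found in the jump-off page."""
    parser = ManifestationLinkParser(base_url)
    parser.feed(page_html)
    parser.close()
    return parser.manifestations


# Hypothetical jump-off page, invented for this example.
SAMPLE_PAGE = """\
<html><head>
<title>An example eprint</title>
<link rel="alternate" type="application/pdf" href="01/paper.pdf" />
<link rel="alternate" type="text/html"
      href="http://eprints.example.org/archive/00000042/01/paper.html" />
<link rel="stylesheet" type="text/css" href="/style.css" />
</head><body><p>Abstract and links go here.</p></body></html>
"""

if __name__ == "__main__":
    base = "http://eprints.example.org/archive/00000042/"
    for mime_type, url in find_manifestations(SAMPLE_PAGE, base):
        print(mime_type, url)
```

Whatever markup is finally agreed, the point is the same: the robot needs
only a standard HTML parser and the jump-off page URL, with no
archive-specific screen-scraping rules.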

If this proposal is adopted, it will make it much easier to write OAI
service provider software that can reliably gather the full-text of an
eprint, given only the oai_dc record for that eprint.

The proposal is at


Comments are welcome,

