[OAI-implementers] Automatically gathering the full-text of eprints

herbert van de sompel herbertv@lanl.gov
Wed, 17 Mar 2004 15:54:55 -0700

Tim Brody wrote:

> We've done a preliminary implementation of this at Southampton for:
> eprints.ecs.soton.ac.uk
> and
> eprints.soton.ac.uk
> It took me about an hour to do, I suspect Chris did it in much less time :-)

I trust that the number of seconds it takes to implement a solution is not the 
only evaluation criterion.  I very much agree that it is an important one, and 
it has always played a significant role in the design of the OAI-PMH and 
related specifications.  But it seems to me that other criteria, such as 
meeting functional requirements, also come into play.  I have, obviously, not 
seen the list of requirements.  I do understand the goals, however.  And, as 
described in my previous mail, I can think of some possible requirements 
related to those goals that may not be met by the proposed solution.

This consideration clearly allows for solutions to the problem other than the 
one based on complex objects, which I described in my previous mail.  I 
suggested the complex object path because we have done a good deal of work in 
that realm, and because that work has pushed us to think in a general way about 
the content-harvesting problem.



> All the best,
> Tim.
>
> Andy Powell wrote:
>> The JISC-funded ePrints UK project has a requirement to automatically
>> harvest both metadata and full-text from the eprint archives within UK
>> academia (and potentially elsewhere).  This is so that we can pass both
>> metadata and full-text to the various 'enhancement' Web services offered
>> by our partners.
>> http://www.rdn.ac.uk/projects/eprints-uk/
>> In order for our harvesting robot to be able to do this, it must be able
>> to reliably (and automatically) determine the correct URL(s) for the
>> various full-text manifestation(s) (HTML, PDF, RTF, etc.) of each eprint.
>> Our "Using simple Dublin Core to describe eprints" guidelines are intended
>> to encourage greater consistency in the metadata that is exposed by eprint
>> archives using the 'oai_dc' format within the OAI Protocol for Metadata
>> Harvesting (OAI-PMH).  Somewhat perversely, because we stick rigidly to
>> the semantics of the DC element set, our guidelines make determining the
>> URL of each manifestation that is available quite difficult.  (This is
>> largely a consequence of the 'simple' nature of 'simple DC'!).  In
>> general, the URL in the <dc:identifier> element of the oai_dc record is
>> the URL of a jump-off page, rather than a direct link to the full-text.
>> We would like to suggest a new proposal for unambiguously embedding the
>> URL for each manifestation of an eprint into the (X)HTML jump-off page for
>> that eprint.  Since the jump-off page is generated automatically by the
>> eprint archive software, doing this shouldn't be too difficult (in fact,
>> we would hope that archive software, such as eprints.org, will be
>> configured to do this out of the box).
>> If this proposal is adopted, it will make it much easier to write OAI
>> service provider software that can reliably gather the full-text of an
>> eprint, given only the oai_dc record for that eprint.
>> The proposal is at
>> http://www.rdn.ac.uk/projects/eprints-uk/docs/encoding-fulltext-links/
>> Comments are welcome,
>> Andy
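
To make the harvesting step in Andy's proposal concrete, here is a minimal
sketch of what a service provider might do: take the jump-off URL from the
<dc:identifier> of an oai_dc record, then read the full-text URLs from the
jump-off page.  The actual encoding is defined in the proposal document and is
not reproduced in this thread, so the <link rel="alternate"> convention used
here is only an assumption, as are the function names, the sample record and
the example.org URLs.

```python
# Sketch only: the <link rel="alternate"> convention is an assumption,
# not the encoding defined in the ePrints UK proposal.
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin

DC_NS = "http://purl.org/dc/elements/1.1/"


def jump_off_urls(oai_dc_xml):
    """Return the http(s) URLs found in the record's <dc:identifier> elements."""
    root = ET.fromstring(oai_dc_xml)
    urls = []
    for element in root.iter("{%s}identifier" % DC_NS):
        text = (element.text or "").strip()
        if text.startswith(("http://", "https://")):
            urls.append(text)
    return urls


class FullTextLinkParser(HTMLParser):
    """Collect (MIME type, absolute URL) pairs from <link rel="alternate">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Also called for XHTML-style self-closing <link ... /> elements.
        if tag != "link":
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "alternate" in rels and href:
            mime_type = attributes.get("type") or ""
            self.links.append((mime_type, urljoin(self.base_url, href)))


def full_text_links(html, base_url):
    """Parse a jump-off page and return its full-text manifestation links."""
    parser = FullTextLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample data, for illustration only.
RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example eprint</dc:title>
  <dc:identifier>http://eprints.example.org/42/</dc:identifier>
</oai_dc:dc>"""

PAGE = """<html><head>
<link rel="stylesheet" href="/style.css" />
<link rel="alternate" type="application/pdf" href="42.pdf" />
<link rel="alternate" type="text/html"
      href="http://eprints.example.org/42/paper.html" />
</head><body>...</body></html>"""
```

A harvester would call jump_off_urls() on each harvested record, fetch each
returned page, and pass the page body to full_text_links() together with the
page URL so that relative links resolve correctly.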

Herbert Van de Sompel
digital library research & prototyping
Los Alamos National Laboratory - Research Library
+ 1 (505) 667 1267 / http://lib-www.lanl.gov/~herbertv/

"penetrating the dance floor in ironed jeans"