[OAI-implementers] Automatically gathering the full-text of eprints

herbert van de sompel herbertv@lanl.gov
Wed, 17 Mar 2004 12:19:39 -0700

Dear Andy,

The problem of service providers needing access to content in addition to 
metadata has come up in many discussions lately, including in the realm of the 
DARE, DINI, JISC, DSpace, Fedora, etc. work.  It so happens that my team in Los 
Alamos has recently done quite some work in this realm, as is illustrated by the 
most recent papers listed on my personal web site.

Here is some initial feedback to the proposal.   The proposal relies on:

a. The assumption that a harvester knows that something that is in the 
dc.identifier element of oai_dc points to a - compliant - jump-off page.  There 
are two problems with this assumption:
- lots of things can be in the dc.identifier element, both resolvable and 
non-resolvable
- lots of things at the end of the thing identified by the content of 
dc.identifier (if resolvable) will not be compliant jump-off pages

This means harvesters never really know when they are facing the scenario that 
you target, and hence will do a lot of meaningless dereferencing and parsing. 
One could think of addressing this to some extent by a special-purpose 
Descriptor in the Identify response to indicate that a repository actually is 
'compliant' but that would still leave the harvester guessing about which of the 
dc.identifiers (if there are multiple) is the magic one.

b. The actual existence of a 'jump-off' page.  This is something that - in the 
context of the OAI-PMH (with its disconnection of data provider, DP, and 
service provider, SP) - we cannot just take for granted or assume.
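
To illustrate point a, here is a minimal Python sketch that parses an oai_dc 
record carrying several dc.identifier values.  The record, its URLs and the 
function name candidate_urls are invented for illustration; only the two 
namespace URIs are the real oai_dc and Dublin Core ones.  About the most a 
harvester can do is filter on the URL scheme, which still leaves more than one 
candidate and says nothing about whether any of them is a compliant jump-off 
page.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical oai_dc record; every identifier value is made up.
SAMPLE_RECORD = """\
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example eprint</dc:title>
  <dc:identifier>http://repository.example.org/archive/00000042/</dc:identifier>
  <dc:identifier>doi:10.9999/example.1</dc:identifier>
  <dc:identifier>http://publisher.example.com/articles/42</dc:identifier>
  <dc:identifier>ISSN 0000-0000</dc:identifier>
</oai_dc:dc>
"""


def candidate_urls(record_xml):
    """Return the dc:identifier values that look like HTTP(S) URLs.

    This is only a guess at what is resolvable; it cannot tell a
    jump-off page from any other web resource.
    """
    root = ET.fromstring(record_xml)
    values = [(el.text or "").strip()
              for el in root.iter("{%s}identifier" % DC_NS)]
    return [v for v in values if v.startswith(("http://", "https://"))]


if __name__ == "__main__":
    for url in candidate_urls(SAMPLE_RECORD):
        # Each candidate would have to be dereferenced and parsed
        # before the harvester learns whether it was worth the trip.
        print("would have to fetch:", url)
```

In this made-up record two identifiers survive the filter, so the harvester 
still has to dereference both to find out which one, if any, is the jump-off 
page.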

There are other problems related to obtaining content which are not covered by 
the solution:

* How does a harvester know when to go after an update to content?  The OAI-PMH
indicates that the datestamp of a record only changes when the metadata has 
changed, it doesn't say anything about the content.  I suggest it should stay 
that way.  So, in the proposed solution, content in a repo can change without 
the harvester ever knowing about it.

* The scenario as described in the proposal, in which a single metadata record 
corresponds with a single "preprint" is only a special case of - future - 
reality.  Increasingly, objects held in and described by repositories will be 
"compound" or "complex", i.e. consisting of multiple datastreams, not just a 
single "preprint".  I find that it would be desirable that a solution to get to 
the content would be able to handle such situations.  The proposed solution 
could actually accommodate such 'compound' objects, because the multiple 
datastreams are linked off the jump-off page.  There is, however, a problem. 
Let's presume we have a situation in which an object is deposited in an 
institutional repository that has 2 datastreams, each of which actually has a 
unique identifier, say a doi or something.  Thinking of a - future - 
self-archiving scenario and the trend to accord identifiers at finer levels of 
granularity, this is not unlikely at all.  Now we get 3 things in dc.identifier 
(2 doi's and a link to a jump-off page), and 2 things in the jump-off page 
(links to the 2 datastreams).  How do I know which doi goes with which 
datastream?  Information that - I hope we will all agree - is rather significant.
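
The pairing problem in the last bullet can be made concrete with a small 
Python sketch.  All identifiers and URLs below are hypothetical, as is the 
function name possible_pairings.  It simply enumerates every way of matching 
the identifiers found in dc.identifier to the links found on the jump-off 
page; since both are flat, unordered lists, nothing favours one matching over 
another.

```python
from itertools import permutations

# Hypothetical values, as a harvester would see them:
# two DOIs from dc.identifier, two datastream links from the jump-off page.
DOIS = ["doi:10.9999/example.a", "doi:10.9999/example.b"]
DATASTREAMS = [
    "http://repository.example.org/42/paper.pdf",
    "http://repository.example.org/42/dataset.csv",
]


def possible_pairings(identifiers, streams):
    """List every one-to-one assignment of identifiers to datastreams."""
    return [list(zip(identifiers, ordering))
            for ordering in permutations(streams)]


if __name__ == "__main__":
    for pairing in possible_pairings(DOIS, DATASTREAMS):
        print(pairing)
```

With two identifiers and two datastreams there are already two equally 
plausible assignments, and the count grows factorially with the number of 
datastreams.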

OK.  The point I am trying to make is that the described scenario and its more 
general problem domain (beyond eprints, and into the realm of objects with 
multiple datastreams) may call for another approach.  Our research has shown 
that such an approach can remain 100% OAI-PMH-based if a complex object format 
such as METS, MPEG-21 DIDL or SCORM is used.  These formats can be "parallel" 
OAI-PMH "metadata formats" through which harvesters can get to the content 
without running into issues such as the ones mentioned above.  Content can be 
embedded in the XML wrappers or pointed at by them.  Identifiers can be 
unambiguously connected to content.  If content changes, the datestamp of the 
"complex" record changes.

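As a rough sketch of what "parallel metadata format" means for a harvester, 
the Python below builds two OAI-PMH ListRecords requests against the same 
repository: one for oai_dc and one for a complex object format.  The verb and 
the metadataPrefix and from arguments are standard OAI-PMH; the base URL and 
the prefix value "didl" are assumptions made for illustration, since the 
prefix a repository actually uses is whatever it advertises in its 
ListMetadataFormats response.

```python
from urllib.parse import urlencode

BASE_URL = "http://repository.example.org/oai"  # hypothetical base URL
COMPLEX_PREFIX = "didl"  # assumed prefix; check ListMetadataFormats


def list_records_url(base_url, metadata_prefix, from_date=None):
    """Build a ListRecords request, optionally limited by datestamp."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Selective harvesting: only records whose datestamp is on or
        # after this date are returned.
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Descriptive metadata only.
    print(list_records_url(BASE_URL, "oai_dc"))
    # Same protocol, but the record is the complex object itself, so an
    # incremental harvest picks up content changes through the datestamp.
    print(list_records_url(BASE_URL, COMPLEX_PREFIX, "2004-03-01"))
```

The only difference between the two requests is the metadataPrefix, which is 
why no machinery beyond the OAI-PMH itself is needed on the harvester side.
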
I anticipate concerns re the overhead of introducing a solution based on a 
complex object format.  At this point, I would like to say 2 things in this 
regard:

* It took 2 people on my team about 2 days to create a prototype plug-in that 
enables OAI-PMH harvesting of content from DSpace repositories.  Our plug-in 
rendered content using the MPEG-21 DIDL XML wrapper format.  Most of the time 
invested in this plug-in was spent figuring out the DSpace API and a sensible 
way to map the DSpace data model to the DIDL data model.  The prototype was 
demonstrated at the DSpace federation meeting, last week.  Although 
questions/issues did arise in the course of our work, none seemed unsolvable. 
But it is my impression that the very fast delivery of a prototype indicates the 
feasibility of the complex format approach.

* I would personally be very willing to spend time with the appropriate 
representatives of the community - including yourself - to work towards a 
solution that is future-proof and provides adequate guarantees regarding 
perceived requirements of a content-harvesting solution.  I would actually 
prefer that over going for a solution which is attractive at first glance 
because of its obvious simplicity, but which seems to raise some relevant 
questions upon closer inspection.

To end, I would like to thank you for bringing this topic to the list.  I have 
had many private email exchanges over the last few months especially with 
representatives from DARE and DINI about this and related problem domains.  I 
hope that your mail can be another impulse towards a joint action in this realm. 
The problem is very real, and I would love our community to jointly create a 
really good solution to it.

many greetings


Andy Powell wrote:

> The JISC-funded ePrints UK project has a requirement to automatically
> harvest both metadata and full-text from the eprint archives within UK
> academia (and potentially elsewhere).  This is so that we can pass both
> metadata and full-text to the various 'enhancement' Web services offered
> by our partners.
> http://www.rdn.ac.uk/projects/eprints-uk/
> In order for our harvesting robot to be able to do this, it must be able
> to reliably (and automatically) determine the correct URL(s) for the
> various full-text manifestation(s) (HTML, PDF, RTF, etc.) of each eprint.
> Our "Using simple Dublin Core to describe eprints" guidelines are intended
> to encourage greater consistency in the metadata that is exposed by eprint
> archives using the 'oai_dc' format within the OAI Protocol for Metadata
> Harvesting (OAI-PMH).  Somewhat perversely, because we stick rigidly to
> the semantics of the DC element set, our guidelines make determining the
> URL of each manifestation that is available quite difficult.  (This is
> largely a consequence of the 'simple' nature of 'simple DC'!).  In
> general, the URL in the <dc:identifier> element of the oai_dc record is
> the URL of a jump-off page, rather than a direct link to the full-text.
> We would like to suggest a new proposal for unambiguously embedding the
> URL for each manifestation of an eprint into the (X)HTML jump-off page for
> that eprint.  Since the jump-off page is generated automatically by the
> eprint archive software, doing this shouldn't be too difficult (in fact,
> we would hope that archive software, such as eprints.org, will be
> configured to do this out of the box).
> If this proposal is adopted, it will make it much easier to write OAI
> service provider software that can reliably gather the full-text of an
> eprint, given only the oai_dc record for that eprint.
> The proposal is at
> http://www.rdn.ac.uk/projects/eprints-uk/docs/encoding-fulltext-links/
> Comments are welcome,
> Andy
> --
> Distributed Systems, UKOLN, University of Bath, Bath, BA2 7AY, UK
> http://www.ukoln.ac.uk/ukoln/staff/a.powell/      +44 1225 383933
> Resource Discovery Network http://www.rdn.ac.uk/
> ECDL 2004, Bath, UK - 12-17 Sept 2004 - http://www.ecdl2004.org/
> _______________________________________________
> OAI-implementers mailing list
> List information, archives, preferences and to unsubscribe:
> http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers

Herbert Van de Sompel
digital library research & prototyping
Los Alamos National Laboratory - Research Library
+ 1 (505) 667 1267 / http://lib-www.lanl.gov/~herbertv/

"met gestreken jeans de dansvloer penetreren"