[OAI-implementers] Automatically gathering the full-text of eprints
herbert van de sompel
Wed, 17 Mar 2004 12:19:39 -0700
The problem of service providers needing access to content in addition to
metadata has come up in many discussions lately, including in the realm of the
DARE, DINI, JISC, DSpace, Fedora, etc. work. It so happens that my team at Los
Alamos has recently done quite some work in this realm, as illustrated by the
most recent papers listed on my personal web site.
Here is some initial feedback to the proposal. The proposal relies on:
a. The assumption that a harvester knows that something in the
dc.identifier element of oai_dc points to a - compliant - jump-off page. There
are two problems with this assumption:
- lots of things can be in the dc.identifier element, both resolvable and
non-resolvable
- lots of things at the end of the thing identified by the content of
dc.identifier (if resolvable) will not be compliant jump-off pages
This means harvesters never really know when they are facing the scenario that
you target, and hence will do a lot of meaningless dereferencing and parsing.
One could think of addressing this to some extent by a special-purpose
descriptor in the Identify response to indicate that a repository actually is
'compliant', but that would still leave the harvester guessing which of the
dc.identifiers (if there are multiple) is the magic one.
b. The actual existence of a 'jump-off' page. This is something that - in the
context of the OAI-PMH, with its disconnection of data providers and service
providers - we cannot just take for granted or assume.
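To make the ambiguity concrete, here is a minimal sketch (the oai_dc record,
identifiers and URLs are all invented for illustration) of what a harvester
actually sees: several dc.identifier values and no signal as to which one, if
any, dereferences to a compliant jump-off page.

```python
# Hypothetical oai_dc record with multiple dc:identifier values; nothing
# in the record marks which one (if any) is the jump-off page.
import xml.etree.ElementTree as ET

OAI_DC = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example eprint</dc:title>
  <dc:identifier>doi:10.9999/example.datastream.1</dc:identifier>
  <dc:identifier>doi:10.9999/example.datastream.2</dc:identifier>
  <dc:identifier>http://repository.example.org/view/1234</dc:identifier>
</oai_dc:dc>"""

DC_NS = "{http://purl.org/dc/elements/1.1/}"

def candidate_jump_off_pages(record_xml):
    """Return every dc:identifier value; the harvester has no reliable
    way to tell which, if any, points to a compliant jump-off page."""
    root = ET.fromstring(record_xml)
    return [el.text for el in root.findall(DC_NS + "identifier")]

# All three values come back; only trial dereferencing and parsing could
# reveal which one is the 'magic' one.
print(candidate_jump_off_pages(OAI_DC))
```

The harvester is left to dereference and parse each value in turn, which is
exactly the meaningless work described above.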
There are other problems related to obtaining content which are not covered by
the proposal:
* How does a harvester know when to go after an update to the content? The
OAI-PMH specifies that the datestamp of a record only changes when the metadata
has changed; it doesn't say anything about the content. I suggest it should
stay that way. So, in the proposed solution, content in a repository can change
without the harvester ever knowing about it.
* The scenario as described in the proposal, in which a single metadata record
corresponds with a single "preprint", is only a special case of - future -
reality. Increasingly, objects held in and described by repositories will be
"compound" or "complex", i.e. consisting of multiple datastreams, not just a
single "preprint". It would be desirable for a solution to get to the content
to be able to handle such situations. The proposed solution could actually
accommodate such 'compound' objects, because the multiple datastreams are
linked off the jump-off page. There is, however, a problem.
Let's presume we have a situation in which an object is deposited in an
institutional repository that has 2 datastreams, each of which actually has a
unique identifier, say a doi or something. Thinking of a - future -
self-archiving scenario and the trend to assign identifiers at finer levels of
granularity, this is not unlikely at all. Now we get 3 things in dc.identifier
(2 doi's and a link to a jump-off page), and 2 things in the jump-off page
(links to the 2 datastreams). How do I know which doi goes with which
datastream? Information that - I hope we will all agree - is rather significant.
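The loss of the doi-to-datastream association can be sketched as follows (a
toy Python illustration; all identifiers and URLs are invented for the
example):

```python
# The harvester's view in the compound-object scenario: three values from
# dc.identifier and two links scraped from the jump-off page, with no
# information to join a doi to the datastream it names.
from itertools import product

dc_identifiers = [
    "doi:10.9999/example.datastream.1",
    "doi:10.9999/example.datastream.2",
    "http://repo.example.org/view/1234",  # the jump-off page itself
]
jump_off_links = [
    "http://repo.example.org/files/1234/preprint.pdf",
    "http://repo.example.org/files/1234/dataset.zip",
]

# Nothing in either list relates a doi to a link: every pairing is
# equally plausible, so the doi-to-datastream association is lost.
dois = [i for i in dc_identifiers if i.startswith("doi:")]
possible_pairings = list(product(dois, jump_off_links))
print(len(possible_pairings))  # 2 dois x 2 links = 4 plausible pairs
```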
OK. The point I am trying to make is that the described scenario and its more
general problem domain (beyond eprints, and into the realm of objects with
multiple datastreams) may call for another approach. Our research has shown
that such an approach can remain 100% OAI-PMH-based if a complex object format
such as METS, MPEG-21 DIDL or SCORM is used. These formats can be "parallel"
OAI-PMH "metadata formats" through which harvesters can get to the content
without running into issues such as the ones mentioned above. Content can be
embedded in the XML wrappers or pointed at by them. Identifiers can be
unambiguously connected to content. If content changes, the datestamp of the
"complex" record changes.
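By contrast, a complex object format keeps that association inside the record
itself. The following is only a rough, simplified sketch of the idea: the
element names are invented stand-ins, not the actual METS or MPEG-21 DIDL
vocabularies, and the identifiers and URLs are made up.

```python
# Toy illustration of the complex-object idea: each datastream carries its
# own identifier inside the wrapper, so the doi-to-datastream mapping is
# explicit. Element names are simplified stand-ins, not real DIDL or METS.
import xml.etree.ElementTree as ET

WRAPPER = """<object identifier="oai:repo.example.org:1234">
  <datastream identifier="doi:10.9999/example.datastream.1"
              ref="http://repo.example.org/files/1234/preprint.pdf"/>
  <datastream identifier="doi:10.9999/example.datastream.2"
              ref="http://repo.example.org/files/1234/dataset.zip"/>
</object>"""

def identifier_map(wrapper_xml):
    """Map each datastream identifier to the content it names -- the
    association the jump-off-page approach cannot express."""
    root = ET.fromstring(wrapper_xml)
    return {d.get("identifier"): d.get("ref")
            for d in root.findall("datastream")}

print(identifier_map(WRAPPER))
```

A harvester processing such a record needs no dereferencing or guesswork: the
identifier-to-content binding travels with the record itself.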
I anticipate concerns about the overhead of introducing a solution based on a
complex object format. At this point, I would like to say two things in this
respect:
* It took 2 people on my team about 2 days to create a prototype plug-in that
enables OAI-PMH harvesting of content from DSpace repositories. Our plug-in
rendered content using the MPEG-21 DIDL XML wrapper format. Most of the time
invested in this plug-in was spent figuring out the DSpace API and a sensible
way to map the DSpace data model to the DIDL data model. The prototype was
demonstrated at the DSpace federation meeting last week. Although
questions/issues did arise in the course of our work, none seemed unsolvable.
It is my impression that the very fast delivery of a prototype indicates the
feasibility of the complex-format approach.
* I would personally be very willing to spend time with the appropriate
representatives of the community - including yourself - to work towards a
solution that is future-proof and provides adequate guarantees regarding
perceived requirements of a content-harvesting solution. I would actually
prefer that over going for a solution which is attractive at first glance
because of its obvious simplicity, but which seems to raise some relevant
questions upon closer inspection.
To end, I would like to thank you for bringing this topic to the list. I have
had many private email exchanges over the last few months especially with
representatives from DARE and DINI about this and related problem domains. I
hope that your mail can be another impulse towards a joint action in this realm.
The problem is very real, and I would love our community to jointly create a
really good solution to it.
Andy Powell wrote:
> The JISC-funded ePrints UK project has a requirement to automatically
> harvest both metadata and full-text from the eprint archives within UK
> academia (and potentially elsewhere). This is so that we can pass both
> metadata and full-text to the various 'enhancement' Web services offered
> by our partners.
> In order for our harvesting robot to be able to do this, it must be able
> to reliably (and automatically) determine the correct URL(s) for the
> various full-text manifestation(s) (HTML, PDF, RTF, etc.) of each eprint.
> Our "Using simple Dublin Core to describe eprints" guidelines are intended
> to encourage greater consistency in the metadata that is exposed by eprint
> archives using the 'oai_dc' format within the OAI Protocol for Metadata
> Harvesting (OAI-PMH). Somewhat perversely, because we stick rigidly to
> the semantics of the DC element set, our guidelines make determining the
> URL of each manifestation that is available quite difficult. (This is
> largely a consequence of the 'simple' nature of 'simple DC'!). In
> general, the URL in the <dc:identifier> element of the oai_dc record is
> the URL of a jump-off page, rather than a direct link to the full-text.
> We would like to suggest a new proposal for unambiguously embedding the
> URL for each manifestation of an eprint into the (X)HTML jump-off page for
> that eprint. Since the jump-off page is generated automatically by the
> eprint archive software, doing this shouldn't be too difficult (in fact,
> we would hope that archive software, such as eprints.org, will be
> configured to do this out of the box).
> If this proposal is adopted, it will make it much easier to write OAI
> service provider software that can reliably gather the full-text of an
> eprint, given only the oai_dc record for that eprint.
> The proposal is at
> Comments are welcome,
> Distributed Systems, UKOLN, University of Bath, Bath, BA2 7AY, UK
> http://www.ukoln.ac.uk/ukoln/staff/a.powell/ +44 1225 383933
> Resource Discovery Network http://www.rdn.ac.uk/
> ECDL 2004, Bath, UK - 12-17 Sept 2004 - http://www.ecdl2004.org/
Herbert Van de Sompel
digital library research & prototyping
Los Alamos National Laboratory - Research Library
+ 1 (505) 667 1267 / http://lib-www.lanl.gov/~herbertv/
"met gestreken jeans de dansvloer penetreren" ("penetrating the dance floor in ironed jeans")