[OAI-implementers] Automatically gathering the full-text of eprints

herbert van de sompel herbertv@lanl.gov
Fri, 19 Mar 2004 12:47:19 -0700

Dear Andy,

Thanks a lot for your thoughtful comments.  I provide some feedback, here, 
hoping that we can find some time to discuss all of this with representatives 
from the community in front of a much needed blackboard (whiteboard?).

First, let me say that this mail isn't at all about trying to prove you wrong.
Quite to the contrary.  This is about conveying my perception of matters,
hoping to further our joint insights in this rather complicated domain.

Second, I feel that your FRBR-related comments, while very legitimate, are on 
quite the opposite end of the scale of your pragmatic, useful hack.  I need to 
take some time to try and think about how much or how little a discussion 
related to making content accessible to service providers in the OAI-PMH 
framework should get involved with this.  At this point I am puzzled.  For 
example, I am not sure how much Google cares about the work/manifestation issue.
Google wants a FRBR Item.

Third, I must emphasize that I am very pleased to hear that - in principle - you 
find the notion of shipping modelled representations (DIDL, METS, ...) of 
resources through the OAI-PMH acceptable.  Below, I hope to give some more 
indications as to why I feel that is indeed acceptable/appropriate in the 
context of the OAI-PMH.

=> Your W3C TAG resource/representation perspective is very helpful.  Building 
on your insights, and with some stretching, one could distinguish the following 
two levels:

level 1: W3C.resource ~ FRBR.work ~ OAI-PMH.resource

level 2: W3C.representation ~ FRBR.manifestation ~ OAI-PMH.record

=> Comments I want to make at this point:

* The OAI-PMH doesn't really say that an OAI-PMH.resource sits at level 1, and I 
am not even sure it matters a lot for this discussion because the action we are 
interested in is at level 2 at which the OAI-PMH.resource clearly does not sit.

* Putting OAI-PMH.record at the level of W3C.representation makes a lot of 
sense, especially since v.2.0 of the protocol in which OAI-PMH.records have 
gained autonomy by getting their own datestamp.

* I dare to suggest that OAI-PMH.record ~ OAI-PMH.metadata ~ structured data 
pertaining to an OAI-PMH.resource.  I dare to make this statement because the 
OAI-PMH has this built-in notion that equates metadata with XML.  So I feel it is 
actually constructive for our reasoning to get rid of the term 'metadata' with 
its numerous and in many cases vague/loaded interpretations, and consider an 
OAI-PMH.record to be structured data pertaining to an OAI-PMH.resource.  That is 
a quite unambiguous definition.

* I think OAI-PMH.item doesn't matter in this discussion as it merely is a 
gateway to OAI-PMH.records.  I feel we can lose that term too for our discussion.
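
To make the "record = structured data with its own datestamp" reading concrete, here is a minimal sketch in Python.  It is only an illustration: the class and function names, the identifiers, the dates and the 'didl' metadataPrefix are all made up.  It relies on the fact that in v.2.0 a record is pinned down by identifier, metadataPrefix and datestamp, so each format of the same resource can change on its own schedule.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Record:
    """Structured data pertaining to a resource, in one format (hypothetical model)."""
    identifier: str        # OAI-PMH.identifier: a harvesting key, nothing more
    metadata_prefix: str   # which structured-data format this record uses
    datestamp: date        # each record carries its own datestamp
    payload: str           # stand-in for the XML content of the record


def changed_since(records: List[Record], metadata_prefix: str, since: date) -> List[Record]:
    """Select records of one format whose datestamp is on or after `since`."""
    return [
        r for r in records
        if r.metadata_prefix == metadata_prefix and r.datestamp >= since
    ]


# Two records for the same identifier, each with an independent datestamp.
records = [
    Record("oai:repository.example.org:98765", "oai_dc", date(2004, 1, 10), "simple DC description"),
    Record("oai:repository.example.org:98765", "didl", date(2004, 3, 15), "complex object wrapper"),
]

# Only the complex-object record changed in March; the oai_dc record did not.
recent = changed_since(records, "didl", date(2004, 3, 1))
```

Note that nothing in this sketch needs the term 'item' or 'metadata': the identifier only groups the records, and the datestamp is a property of each record.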

=> I really don't think that OAI-PMH.identifier matters in any of this.  While 
the OAI-PMH.identifier is a crucial key for harvesting, it doesn't need to have 
anything to do with any 'real' identifiers, nor with the real world data. Agreed 
that in some implementations - for practical reasons - it does, but there is no 
reason for it to.  So we should not be distracted by it.  If we really need to 
accord meaning to OAI-PMH.identifier, we could consider it to be the identifier 
shared by all W3C.representations of a W3C.resource (~OAI-PMH.resource), as it 
acts as the gateway to all OAI-PMH.records pertaining to an OAI-PMH.resource. 
Even in this interpretation, the OAI-PMH.identifier doesn't come close to 
becoming the identifier of the OAI-PMH.resource, irrespective of what the exact 
nature of the OAI-PMH.record is.
- Cf. the W3C TAG distinction between the URI for a resource and the URI for a
  representation.
- Cf. the case where the URI for the resource is a doi, while the URI for the
  representation is an OAI-PMH request that uses an unrelated OAI-PMH.identifier.
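
A small sketch of that last case, in Python.  The doi, the base URL, the OAI-PMH.identifier and the 'didl' metadataPrefix are invented for illustration; only the GetRecord verb and its identifier and metadataPrefix arguments come from the protocol.

```python
from urllib.parse import urlencode


def get_record_url(base_url: str, oai_identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": oai_identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Hypothetical values: the doi names the resource, the OAI-PMH.identifier does not.
resource_uri = "doi:10.9999/example.1234"
oai_identifier = "oai:repository.example.org:98765"
base_url = "http://repository.example.org/oai"

# One URI per representation (record), all sharing the same OAI-PMH.identifier.
representation_uris = {
    prefix: get_record_url(base_url, oai_identifier, prefix)
    for prefix in ("oai_dc", "didl")
}
```

The doi never appears in the request URLs: the OAI-PMH.identifier is just the key shared by the representations, which is all I am claiming for it.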

=> So, I think we lost quite some overhead in the above.  We are down to 
OAI-PMH.resource (rather undefined, how nice) and OAI-PMH.record (well defined, 
as being structured data pertaining to OAI-PMH.resource) to play with.  I think 
we can all go along with an interpretation that an oai_dc record is a 
W3C.representation of an OAI-PMH.resource.  I trust we would also agree this is 
the case for a special-purpose QDC record that 'models' the OAI-PMH.resource, 
and in doing so includes some links to datastreams of which that 
OAI-PMH.resource consists.  The step to a complex object solution (METS, DIDL, 
...) is really small from here, as those indeed provide such a by-reference 
technique to include datastreams, as well as by-value techniques to do so.  In 
addition, some complex object approaches actually have a data model so that the 
required 'modelling' boils down to mapping a specific world view to the existing 
data model.
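
To illustrate the by-reference and by-value techniques, here is a rough sketch in Python.  The element and attribute names below are invented for the example and are NOT the METS or DIDL vocabularies, and the identifiers and URL are made up; the only point is that each datastream travels together with its own identifier inside one structured record.

```python
import base64
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Datastream:
    identifier: str                 # e.g. a doi assigned to this datastream
    mime_type: str
    ref: Optional[str] = None       # by-reference: a URL pointing at the content
    value: Optional[bytes] = None   # by-value: the content itself


@dataclass
class ComplexObject:
    datastreams: List[Datastream] = field(default_factory=list)


def to_xml(obj: ComplexObject) -> str:
    """Serialize to a generic XML wrapper (hypothetical element names)."""
    root = ET.Element("object")
    for ds in obj.datastreams:
        el = ET.SubElement(
            root, "datastream",
            {"identifier": ds.identifier, "mimeType": ds.mime_type},
        )
        if ds.ref is not None:
            el.set("ref", ds.ref)  # content is pointed at
        if ds.value is not None:
            # content is embedded, base64-encoded so that it is XML-safe
            el.text = base64.b64encode(ds.value).decode("ascii")
    return ET.tostring(root, encoding="unicode")


obj = ComplexObject([
    Datastream("doi:10.9999/example.1234.a", "application/pdf",
               ref="http://repository.example.org/files/1234.pdf"),
    Datastream("doi:10.9999/example.1234.b", "text/plain",
               value=b"supplementary data"),
])
wrapper = to_xml(obj)
```

Because the identifier is an attribute of the datastream it belongs to, a harvester does not have to guess which identifier goes with which datastream.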

=> I very much share your opinion that directly shipping a 
datastream/representation of the resource in an 'unmodelled' manner smells 
really fishy, as it makes us lose the 'structured data pertaining to the 
resource' life buoy.

Jeez, this took me ages to write.  And now it is my turn to be proven wrong ;-)



Andy Powell wrote:

> On Wed, 17 Mar 2004, herbert van de sompel wrote:
>>a. The assumption that a harvester knows that something that is in the
>>dc.identifier element of oai_dc points to a - compliant - jump-off page.  There
>>are two problems with this assumption:
>>- lots of things can be in the dc.identifier element both resolvable and
>>- lots of things at the end of the thing identified by the content of
>>dc.identifier (if resolvable) will not be compliant jump-off pages
> Herbert,
> thanks for the email.  Yes, I completely agree with this analysis.  The
> proposal is a bit of a hack, and we should perhaps have made this clearer
> in the document!
> However, I think it is a useful hack :-).  Particularly so in the context
> of our other recommendations for using simple DC to describe eprints.
> Furthermore, I would see it as good practice to embed XHTML <link>
> elements into jump-off pages anyway - irrespective of whether the
> intention is to ease harvesting by robots or not.  So I certainly don't
> see our proposal as causing any harm.
> The rest of your email raises some quite significant issues - some of
> which I suspect are not very easy to discuss by email.  I don't propose
> giving a detailed response here, but I would like to note a few issues for
> consideration...
> Firstly, your comments about the complexity of the objects being described
> only go part-way to describing the problem.  The OAI-PMH specification,
> rightly, says very little about the nature of the resources that are
> described by the records exchanged using the protocol.  However,
> particular applications of the protocol do need to be clear about the
> nature of the resources being described.  Furthermore, the complexity of
> the problem is not just about whether the resources being described are
> aggregations of multiple objects.  Part of the complexity arises because
> those resources/objects fit into a model of the real world that
> spans both 'conceptual' works and specific digital or physical
> 'manifestations' of those conceptual works.
> Does the oai_dc record that I allow you to harvest describe a conceptual
> work (or expression of a work), an article for example, or does it
> describe one of the particular manifestations of that work, the PDF copy
> of the article for example?
> You'll note that I am intentionally using terms from the IFLA FRBR
> (Functional Requirements for Bibliographic Records) model here.
> In our guidelines for using simple DC to describe eprints we made the
> explicit decision to reflect the fact that most implementations of eprint
> archives (that we looked at) appeared to be configured to expose oai_dc
> metadata about the 'work' rather than about the particular manifestations
> of the work (though actually, in many cases (even in our own guidelines
> to a certain extent) there is a certain amount of fuzziness going on!).
> Unfortunately, there is no real way of indicating in a simple DC record
> that the work (as opposed to the manifestation) is being described - this
> would be difficult even in qualified DC currently, because the current
> DCMI Type vocabulary doesn't allow us to make those distinctions.  But, in
> principle, the DC model is rich enough to handle this complexity - if
> we are prepared to put the effort in to agree how to do it.
> But the situation is even more complex than that because it is not clear
> to me where OAI resources and records sit within the Web architectural
> model of 'resources' and 'representations'.  My suspicion is that the FRBR
> 'manifestation' is the equivalent of the Web architecture
> 'representation' of the FRBR 'work' (if you see what I mean!).  The
> oai_dc record (and indeed the jump-off page) is a 'representation' of
> the 'work' (assuming that is what is being described).  But at this point
> we almost certainly need a diagram or two! :-(
> OK, so on then to the question about whether the protocol can and/or
> should be used to exchange 'resources' as well as 'metadata' about
> 'resources'.
> The protocol spec is very explicit in differentiating 'resources' from
> 'items' and 'records' and makes it very clear that the protocol be used to
> exchange 'metadata' between services - I'm thinking of section 2.2 in
> particular.  Now, with hindsight, I really wish we'd talked instead about
> 'resources' and 'representations' rather than resources, items, records
> and metadata, because that would have given us much more flexibility about
> what we do with the protocol.  But we didn't - and therefore, I think we
> are constrained in terms of what we can do within the semantics of the
> protocol spec.
> This is not just to do with the words being used in the spec.  It has to
> do with the entities in the model used by the protocol and the identifiers
> that are assigned to those entities.  An oai-identifier, for example, is
> an identifier of an 'item', not of a 'resource' (in terms of the protocol
> usage of those words).  It seems to me that things are likely to become
> very fuzzy if the 'item' or 'record' suddenly becomes the 'resource' and
> vice versa.
> So, based on this, it seems to me that the protocol will 'break' if we
> start using it to carry the 'resource' where the protocol expects to see
> the 'record about the resource'.
> Now, your complex example of the METS package or the MPEG-21 DIDL is an
> interesting case - because those things can be used to carry both the
> metadata and the object.  Is a METS package the 'resource' or the 'record'
> in OAI terms?  The answer is that it is somewhere in-between.  I certainly
> accept that the METS package is a 'representation' of a 'resource' - but,
> as I mentioned above, unfortunately we didn't use the words 'resource' and
> 'representation' in the protocol spec.  Yes, the complex package can be
> viewed as metadata - but metadata about what - about the 'work' that the
> objects in the package 'represent', or about the particular manifestations
> contained in the package??!
> All in all, I think I'm happy with the case where OAI is used to carry the
> METS or DIDL package that contain objects - but I would be much less happy
> with a situation where the OAI-PMH is used to carry individual
> manifestations (an XHTML document for example).  But the fuzziness between
> the package and the item worries me and I'm not sure that we are going to
> be able to tell them apart very easily in all cases.
> Enough for now...  I agree with you that much more discussion and thinking
> about these issues is required.  I'm certainly happy (and indeed
> expecting) to be told I'm wrong about any or all of the above! :-)
> Regards,
> Andy.
>>* The scenario as described in the proposal, in which a single metadata record
>>corresponds with a single "preprint" is only a special case of - future -
>>reality.  Increasingly, objects held in and described by repositories will be
>>"compound" or "complex", i.e. consisting of multiple datastreams, not just a
>>single "preprint".  I find that it would be desirable that a solution to get to
>>the content would be able to handle such situations.  The proposed solution
>>could actually accommodate such 'compound' objects, because the multiple
>>datastreams are linked off the jump-off page.  There is, however, a problem.
>>Let's presume we have a situation in which an object is deposited in an
>>institutional repository that has 2 datastreams, each of which actually has a
>>unique identifier, say a doi or something.  Thinking of a - future -
>>self-archiving scenario and the trend to accord identifiers at finer levels of
>>granularity, this is not unlikely at all.  Now we get 3 things in dc.identifier
>>(2 doi's and a link to a jump-off page), and 2 things in the jump-off page
>>(links to the 2 datastreams).  How do I know which doi goes with which
>>datastream?  Information that - I hope we will all agree - is rather significant.
>>OK.  The point I am trying to make is that the described scenario and its more
>>general problem domain (beyond eprints, and into the realm of objects with
>>multiple datastreams) may call for another approach.  Our research has shown
>>that such an approach can remain 100% OAI-PMH-based if a complex object format
>>such as METS, MPEG-21 DIDL or SCORM is used.  These formats can be "parallel"
>>OAI-PMH "metadata formats" through which harvesters can get to the content
>>without running into issues such as the ones mentioned above.  Content can be
>>embedded in the XML wrappers or pointed at by them.  Identifiers can be
>>unambiguously connected to content.  If content changes, the datestamp of the
>>"complex" record changes.
>>I anticipate concerns re the overhead of introducing a solution based on a
>>complex object format.  At this point, I would like to say 2 things with this
>>* It took 2 people on my team about 2 days to create a prototype plug-in that
>>enables OAI-PMH harvesting of content from DSpace repositories.  Our plug-in
>>rendered content using the MPEG-21 DIDL XML wrapper format.  Most of the time
>>invested in this plug-in was spent figuring out the DSpace API and a sensible
>>way to map the DSpace data model to the DIDL data model.  The prototype was
>>demonstrated at the DSpace federation meeting, last week.  Although
>>questions/issues did arise in the course of our work, none seemed unsolvable.
>>But it is my impression that the very fast delivery of a prototype indicates the
>>feasibility of the complex format approach.
>>* I would personally be very willing to spend time with the appropriate
>>representatives of the community - including yourself - to work towards a
>>solution that is future-proof and provides adequate guarantees regarding
>>perceived requirements of a content-harvesting solution.  I would actually
>>prefer that over going for a solution which is attractive at first glance
>>because of its obvious simplicity, but which seems to raise some relevant
>>questions upon closer inspection.
>>To end, I would like to thank you for bringing this topic to the list.  I have
>>had many private email exchanges over the last few months especially with
>>representatives from DARE and DINI about this and related problem domains.  I
>>hope that your mail can be another impulse towards a joint action in this realm.
>>  The problem is very real, and I would love our community to jointly create a
>>really good solution to it.
>>many greetings
>>Andy Powell wrote:
>>>The JISC-funded ePrints UK project has a requirement to automatically
>>>harvest both metadata and full-text from the eprint archives within UK
>>>academia (and potentially elsewhere).  This is so that we can pass both
>>>metadata and full-text to the various 'enhancement' Web services offered
>>>by our partners.
>>>In order for our harvesting robot to be able to do this, it must be able
>>>to reliably (and automatically) determine the correct URL(s) for the
>>>various full-text manifestation(s) (HTML, PDF, RTF, etc.) of each eprint.
>>>Our "Using simple Dublin Core to describe eprints" guidelines are intended
>>>to encourage greater consistency in the metadata that is exposed by eprint
>>>archives using the 'oai_dc' format within the OAI Protocol for Metadata
>>>Harvesting (OAI-PMH).  Somewhat perversely, because we stick rigidly to
>>>the semantics of the DC element set, our guidelines make determining the
>>>URL of each manifestation that is available quite difficult.  (This is
>>>largely a consequence of the 'simple' nature of 'simple DC'!).  In
>>>general, the URL in the <dc:identifier> element of the oai_dc record is
>>>the URL of a jump-off page, rather than a direct link to the full-text.
>>>We would like to suggest a new proposal for unambiguously embedding the
>>>URL for each manifestation of an eprint into the (X)HTML jump-off page for
>>>that eprint.  Since the jump-off page is generated automatically by the
>>>eprint archive software, doing this shouldn't be too difficult (in fact,
>>>we would hope that archive software, such as eprints.org, will be
>>>configured to do this out of the box).
>>>If this proposal is adopted, it will make it much easier to write OAI
>>>service provider software that can reliably gather the full-text of an
>>>eprint, given only the oai_dc record for that eprint.
>>>The proposal is at
>>>Comments are welcome,
>>>Distributed Systems, UKOLN, University of Bath, Bath, BA2 7AY, UK
>>>http://www.ukoln.ac.uk/ukoln/staff/a.powell/      +44 1225 383933
>>>Resource Discovery Network http://www.rdn.ac.uk/
>>>ECDL 2004, Bath, UK - 12-17 Sept 2004 - http://www.ecdl2004.org/
>>>OAI-implementers mailing list
>>>List information, archives, preferences and to unsubscribe:
>>Herbert Van de Sompel
>>digital library research & prototyping
>>Los Alamos National Laboratory - Research Library
>>+ 1 (505) 667 1267 / http://lib-www.lanl.gov/~herbertv/
>>"met gestreken jeans de dansvloer penetreren"
> Andy
> --
> _______________________________________________
> OAI-implementers mailing list
> List information, archives, preferences and to unsubscribe:
> http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers

Herbert Van de Sompel
digital library research & prototyping
Los Alamos National Laboratory - Research Library
+ 1 (505) 667 1267 / http://lib-www.lanl.gov/~herbertv/

"met gestreken jeans de dansvloer penetreren"