[OAI-implementers] on resource harvesting & datestamps

Mon Mar 7 06:25:41 EST 2005

thanks for the exhaustive reply.
here are some further comments.

> It depends on how your CO is constructed.  For example, if your CO
> provides the following datastreams by-value:
>        MARC + TIFFs + OCR + PDF
> and you change your MARC -> DC mapping, then the changes will be limited
> to oai_dc records and not oai_didl, saving you the penalty of downloading
> the large CO.

I am afraid I don't follow here. if DC is not in the CO example, how does
the change to the MARC->DC mapping relate to our discussion on CO support
for avoiding redundant harvesting? the assumption for that discussion is
that there is some metadata-only change to the CO, right?

> However, if you change your MARC in the above example, you will download
> the entire CO again.  Also, if you edit any of TIFFs _or_ your OCR _or_ your
> PDF you will download the entire CO again. The reason is that the OAI-PMH
> only knows a single datestamp, and that is the datetime of creation /
> modification of the digital object as a whole.  As soon as one of the
> constituent datastreams (MARC, TIFF, OCR, PDF) changes, the datestamp of the
> object changes.

indeed. that's exactly why I had the impression COs could well amplify the
problem of redundancy rather than solving it: changes to the 'metadata
parts' (MARC above) would force redundant harvesting of all 'content parts'
which are included by-value, whether the DIDL harvester planned to harvest
those resources or not.
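
To make the amplification concrete, here is a minimal sketch. The datastream names come from the example above; the sizes, the dates and the function names are assumptions of mine for illustration, not part of OAI-PMH or of any toolkit. The record exposes a single datestamp (the latest change to any part), so a selective harvest after a MARC-only edit transfers every by-value part.

```python
from datetime import date

# Hypothetical by-value CO: per-part modification dates and sizes in bytes
# (all values are illustrative assumptions).
co = {
    "MARC": {"modified": date(2004, 12, 23), "size": 4_000},
    "TIFF": {"modified": date(2003, 2, 21), "size": 250_000_000},
    "OCR": {"modified": date(2003, 2, 21), "size": 600_000},
    "PDF": {"modified": date(2003, 2, 21), "size": 12_000_000},
}


def record_datestamp(co):
    """Single datestamp for the record: the latest change to any part."""
    return max(part["modified"] for part in co.values())


def bytes_transferred(co, from_date):
    """A by-value CO is all-or-nothing: if selected, every part is re-sent."""
    if record_datestamp(co) >= from_date:
        return sum(part["size"] for part in co.values())
    return 0


def bytes_changed(co, from_date):
    """Size of the parts that actually changed since from_date."""
    return sum(p["size"] for p in co.values() if p["modified"] >= from_date)


from_date = date(2004, 1, 1)  # assumed date of the previous harvest
print(bytes_transferred(co, from_date), bytes_changed(co, from_date))
```

With these made-up numbers, only the MARC part changed, yet the whole object is transferred again.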

> Having said that, one can imagine optimizations that would combine OAI-PMH
> semantics and semantics _outside_ of the OAI-PMH.  For example, imagine
> that:
>
> (*) the CO that represents our digital object provides all its datastreams
> by-reference (instead of by-value)
> (*) we insert metadata into the CO that expresses the datetime of
> creation/modification of each constituent datastream.
>
> Both can be achieved using a decent CO format....

sure. but what does that imply? seems to me we have eliminated what
distinguishes COs from 'simple' metadata records (the included content,
perhaps dynamically generated, perhaps not Web-accessible) and we've added
conventions/extensions to the use of CO formats at both ends of the data
exchange. this is certainly an option, but had we not started with wanting
to avoid conventional/extended use of formats in the first place? put
another way, why not add datestamps to dc.format and have harvesters inspect
those instead? wouldn't it in fact be a simpler proposition to present to
those harvesters which have an interest in resource harvesting and are
currently DC-based?

In any case, note that the 'redundant harvesting' issue requires a
commitment from DPs to report resource changes, and that this commitment is
not only outside the scope of OAI-PMH, but also outside the scope of CO
formats as such. Multiple datestamps have to be maintained and their changes
have to be reported, and this follows from policy, not from, say, DIDL.
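
A minimal sketch of what that commitment buys a harvester, whichever format carries the per-resource datestamps. The function name, the URLs and the dictionary layout are hypothetical; I assume only that the DP advertises one ISO 8601 date per referenced resource and that the harvester remembers the dates it last saw.

```python
def resources_to_fetch(advertised, harvested):
    """Return the URLs whose advertised datestamp is newer than the one
    recorded locally, plus any URL not seen before.

    Both arguments map resource URL -> ISO 8601 date string. Strings of the
    same granularity (YYYY-MM-DD) compare correctly as plain text.
    """
    return sorted(
        url
        for url, stamp in advertised.items()
        if url not in harvested or stamp > harvested[url]
    )


# Datestamps as reported by the DP (hypothetical values).
advertised = {
    "http://example.org/obj1/marc.xml": "2004-12-23",
    "http://example.org/obj1/page.tiff": "2003-02-21",
    "http://example.org/obj1/text.pdf": "2003-02-21",
}

# What the harvester recorded at its previous harvest.
harvested = {
    "http://example.org/obj1/marc.xml": "2003-02-21",
    "http://example.org/obj1/page.tiff": "2003-02-21",
    "http://example.org/obj1/text.pdf": "2003-02-21",
}

print(resources_to_fetch(advertised, harvested))
```

Nothing in the comparison depends on DIDL or on DC. It depends on the DP keeping the per-resource dates accurate, which is the policy point.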

>> 2) more importantly, is the propagation of change from resources to 
>> metadata really dependent on the exchange format? Couldn't a provider 
>> use DC and yet enforce a strong versioning policy which translates 
>> changes to resources into new items (and thus records)? Even when (minor) 
>> changes are allowed to preserve the identity of the resources, and 
>> thus no versioning takes place, could not a provider reflect those 
>> changes in the datestamp of the associated metadata records?

> There are three problems:
>
> 1.  This introduces new, potentially confusing semantics for DC (and MARC,
> etc.) and OAI-PMH.  If a "regular" harvester downloads a DC record that says
> it has changed on 2004-12-23, and it is bit equivalent with the same
> record that last had a datestamp of 2003-02-21 -- what would that mean
> exactly?
>
> 2.  Related to the above, you would have to convey to harvesters that you
> "do" or "do not" adhere to a "DC-update means resource updates"
> policy.  This could be done with a <description> container in an Identify
> response, but this would likely be community-specific and hard to
> generalize.

sorry, I did not make the context sufficiently clear. I was trying to point
out that (i) there is little in the definitional 'complexity' of CO formats
(i.e. the fact that they include content representations) which directly
solves the datestamp problems (redundant and incomplete harvesting), and
that (ii) a solution relies on DP policy and on a format which might expose
that policy to interested harvesters. 
As to the format, the comparison is thus between an ad-hoc DC extension and
a CO format, say DIDL. Whichever you go for, questions 1) and 2) admit the
same answers. For example, plain-DC harvesters should see 'cloned' records
no more than they should see DIDL records, and no more than plain-DIDL
harvesters (those that use DIDL but don't care for resource harvesting)
should see 'cloned' DIDL objects which *refer* to an updated PDF without
including it. They will not be asking for them and they will not be given any.
In both cases, we have got two communities of adoption, DC and DIDL, and a
set of conventions/extensions proposed to support a particular usage:
resource harvesting. To the members within the community which do not
follow/interpret those conventions, the 'clone' record cannot be explained
(not that it would be noticed, or harmful at that); to those who do,
however, it means what they agreed it should mean, that resources referred
to by the data have changed and should be re-harvested. 
If we instead assume that DIDL harvesters are by definition aware of the
conventions specific to resource harvesting via the OAI-PMH, wouldn't we
also deny that DIDL is a 'strong standard' and thus lose the obvious reason
to prefer it over some newly agreed-upon DC extension?

> 3.  You still need to convey how to "grab all the stuff".  In the CO
> example above, the value in DC.Identifier is probably just a "splash page";
> how to grab the MARC, OCR, TIFFs, PDF etc. is not easily conveyed in a
> generalizable manner.

sure. my issues were with the datestamp part of the argument, i.e. the
'locate resource change' part. again, my doubt was on the inherent virtues
of COs to solve datestamp-related problems of resource harvesting.

as to the 'locate resource' part of the argument, I agree that it would be
a good thing if DPs converged towards a potentially strong standard as they
show willingness to cooperate on the particular requirements of resource
harvesting. In this sense, it would be interesting to gauge the cost of
adopting a CO, say DIDL, within the infrastructure of the existing OAI
community (both server-side and client-side costs, of course) and then to
compare that cost with that of an agreed-upon DC extension or, for that
matter, with that of an ad-hoc protocol extension blessed by the OAI for the
purpose. For example, one might argue that a protocol extension would talk
to the same interested parties a CO format or DC extension would. It would
work smoothly with existing exchange formats, addressing the redundant and
incomplete harvesting problems directly and orthogonally to those formats.
Further, it would not require that resources be Web-accessible to be
harvested.

your paper, thoroughly enjoyable and insightful otherwise, does not seem to
address these cost issues and comparisons directly. Initially, I got the
impression COs would solve the problems of resource harvesting by virtue of
their definitional properties. it now seems to me the argument is anything
but technical, for it has to do with growing a community around a strong
standard which *may* nonetheless prove relatively cheap to adopt with respect
to the alternatives. but does it?

regards,

fabio

##############################################
Fabio Simeoni 
Research Fellow
Department of Computer & Information Sciences
University of Strathclyde, Glasgow

TEL: +44 141 548 (3590)
FAX: +44 141 548 (4523)