[OAI-implementers] on resource harvesting & datestamps

Michael Nelson mln at cs.odu.edu
Sat Mar 5 11:12:20 EST 2005


> A couple of questions now:
>
> 1) how do complex object formats may help with the problem of redundant
> harvesting?

It depends on how your CO is constructed.  For example, if your CO
provides the following datastreams by-value:

        MARC + TIFFs + OCR + PDF

and you change your MARC -> DC mapping, then the changes will be
limited to oai_dc records and not oai_didl, saving you the penalty
of downloading the large CO.  However, if you change your MARC in
the above example, you will download the entire CO again.  Also,
if you edit any of the TIFFs _or_ your OCR _or_ your PDF you will
download the entire CO again.  The reason is that the OAI-PMH only
knows a single datestamp, and that is the datetime of creation /
modification of the digital object as a whole.  As soon as one of
the constituent datastreams (MARC, TIFF, OCR, PDF) changes, the
datestamp of the object changes.
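
As a rough illustration of why any edit forces a full re-download,
here is a minimal Python sketch.  It assumes the object-level
datestamp is simply the latest modification date among the
constituent datastreams; the function names and the dates are made
up for the example and are not part of the OAI-PMH.

```python
from datetime import date


def object_datestamp(datastreams):
    """Return the single datestamp a harvester would see for the whole CO.

    Assumption: the repository sets it to the latest modification date
    among the constituent datastreams.
    """
    return max(datastreams.values())


def needs_reharvest(datastreams, from_date):
    """True if a selective harvest with this 'from' date returns the CO.

    The OAI-PMH 'from' argument is an inclusive lower bound.
    """
    return object_datestamp(datastreams) >= from_date


# Hypothetical by-value CO in which nothing has changed since 2003.
co = {
    "MARC": date(2003, 7, 23),
    "TIFFs": date(2003, 7, 23),
    "OCR": date(2003, 7, 23),
    "PDF": date(2003, 7, 23),
}
unchanged = needs_reharvest(co, date(2005, 1, 1))

# Editing only the OCR moves the datestamp of the object as a whole,
# so the entire CO (TIFFs and PDF included) is harvested again.
co["OCR"] = date(2005, 1, 25)
after_ocr_edit = needs_reharvest(co, date(2005, 1, 1))
```

With these dates, the object is skipped before the OCR edit and
harvested in full after it.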

Having said that, one can imagine optimizations that would combine
OAI-PMH semantics and semantics _outside_ of the OAI-PMH.  For
example, imagine that:

(*) the CO that represents our digital object provides all its
datastreams by-reference (instead of by-value)

(*) we insert metadata into the CO that expresses the datetime of
creation/modification of each constituent datastream.

Both can be achieved using a decent CO format.  In this scenario,
the OAI-PMH datestamp of the CO triggers reharvest of a very
lightweight CO every time a constituent datastream changes.
Introspection of the harvested CO would reveal the creation /
modification datetime of each constituent datastream, and based
upon this, a decision can be made whether or not to collect the
datastreams that were provided by-reference.  This would be the
scenario:

1.  We do verb=ListRecords&metadataPrefix=oai_didl&from=2005-01-01

2.  Record1 is the only match

3.  We examine Record1 to find:

Record1 (all datastreams are by-ref):
         - MARC (2005-01-30)
         - OCR (2005-01-25)
         - TIFF1 .. TIFFN (2003-07-23)
         - PDF (2003-07-23)

4.  We can now decide to dereference only the MARC and OCR.
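
The four steps above can be sketched in Python as follows.  This is
only an illustration: the base URL is a placeholder, the function
names are invented, and Record1 is written as a plain dictionary
rather than parsed out of a real oai_didl record.

```python
from datetime import date
from urllib.parse import urlencode

# Placeholder base URL, not a real repository.
BASE_URL = "http://repository.example.org/oai"


def list_records_url(base_url, metadata_prefix, from_date):
    """Build the selective-harvest request from step 1."""
    query = urlencode({
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": from_date.isoformat(),
    })
    return f"{base_url}?{query}"


def datastreams_to_dereference(datastreams, from_date):
    """Step 4: keep only the by-reference datastreams changed since from_date."""
    return [name for name, modified in datastreams.items()
            if modified >= from_date]


# Step 3: per-datastream datetimes found by introspecting the lightweight CO.
record1 = {
    "MARC": date(2005, 1, 30),
    "OCR": date(2005, 1, 25),
    "TIFF1..TIFFN": date(2003, 7, 23),
    "PDF": date(2003, 7, 23),
}

harvest_from = date(2005, 1, 1)
request_url = list_records_url(BASE_URL, "oai_didl", harvest_from)
to_fetch = datastreams_to_dereference(record1, harvest_from)
```

Here to_fetch contains only MARC and OCR, so the TIFFs and the PDF
are never downloaded.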

> 2) more importantly, is the propagation of change from resources to metadata
> really dependent on the exchange format? Couldn't a provider use DC and yet
> enforce a strong versioning policy which translate changes to resource in
> new items (and thus records)? Even when (minor) changes are allowed to
> preserve the identity of the resources, and thus no versioning takes place,
> could not a provider reflect those changes in the datestamp of the
> associated metadata records?

There are three problems:

1.  This introduces new, potentially confusing semantics for DC
(and MARC, etc.) and OAI-PMH.  If a "regular" harvester downloads
a DC record that says it has changed on 2004-12-23, and it is
bit-equivalent with the same record that last had a datestamp of
2003-02-21 -- what would that mean exactly?

2.  Related to the above, you would have to convey to harvesters
that you "do" or "do not" adhere to a "DC update means resource update"
policy.  This could be done with a <description> container in an
Identify response, but this would likely be community-specific and
hard to generalize.

3.  You still need to convey how to "grab all the stuff".  In the
CO example above, the value in DC.Identifier is probably just a
"splash page"; how to grab the MARC, OCR, TIFFs, PDF etc. is not
easily conveyed in a generalizable manner.
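
To make problem 1 concrete, here is a small Python sketch of what a
"regular" harvester could observe.  The function name, the return
strings and the sample record bytes are hypothetical; the point is
only that the harvester can detect the situation but cannot tell
what it means.

```python
import hashlib
from datetime import date


def classify_update(old, new):
    """Compare two harvested copies of a record.

    Each copy is a (datestamp, metadata_bytes) pair.
    """
    old_stamp, old_bytes = old
    new_stamp, new_bytes = new
    if new_stamp <= old_stamp:
        return "not newer"
    # Compare digests, as a harvester that stores only checksums might.
    if hashlib.sha256(old_bytes).digest() == hashlib.sha256(new_bytes).digest():
        return "datestamp changed, metadata identical"
    return "metadata changed"


# Made-up DC payload, byte-for-byte the same in both harvests.
dc = b"<oai_dc:dc><dc:title>Example</dc:title></oai_dc:dc>"
result = classify_update((date(2003, 2, 21), dc), (date(2004, 12, 23), dc))
```

Nothing in the DC record itself says whether the resource changed or
the repository merely touched the record.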

In summary, choosing a sufficiently rich CO format will prevent
the above problems and ambiguities.

regards,

Michael, Herbert, Simeon & Carl

----
Michael L. Nelson mln at cs.odu.edu http://www.cs.odu.edu/~mln/
Dept of Computer Science, Old Dominion University, Norfolk VA 23529
+1 757 683 6393 +1 757 683 4900 (f)


