[OAI-implementers] (no subject)
thb-oai-implementers at lists.gymel.com
Thu May 26 08:22:05 EDT 2011
> > Another issue that came up recently has to do with incremental harvesting. The
> > harvester guidelines mention that for the value of the from parameter, the
> > `responseDate` should be used, and that it is advisable to overlap by a small
> > additional amount.
> > I think it would be better if a harvester does not use the responseDate, but
> > instead uses the latest datestamp of all harvested records.
> > Consider the following situation:
> > Someone modifies a document in a database at 4 o'clock.
> > An external OAI service gets updated once an hour, so it will have the changes
> > at 5 o'clock. The OAI software will use the modification dates from the
> > database, so at 5 o'clock the modified record is added with a datestamp of 4
> > o'clock.
Which in turn evokes the fatal consequences you describe.
Whenever you have an intermediate service as base for the repository,
you have two choices:
- Keep copies and, on each update, set the datestamp of every changed
record to the time of the update (or, if you receive a copy of
everything, reset the datestamps of all records newer than the last
update).
- Keep a *complete* list of all the individual times at which an update
of the service has taken place, and adjust all internal queries for
time intervals and all datestamps in OAI headers to the correct (upper
or lower) interval boundary from this list.
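The second choice could be sketched roughly as follows. This is only an
illustration under assumptions of mine: the update log UPDATE_TIMES and
the helpers exposed_datestamp and db_lower_bound are hypothetical names,
and I assume that an update at time t picks up every change made up to
and including t.

```python
from bisect import bisect_left
from datetime import datetime

# Hypothetical log of every moment the OAI service was refreshed (sorted).
UPDATE_TIMES = [
    datetime(2011, 5, 26, 3, 0),
    datetime(2011, 5, 26, 4, 0),
    datetime(2011, 5, 26, 5, 0),
]

def exposed_datestamp(modified, update_times=UPDATE_TIMES):
    """Upper interval boundary: the first update at or after `modified`.

    Returns None while the change has not reached the OAI service yet.
    """
    i = bisect_left(update_times, modified)
    return update_times[i] if i < len(update_times) else None

def db_lower_bound(from_, update_times=UPDATE_TIMES):
    """Lower interval boundary for an incoming `from` argument.

    Records modified strictly after the returned time are exactly those
    whose exposed datestamp is >= from_. None means "no lower bound".
    """
    i = bisect_left(update_times, from_)
    return update_times[i - 1] if i > 0 else None
```

With this log, a record modified at 04:10 would be exposed with a
datestamp of 05:00, and a request with from=05:00 would be answered by
asking the database for everything modified after 04:00.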
Personally I'm involved with repositories for which, unfortunately,
neither of these strategies is feasible: the OAI service does not have
a database of its own, and the database it uses is updated infrequently,
by prompting administrators to upload a "production version" of the
database onto the web-exposed host. And none of the people involved
takes the trouble to record the timestamp of this action in a config
file.
But better-kept repositories, too, sometimes have trouble with their
indexing and erroneously deliver no records for intervals in which
there actually had been some changes, which gives rise to the need to
re-harvest. Unfortunately the protocol specification does not include
any means of communicating such re-harvesting instructions. The known
harvesters are then alerted via a mailing list, but of course OAI-PMH
is mostly about giving access without the need for prior
"registration"...
The strategy you describe fits the above scenarios very well and
can be implemented in harvesters very cheaply: for incremental
harvesting the timestamp of the "last successful harvest" has to be
stored anyway, and whether one notes the first (of several delivered)
responseDates or the maximum of all delivered datestamps does not make
much of a difference.
Semantically it would constitute the "evidenced last known state"
of the repository...
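On the harvester side this might look like the sketch below. The
function name next_from and its parameters are made up for illustration,
and I assume the repository uses seconds granularity
(YYYY-MM-DDThh:mm:ssZ); a day-granularity repository would need a
different format string.

```python
from datetime import datetime, timedelta

OAI_UTC = "%Y-%m-%dT%H:%M:%SZ"  # assumed: seconds-granularity datestamps

def next_from(record_datestamps, previous_from=None, overlap=timedelta(0)):
    """Choose the `from` value for the next incremental harvest.

    record_datestamps: header datestamps of all records just harvested.
    previous_from: the value used last time, kept if nothing was delivered.
    overlap: optional extra safety margin to subtract.
    """
    if not record_datestamps:
        # No records delivered: there is no new evidence, so keep the old value.
        return previous_from
    latest = max(datetime.strptime(d, OAI_UTC) for d in record_datestamps)
    return (latest - overlap).strftime(OAI_UTC)
```

Since `from` is inclusive, the records carrying the maximum datestamp
will be delivered again on the next run, so the harvester has to
tolerate duplicates.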