[OAI-implementers] harvesting strategy

Simeon Warner simeon@lanl.gov
Fri, 25 May 2001 14:14:15 -0600 (MDT)


On Wed, 23 May 2001, Hussein Suleman wrote:
> [chopped section on resumptions and data-provider tables]
> 
> the alternative is to consider how service providers work (at least this
> is how we thought it through when building our first experimental
> harvester):

I've spent a little time thinking about harvesters so I thought I'd
comment on Hussein's notes.

> a) since you can always add records at any time during the day and the
> granularity of harvesting is a day, you cannot trust data you got on the
> same day.

agreed.

> b) since dates are local to different timezones, if the data provider is
> west of the service provider, asking for everything up until yesterday
> is not "interoperationally stable" since it could still be yesterday at
> the data provider.

agreed.

> now there are multiple solutions to this and we tried implementing some:
> a) dont get anything newer than 2 days old
> b) always ask for a 2 day overlap ending on the current date
> c) use a 1-day overlap and operate in the timezone of the data provider
> (extract an initial responseDate from the data provider and then
> increment locally)

I think something like option c) is best. As Hussein said, even when
working in the local timezone of the data-provider, one needs to harvest
records that changed on the same day as the last harvest was performed.
I suggest using the YYYY-MM-DD part of the responseDate from the
first reply to the last harvest's ListRecords/ListIdentifiers request
as the new 'from' date. I say 'first reply' to cope with ill-defined
behaviour if a set of partial responses spans a day boundary, and I
note that the responseDate must be in the local timezone of the
data-provider, with the offset from UTC appended (1.0 spec, sec. 3.2).
The nice feature of this strategy is that it doesn't require the
harvester to know what the time is, and is insensitive to errors in
the repository time provided the datestamps and responseDates are
consistent.
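
A minimal sketch of this strategy in Python, for illustration only. The
function names and the choice of the oai_dc metadataPrefix are
assumptions, not part of any existing harvester. The responseDate
is assumed to look like '2001-05-25T14:14:15-06:00'.

```python
from datetime import datetime
from urllib.parse import urlencode


def next_from_date(first_response_date: str) -> str:
    """Return the YYYY-MM-DD part of a responseDate.

    The value passed in should be the responseDate of the FIRST reply
    to the previous harvest's ListRecords/ListIdentifiers request, so
    that a run of partial responses crossing midnight does not matter.
    """
    date_part = first_response_date.strip()[:10]
    # Raises ValueError if the prefix is not a valid calendar date.
    datetime.strptime(date_part, "%Y-%m-%d")
    return date_part


def list_records_query(from_date: str, metadata_prefix: str = "oai_dc") -> str:
    """Build the query string for the next incremental harvest.

    No 'until' argument is sent and the local clock is never consulted;
    the only date used is the one reported by the data-provider.
    """
    return urlencode({
        "verb": "ListRecords",
        "from": from_date,
        "metadataPrefix": metadata_prefix,
    })


# Usage: responseDate saved from the first reply of the last harvest.
saved = "2001-05-25T14:14:15-06:00"
query = list_records_query(next_from_date(saved))
```

Because the new 'from' date is the day of the previous harvest, records
from that day are fetched again; the harvester has to treat a repeated
identifier as an update rather than as a new record.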

Does anyone else have comments or different strategies?

Another thing I am thinking about is when the operator of a harvester
should be alerted to possible problems/changes requiring manual
intervention. So far I have come up with:
1) too many failures to reach site
2) unexpected HTTP replies
3) too many sequential redirect or retryAfter replies
4) change in Identify reply (other than responseDate)
Comments?
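
The four conditions above could be tracked roughly as follows. This is
only a sketch: the class name, the thresholds, and the choice of which
HTTP status codes count as "expected" (200, 301/302, 503) are all
assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class HarvestMonitor:
    max_failures: int = 5      # assumed threshold for condition 1
    max_deferrals: int = 10    # assumed threshold for condition 3
    failures: int = 0
    deferrals: int = 0
    last_identify: Optional[Dict[str, str]] = None

    def connection_failed(self) -> Optional[str]:
        """Condition 1: the site could not be reached."""
        self.failures += 1
        if self.failures > self.max_failures:
            return "too many failures to reach site"
        return None

    def http_reply(self, status: int) -> Optional[str]:
        """Conditions 2 and 3: classify an HTTP status code."""
        self.failures = 0  # any reply means the site was reached
        if status == 200:
            self.deferrals = 0
            return None
        if status in (301, 302, 503):  # redirect or retry-after style reply
            self.deferrals += 1
            if self.deferrals > self.max_deferrals:
                return "too many sequential redirect or retryAfter replies"
            return None
        return f"unexpected HTTP reply: {status}"

    def identify_reply(self, fields: Dict[str, str]) -> Optional[str]:
        """Condition 4: compare with the previous reply, ignoring responseDate."""
        current = {k: v for k, v in fields.items() if k != "responseDate"}
        previous, self.last_identify = self.last_identify, current
        if previous is not None and previous != current:
            return "change in Identify reply"
        return None
```

Each method returns a message when the operator should be alerted and
None otherwise, so the harvester itself decides how to deliver it.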

Cheers,
Simeon


> 
> as far as we can figure, any service provider that wants to avoid
> missing data entries has to do something like this ... since new data is
> not "stable" for harvesting it is not trusted and/or not harvested
> immediately and your problem of database updates pretty much disappears
> as long as harvesting is by date (which i trust it almost always is)
> 
> ok, i know this is probably way too much detail for this question :) but
> i just wanted to share these thoughts to see if they aligned with the
> harvesting approaches used by other people building service provider
> interfaces ...
> 
> any further comments will be appreciated ...
> 
> ttfn
> ----hussein
> 
> -- 
> ========================================================================
> hussein suleman -- hussein@vt.edu -- vtcs -- http://purl.org/net/hussein
> ========================================================================
>