[OAI-implementers] new records in combination with a resumptionToken

Xiaoming Liu liu_x@cs.odu.edu
Fri, 25 May 2001 16:16:10 -0400 (EDT)


Hi, Hussein and Jozef,

First, in the arc service provider, we actually used the method you
mentioned below.

> b) always ask for a 2 day overlap ending on the current date

Except that the "2 day" is customizable.
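
To illustrate, here is a minimal Python sketch of an overlapping date
window with a configurable overlap. It is not the actual Arc code; the
function name harvest_window and its parameters are hypothetical, and it
assumes the window starts overlap_days before the last harvest date and
ends on the current date, with day granularity.

```python
from datetime import date, timedelta


def harvest_window(last_harvest: date, today: date, overlap_days: int = 2):
    """Return (from, until) as YYYY-MM-DD strings for a date-based harvest.

    Assumption: the window is pushed back overlap_days before the last
    harvest so records added late on earlier days are requested again.
    """
    if overlap_days < 0:
        raise ValueError("overlap_days must not be negative")
    start = last_harvest - timedelta(days=overlap_days)
    return start.isoformat(), today.isoformat()
```

For example, harvest_window(date(2001, 5, 23), date(2001, 5, 25)) would
give the pair ("2001-05-21", "2001-05-25") to use as the from and until
arguments of the request.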

Secondly, regarding Jozef's question, I think duplicate records also
occur in other OAI harvesting scenarios:
(a) The same record belongs to different sets, and the harvester harvests
by set.
(b) A record has changed since the last harvest, so its datestamp is
changed but its ID is intact.

A harvester has to deal with duplicate records anyway. It could simply
update its local copy, or check the datestamp first and update only if
necessary, depending on the overhead of reindexing.
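
A rough sketch of the "check datestamp first" variant, in Python. The
names merge_record and store are hypothetical, and the sketch assumes
records are keyed by identifier and that datestamps are YYYY-MM-DD
strings, so plain string comparison orders them correctly.

```python
def merge_record(store: dict, identifier: str, datestamp: str, metadata: dict) -> bool:
    """Insert or update a harvested record in a local store.

    Returns True if the store changed (so a reindex is needed) and
    False if the incoming record is a duplicate or older copy.
    """
    existing = store.get(identifier)
    # Same ID already held with an equal or newer datestamp: skip it.
    if existing is not None and existing["datestamp"] >= datestamp:
        return False
    store[identifier] = {"datestamp": datestamp, "metadata": metadata}
    return True
```

If reindexing is cheap, the datestamp check can be dropped and the local
copy simply overwritten every time.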

So I believe it's not necessary to explicitly avoid the scenario in your
application.


regards,
liu


 




On Wed, 23 May 2001, Hussein Suleman wrote:

> hi
> 
> this is an interesting problem so im going to share some of our
> discussions here at virginia tech that are relevant to this problem ...
> 
> of course there is no general solution since i think the OAI quite
> deftly avoided handling too much complication in the protocol ... that
> said, there are two very interesting "solutions", one of which is
> probably relevant to you:
> 
> firstly, i recall a while back someone (cant remember who) related how
> they implemented the protocol by making a temporary table to support
> resumptions ... this would probably solve your problem but would require
> a bit more work ...
> 
> the alternative is to consider how service providers work (at least this
> is how we thought it through when building our first experimental
> harvester):
> 
> a) since you can always add records at any time during the day and the
> granularity of harvesting is a day, you cannot trust data you got on the
> same day.
> 
> b) since dates are local to different timezones, if the data provider is
> west of the service provider, asking for everything up until yesterday
> is not "interoperationally stable" since it could still be yesterday at
> the data provider.
> 
> now there are multiple solutions to this and we tried implementing some:
> a) dont get anything newer than 2 days old
> b) always ask for a 2 day overlap ending on the current date
> c) use a 1-day overlap and operate in the timezone of the data provider
> (extract an initial responseDate from the data provider and then
> increment locally)
> 
> as far as we can figure, any service provider that wants to avoid
> missing data entries has to do something like this ... since new data is
> not "stable" for harvesting it is not trusted and/or not harvested
> immediately and your problem of database updates pretty much disappears
> as long as harvesting is by date (which i trust it almost always is)
> 
> ok, i know this is probably way too much detail for this question :) but
> i just wanted to share these thoughts to see if they aligned with the
> harvesting approaches used by other people building service provider
> interfaces ...
> 
> any further comments will be appreciated ...
> 
> ttfn
> ----hussein
> 
> -- 
> ========================================================================
> hussein suleman -- hussein@vt.edu -- vtcs -- http://purl.org/net/hussein
> =========================================