[OAI-implementers] new records in combination with a resumptionToken

Young,Jeff jyoung@oclc.org
Tue, 29 May 2001 09:54:19 -0400


I suppose I should admit to the solution I took with our OAIHarvester and
OAICat software. Each time the OAIHarvester is run, it naturally uses the
'until' date from its last run as the 'from' date on the new run. If the OAI
server takes these dates literally, it's likely that duplicate records will
be served, as has been discussed. To get around this, I decided to have
OAICat subtract one day from the 'until' date. The upside is that I don't
serve any duplicate records. The downside, of course, is that the server is
up to a day behind on the repository's contents. I'm probably being too
strict about duplicate records. Is there a best-practice suggestion on this?
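
A minimal sketch of the date shift described above, assuming day-granularity
datestamps and an inclusive 'from'. The function name `served_days` and the
sample dates are hypothetical and are not taken from OAICat itself; the sketch
only illustrates why consecutive harvests stop overlapping once the server
moves 'until' back by one day.

```python
from datetime import date, timedelta
from typing import List


def served_days(from_date: date, until_date: date) -> List[date]:
    """List the days whose records are served when 'until' is moved back one day."""
    effective_until = until_date - timedelta(days=1)  # the server-side shift
    days = []
    current = from_date
    while current <= effective_until:
        days.append(current)
        current += timedelta(days=1)
    return days


# The harvester reuses the previous 'until' as the next 'from'.
first_run = served_days(date(2001, 5, 27), date(2001, 5, 28))
second_run = served_days(date(2001, 5, 28), date(2001, 5, 29))

# No day is served twice, but the day named in 'until' is held back
# until the following run.
overlap = set(first_run) & set(second_run)
```

The shift effectively makes 'until' exclusive, which is where the
up-to-a-day lag comes from.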

Liu, I'd also note that a record could be changed twice in one day: once
before you harvest and once afterward. You mention that the harvester could
discard duplicate records by comparing the datestamps, but that won't work in
this case, since both versions carry the same day-granularity datestamp.
Instead, you would have to compare the entire record to ensure all changes
are accounted for. Simply updating the local copy regardless of duplication,
as you note, is perhaps best.
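
To illustrate the point, here is a rough sketch that contrasts a
datestamp-only check with an unconditional update. The `Record` type, the
in-memory `store` dictionary, and both function names are assumptions made up
for this example; a real harvester would write to its own database.

```python
from typing import Dict, NamedTuple


class Record(NamedTuple):
    identifier: str
    datestamp: str  # day granularity, e.g. "2001-05-29"
    metadata: str


def is_newer_by_datestamp(store: Dict[str, Record], incoming: Record) -> bool:
    """Datestamp-only check: misses a second change made on the same day."""
    previous = store.get(incoming.identifier)
    return previous is None or incoming.datestamp > previous.datestamp


def upsert(store: Dict[str, Record], incoming: Record) -> bool:
    """Always overwrite the local copy; return True if anything actually changed."""
    previous = store.get(incoming.identifier)
    store[incoming.identifier] = incoming
    return previous != incoming


store: Dict[str, Record] = {}
morning = Record("oai:example:1", "2001-05-29", "first edit")
evening = Record("oai:example:1", "2001-05-29", "second edit")

upsert(store, morning)
skipped = not is_newer_by_datestamp(store, evening)  # datestamp check would skip it
changed = upsert(store, evening)                     # unconditional update catches it
```

The returned flag is only there so the harvester can decide whether a
reindex is needed; the overwrite itself happens either way.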

Regards,
Jeff

> -----Original Message-----
> From: Xiaoming Liu [mailto:liu_x@cs.odu.edu]
> Sent: Friday, May 25, 2001 4:16 PM
> To: Hussein Suleman
> Cc: Jozef Kruger; OAI-impl (E-mail)
> Subject: Re: [OAI-implementers] new records in combination with a
> resumptionToken
> 
> 
> Hi, Hussein and Jozef,
> 
> First, in the arc service provider, we actually used the method you
> mentioned below.
> 
> > b) always ask for a 2 day overlap ending on the current date
> 
> Except that the "2 day" is customizable.
> 
> Secondly, about Jozef's question, I thought duplicate records also
> happen in other scenarios of OAI request.
> (a) same record belongs to different sets, and harvester harvests by set.
> (b) Record is changed since last harvest. So datestamp is changed but ID
> is intact.
> 
> Harvester has to deal with duplicate records anyway, it could simply
> update local copy, or check datestamp first, then do update if necessary,
> depending on the overhead of reindex.
> 
> So I believe it's not necessary to explicitly avoid the scenario in your
> application.
> 
> 
> regards,
> liu
> 
> 
> On Wed, 23 May 2001, Hussein Suleman wrote:
> 
> > 
> > hi
> > 
> > this is an interesting problem so im going to share some of our
> > discussions here at virginia tech that are relevant to this problem ...
> > 
> > of course there is no general solution since i think the OAI quite
> > deftly avoided handling too much complication in the protocol ... that
> > said, there are two very interesting "solutions", one of which is
> > probably relevant to you:
> > 
> > firstly, i recall a while back someone (cant remember who) related how
> > they implemented the protocol by making a temporary table to support
> > resumptions ... this would probably solve your problem but would require
> > a bit more work ...
> > 
> > the alternative is to consider how service providers work (at least this
> > is how we thought it through when building our first experimental
> > harvester):
> > 
> > a) since you can always add records at any time during the day and the
> > granularity of harvesting is a day, you cannot trust data you got on the
> > same day.
> > 
> > b) since dates are local to different timezones, if the data provider is
> > west of the service provider, asking for everything up until yesterday
> > is not "interoperationally stable" since it could still be yesterday at
> > the data provider.
> > 
> > now there are multiple solutions to this and we tried implementing some:
> > a) dont get anything newer than 2 days old
> > b) always ask for a 2 day overlap ending on the current date
> > c) use a 1-day overlap and operate in the timezone of the data provider
> > (extract an initial responseDate from the data provider and then
> > increment locally)
> > 
> > as far as we can figure, any service provider that wants to avoid
> > missing data entries has to do something like this ... since new data is
> > not "stable" for harvesting it is not trusted and/or not harvested
> > immediately and your problem of database updates pretty much disappears
> > as long as harvesting is by date (which i trust it almost always is)
> > 
> > ok, i know this is probably way too much detail for this question :) but
> > i just wanted to share these thoughts to see if they aligned with the
> > harvesting approaches used by other people building service provider
> > interfaces ...
> > 
> > any further comments will be appreciated ...
> > 
> > ttfn
> > ----hussein
> > 
> > -- 
> > 
> > ========================================================================
> > hussein suleman -- hussein@vt.edu -- vtcs -- http://purl.org/net/hussein
> > =========================================
> 
> _______________________________________________
> OAI-implementers mailing list
> OAI-implementers@oaisrv.nsdl.cornell.edu
> http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers
>