FW: [OAI-implementers] Open Archives Initiative Protocol for Metadata Harvesting Version 2 news

Xiaoming Liu liu_x@cs.odu.edu
Thu, 7 Feb 2002 23:59:11 -0500 (EST)


Alan,

I guess there are two sides to my argument: the data provider (DP) and
the service provider (SP).

From the SP side, it cannot presume that "a request for the past will
always get the same answer", so the method suggested by Walter won't work.
Instead, the SP has to use the resumptionToken to get the right answer.

From the DP side, each provider can implement the resumptionToken in its
own way. If a DP can promise that "a request for the past will never
change", or it doesn't mind missing something, it can use the method I
suggested. That's the case for a CVS-like system (which keeps each version
under a different release number), or maybe for some historical documents.
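
To make the method I suggested concrete, here is a rough sketch of a DP
that packs the query parameters and an offset into the token, using the
1999:2000:math:oai_dc:100 layout from my earlier mail. The names
(TokenState, encode_token, decode_token, next_token) are made up for
illustration, and the sketch assumes that no field contains a ":" itself,
which a real set name might.

```python
from typing import NamedTuple, Optional


class TokenState(NamedTuple):
    """Everything the DP needs to resume a list request (hypothetical layout)."""
    from_date: str
    until_date: str
    set_spec: str
    metadata_prefix: str
    offset: int


def encode_token(state: TokenState) -> str:
    # Assumption: no field contains ":"; a real DP would need to escape it.
    return ":".join([state.from_date, state.until_date, state.set_spec,
                     state.metadata_prefix, str(state.offset)])


def decode_token(token: str) -> TokenState:
    from_date, until_date, set_spec, metadata_prefix, offset = token.split(":")
    return TokenState(from_date, until_date, set_spec, metadata_prefix, int(offset))


def next_token(state: TokenState, page_size: int, total_matches: int) -> Optional[str]:
    """Return the token for the following page, or None when the list is complete."""
    new_offset = state.offset + page_size
    if new_offset >= total_matches:
        return None
    return encode_token(state._replace(offset=new_offset))
```

The DP keeps nothing between requests, so this is only safe under the
condition above: the result set for the date range must not change while
the SP is paging through it.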

So my opinion is: the SP has to use the resumptionToken, and the DP has
its own options for how to implement it.
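
On the SP side, the harvesting loop might look roughly like the sketch
below. The SP treats the token as opaque and just hands it back until the
DP stops returning one. The function harvest and its fetch_page callback
are hypothetical; fetch_page stands in for whatever HTTP and XML handling
the SP really does and is assumed to return the records of one response
together with the token, if any.

```python
from typing import Callable, Dict, List, Optional, Tuple

# One response: the records on this page plus the resumptionToken, if any.
Page = Tuple[List[str], Optional[str]]


def harvest(fetch_page: Callable[[Dict[str, str]], Page],
            first_request: Dict[str, str]) -> List[str]:
    """Collect a complete list by following resumptionTokens.

    fetch_page is a hypothetical callback supplied by the SP.
    """
    records: List[str] = []
    params = dict(first_request)
    while True:
        page, token = fetch_page(params)
        records.extend(page)
        if not token:
            # No token (or an empty one) means the list is complete.
            return records
        # The token is opaque to the SP: send it back unchanged, without
        # repeating the original from/until/set arguments.
        params = {"verb": first_request["verb"], "resumptionToken": token}
```

The SP never looks inside the token, so it works the same whether the DP
keeps a snapshot on the server or encodes its state in the token.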


About "whether new records are created with monotonic dates": see the
definition of datestamp in the OAI-PMH:
"A datestamp is the date of creation, deletion, or latest date of
modification of an item, the effect of which is a change in the metadata
of a record disseminated from that item."

So in a correctly implemented OAI repository, new records should be
created with monotonic dates. In your webpage/crawler case, the datestamp
of the metadata is the date the web page was harvested.

> Or is the idea with OAI that if a record is updated, then the
> old slot is marked as 'deleted' and a new record added as 'inserted'
> to keep the same number of slots around?

If a record is changed (but its identifier stays the same), the correct way
is to change the datestamp. However, if you have a version control system
and change the identifier each time, "deleted"/"inserted" is also a valid way.

> The only invariant that I can think of is the date stamp.
> If date/time stamps (to a high resolution) were used, and the
> results of ListRecords was in monotonically increasing order
> of time, then you actually no longer need resumptionToken at all.

As I understand it, OAI 2.0 (from Carl and Herbert's email) will support
high-resolution date/time stamps as an option. However, there is no promise
that the results of ListRecords will be in monotonically increasing order
of time. (That may be an unnecessary limitation for some data providers.)

But I agree that a purely stateless protocol would be possible if all the
assumptions are satisfied (high-resolution datestamps and results ordered
by time).
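
As a rough sketch of what that could look like, under exactly those
assumptions: the toy repository below is an in-memory list, the names
list_records and harvest_stateless are made up, datestamps are assumed to
be unique ISO 8601 UTC strings (so they sort as plain text), and 'from' is
treated as inclusive.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (datestamp, identifier)


def list_records(repo: List[Record], from_stamp: str, page_size: int) -> List[Record]:
    """Toy DP: return one page of records with datestamp >= from_stamp,
    ordered by datestamp (the ordering is the assumption being illustrated)."""
    matching = sorted(r for r in repo if r[0] >= from_stamp)
    return matching[:page_size]


def harvest_stateless(repo: List[Record], page_size: int) -> List[Record]:
    """Toy SP: page through the repository using only a 'from' value."""
    if page_size < 2:
        # Because 'from' is inclusive, each page repeats the last record
        # already seen; a one-record page could never make progress.
        raise ValueError("page_size must be at least 2")
    harvested: List[Record] = []
    from_stamp = ""  # the empty string sorts before every datestamp
    while True:
        page = list_records(repo, from_stamp, page_size)
        fresh = [r for r in page if r not in harvested]
        if not fresh:
            return harvested
        harvested.extend(fresh)
        # Resume from the newest datestamp seen; no token is needed.
        from_stamp = fresh[-1][0]
```

If a record is updated during the harvest it simply reappears later with
its new datestamp, so nothing is skipped. If several records could share
one datestamp, or the results were not ordered, this loop would no longer
be safe.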

Regards,
liu

On Fri, 8 Feb 2002, Alan Kent wrote:

> Sorry if this is all old hat to other people, but I find getting involved
> is the best way to learn and understand. People can always ignore me! :-)
> 
> On Thu, Feb 07, 2002 at 09:50:27PM -0500, Xiaoming Liu wrote:
> > --- Walter Underwood wrote:
> > > A request for all changes between two dates in the past should always get
> > > the same answer, so stateless harvesting should work.
> > 
> > This is a neat way, but I am not sure how well the past is kept in digital
> > libraries ;-) Especially in the OAI protocol, whenever a record is changed,
> > its datestamp is changed too. So even a request for the past may not get
> > the same answer.
> 
> and
> 
> > Maybe there is one way to implement a stateless protocol in current OAI:
> > encode query parameters in ResumptionToken:
> ...
> > one example is:
> > resumptionToken= 1999:2000:math:oai_dc:100
> 
> I assume the 100 means start from record 100.
> 
> So by your own argument, the contents of previous queries may change
> between requests. So the server *must* keep a copy of the state of the
> system when the original query was issued and continue to provide
> that consistently to the client. If the results are not consistent,
> data could be lost (overlooked) during a long transfer.
> 
> Let me expand and ask a few questions (partly from my ignorance).
> Is it expected with OAI that new records will come into existence
> at a previous point in time? Or are all new records always
> created with monotonically increasing date/time values? For example,
> if metadata is harvested from a web site, would the dates of the
> web pages be used? Or the date the data was harvested be used?
> If the date of the web page, then when a new site is crawled,
> new pages can come into existence dated in the past. If the date
> the metadata was collected from the web page, then dates increase
> monotonically.
> 
> If new records are *not* created with monotonic dates, then OAI falls
> down doesn't it? Any one who has done a previous crawl may never crawl
> for that old date range again and so not get the data. So to be safe,
> dates must be monotonically increasing for metadata modified in the
> repository.
> 
> If changes to the repository are then always given monotonically
> increasing dates, then history will never be added to. However,
> history can be lost if an old entry is updated (as it will be given
> a newer date). So if a cursor scheme which says 'give me
> records starting from 100' is used, then if a record that was in
> the range 1-99 is updated between requests, then what was record
> number 100 would slip back to become record number 99. The request
> starting from 100 would then miss that record.
> 
> Or is the idea with OAI that if a record is updated, then the
> old slot is marked as 'deleted' and a new record added as 'inserted'
> to keep the same number of slots around?
> 
> The normal way this problem is addressed in database systems of
> course is to use transactions. When the query is used, the full
> answer is effectively worked out and kept around. Any updates,
> inserts, or deletes do not affect the query results. The current
> OAI protocol then uses the resumptionToken to identify the query
> set. But at some stage, the query may be discarded. If the client
> has not got all the data yet, then it has to start again from
> scratch (unless the data is guaranteed to be returned in monotonically
> increasing date order - which it's not at present, I think).
> 
> Using the identifier of a record to remember the position in a
> result set is no good either. If that record is updated, it will
> move in the result set, messing things up again.
> 
> The only invariant that I can think of is the date stamp.
> If date/time stamps (to a high resolution) were used, and the
> results of ListRecords was in monotonically increasing order
> of time, then you actually no longer need resumptionToken at all.
> Instead, a new request can be specified with a precise 'from'
> value. That would make requests completely stateless. Deletions
> in history (due to an update) would not be a problem.
> 
> Ok, I will be quiet now and let someone with more history behind
> OAI and all its goals etc speak instead.
> 
> Alan
> _______________________________________________
> OAI-implementers mailing list
> OAI-implementers@oaisrv.nsdl.cornell.edu
> http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers
>