[OAI-implementers] new records in combination with a resumptionToken

Henry Stern henry@stern.ca
Thu, 24 May 2001 09:11:44 -0300


In the new version of the CIMI repository, we use temporary tables managed
by a Java program.  Only the keys for the metadata are stored in the
temporary table.  Every request involving a resumptionToken results in a
LEFT JOIN on the temporary table and the metadata table with a limit and
offset.  You can tell which records have been deleted because their metadata
is null (because of the left join).

Some sample SQL:
CREATE TABLE temporarya83lkdsfj AS SELECT some_key FROM metadata_table WHERE
modification_time > '2000-05-24';
SELECT metadata_table.metadata_field FROM temporarya83lkdsf LEFT JOIN
metadata_table ON temporarya83lkdsf.some_key = metadata_table.some_key LIMIT
100, 100;

It should be noted that the temporary table has no keys or indices, nor was
the distinct keyword used.  It quadruples the time of the query and has no
benefit for the left join, which obviously runs in linear time.

The main advantage to this method is that the query only needs to be run
once.  Joining a subset of a table to another table based on its primary key
is pretty quick.

The disadvantage to this method is that if a massive amount of hits come in
at once, MySQL chokes up.  The CPU won't be working at full speed nor will
the disk so it might be some sort of internal deadlocking issue.  It does,
eventually, clear up.

What were some other solutions?

Henry

> -----Original Message-----
> From: oai-implementers-admin@oaisrv.nsdl.cornell.edu
> [mailto:oai-implementers-admin@oaisrv.nsdl.cornell.edu]On Behalf Of
> Hussein Suleman
> Sent: May 23, 2001 9:33 PM
> To: Jozef Kruger
> Cc: OAI-impl (E-mail)
> Subject: Re: [OAI-implementers] new records in combination with a
> resumptionToken
>
>
> hi
>
> this is an interesting problem so im going to share some of our
> discussions here at virginia tech that are relevant to this problem ...
>
> of course there is no general solution since i think the OAI quite
> deftly avoided handling too much complication in the protocol ... that
> said, there are two very interesting "solutions", one of which is
> probably relevant to you:
>
> firstly, i recall a while back someone (cant remember who) related how
> they implemented the protocol by making a temporary table to support
> resumptions ... this would probably solve your problem but would require
> a bit more work ...
>
> the alternative is to consider how service providers work (at least this
> is how we thought it through when building our first experimental
> harvester):
>
> a) since you can always add records at any time during the day and the
> granularity of harvesting is a day, you cannot trust data you got on the
> same day.
>
> b) since dates are local to different timezones, if the data provider is
> west of the service provider, asking for everything up until yesterday
> is not "interoperationally stable" since it could still be yesterday at
> the data provider.
>
> now there are multiple solutions to this and we tried implementing some:
> a) dont get anything newer than 2 days old
> b) always ask for a 2 day overlap ending on the current date
> c) use a 1-day overlap and operate in the timezone of the data provider
> (extract an initial responseDate from the data provider and then
> increment locally)
>
> as far as we can figure, any service provider that wants to avoid
> missing data entries has to do something like this ... since new data is
> not "stable" for harvesting it is not trusted and/or not harvested
> immediately and your problem of database updates pretty much disappears
> as long as harvesting is by date (which i trust it almost always is)
>
> ok, i know this is probably way too much detail for this question :) but
> i just wanted to share these thoughts to see if they aligned with the
> harvesting approaches used by other people building service provider
> interfaces ...
>
> any further comments will be appreciated ...
>
> ttfn
> ----hussein
>
> --
> ========================================================================
> hussein suleman -- hussein@vt.edu -- vtcs -- http://purl.org/net/hussein
> ========================================================================
>