[OAI-implementers] OAI Aggregator

Thu, 14 Feb 2002 10:21:49 +1100

On Wed, Feb 13, 2002 at 12:43:24PM -0000, Tim Brody wrote:
> Dear All,
> 
> Announcing the release of a beta OAI aggregating tool: OAIA
> 
> Based on PERL and MySQL, OAIA is a _very_ simple mechanism for providing
> caching and aggregating of OAI repositories.

Having read the article http://documents.cern.ch/ettdh/doc/public/OAIRSF.html
which talks about hierarchical harvesting, is the idea then for this
package to collect data from multiple data providers, then provide the
data to multiple service providers?

If this is the case, should more work be done in terms of mapping out
the relationship between different OAI repositories and copies? As
a new person to this list, I just looked at the list of available sites
and said "great, I will crawl them all!". But a recent mail I got
indicated that one of the repositories was a copy (or included all
of) another repository. This would seem to occur even more often with
OAIA-like packages becoming available.

There are a couple of possible strategies I could think of quickly (I
am sure others have been thinking longer about it):

(1) Improve the sophistication of the global XML document listing
    various OAI repositories, showing how they inter-relate.

(2) Extend the XML of the Identify return to (optionally) include
    details such as 'I have local data', and 'I also have data
    crawled from this other site using this query (set name)'.
    The default assumption would be that the data is local.

Putting it into the Identify command would avoid registration complexity.
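
A rough sketch of what (2) might look like, in Python. The element and
attribute names (provenance, localData, harvestedFrom, baseURL, set) are
made up for illustration and are not part of the OAI protocol; the point
is only that a harvester could read such a block and learn which other
sites a repository copies from.

```python
import xml.etree.ElementTree as ET

# All element and attribute names below are hypothetical.

def build_provenance(has_local, harvested_sources):
    """Build the extra block a repository might add to its Identify reply.

    harvested_sources is a list of (base_url, set_name) pairs; set_name
    may be None when the whole upstream repository was crawled.
    """
    root = ET.Element("provenance")
    ET.SubElement(root, "localData").text = "true" if has_local else "false"
    for base_url, set_name in harvested_sources:
        src = ET.SubElement(root, "harvestedFrom")
        src.set("baseURL", base_url)
        if set_name is not None:
            src.set("set", set_name)
    return root

def has_local_data(provenance):
    """A missing localData element means local data (the default)."""
    node = provenance.find("localData")
    return node is None or node.text == "true"

def upstream_urls(provenance):
    """Base URLs this repository says it harvested from."""
    return [src.get("baseURL") for src in provenance.findall("harvestedFrom")]
```

A crawler could then skip, or at least flag, any repository whose
upstream URLs it already harvests directly.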

Hmmm, serious question time!

Does the aggregator keep the original identifiers for metadata (or
assign new local identifiers)? Does an instance of OAIA get registered
as a new repository? Would this imply a site can return metadata with
an identifier from a different site? Would this in turn mean that
harvesters need to be careful? If they harvest from two OAIA sites
which both harvest from the same original site, and one OAIA site is
more up to date than the other, they may get old metadata back.
This means a harvester can no longer blindly (like mine! :-) crawl
sites and rely on the sites returning data in an appropriate order.
The harvester must compare the date on the retrieved record to the
date on the local cached copy of the record to make sure the data
(or delete request!!) is more up to date than the local data.
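
The date comparison could be as small as the following Python sketch.
It assumes the local cache is a plain dict keyed by record identifier,
and that datestamps have already been parsed into datetime.date values;
merge_record and the sample identifier are names invented for this
example.

```python
from datetime import date

def merge_record(cache, identifier, datestamp, metadata):
    """Store a harvested record only if it is newer than the cached copy.

    cache maps identifier -> (datestamp, metadata). Returns True when
    the cache was changed, False when the incoming copy was stale.
    """
    current = cache.get(identifier)
    if current is not None and current[0] >= datestamp:
        return False  # same age or older: keep what we already have
    cache[identifier] = (datestamp, metadata)
    return True

cache = {}
# The up-to-date aggregator is harvested first...
merge_record(cache, "oai:example.org:1", date(2002, 2, 13), "revised title")
# ...then the lagging one offers an older copy of the same record.
merge_record(cache, "oai:example.org:1", date(2002, 1, 5), "original title")
print(cache["oai:example.org:1"][1])  # prints "revised title"
```

With this check in place the order in which the two sites are crawled
no longer matters.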

Maybe I should be doing this anyway, since lists are not ordered
and I cannot remember any guarantee that a list will not contain
two updates to the same record. This makes deletes a bit more tricky
too. It's not safe just to delete the local copy. I really need to
cache the delete notification to be able to compare date/time stamps.
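
To make the delete case concrete, here is a small Python sketch of
keeping the delete notification around as a marker instead of dropping
the record. The names (TOMBSTONE, apply_change, live_records) and the
dict-based cache are assumptions for illustration only.

```python
from datetime import date

TOMBSTONE = object()  # stored in place of metadata for a deleted record

def apply_change(cache, identifier, datestamp, metadata=None, deleted=False):
    """Apply an update or a delete, but only if it is newer than the cache.

    cache maps identifier -> (datestamp, metadata or TOMBSTONE).
    """
    current = cache.get(identifier)
    if current is not None and current[0] >= datestamp:
        return False  # stale change, ignore it
    cache[identifier] = (datestamp, TOMBSTONE if deleted else metadata)
    return True

def live_records(cache):
    """Records that are currently not deleted."""
    return {ident: meta for ident, (_, meta) in cache.items()
            if meta is not TOMBSTONE}

cache = {}
apply_change(cache, "oai:example.org:1", date(2002, 2, 10), deleted=True)
# An older update turns up afterwards; the cached delete is newer, so
# the record does not come back to life.
apply_change(cache, "oai:example.org:1", date(2002, 2, 1), metadata="old title")
print(live_records(cache))  # prints {}
```

Had the delete simply removed the entry, the second call would have
found nothing to compare against and re-created the record.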

All very interesting.

Alan