[OAI-implementers] OAI Aggregator

Thu, 14 Feb 2002 12:55:35 -0000

Apologies for long email ...

----- Original Message -----
From: "Alan Kent" <ajk@mds.rmit.edu.au>
To: "Tim Brody" <tim@tim.brody.btinternet.co.uk>; "OAI Implementors"
<oai-implementers@oaisrv.nsdl.cornell.edu>
Sent: Wednesday, February 13, 2002 11:21 PM
Subject: Re: [OAI-implementers] OAI Aggregator

> On Wed, Feb 13, 2002 at 12:43:24PM -0000, Tim Brody wrote:
>
> > Announcing the release of a beta OAI aggregating tool: OAIA
> >
> > Based on PERL and MySQL, OAIA is a _very_ simple mechanism for providing
> > caching and aggregating of OAI repositories.
>
> Having read the article
> http://documents.cern.ch/ettdh/doc/public/OAIRSF.html
> which talks about hierarchical harvesting, is the idea then for this
> package to collect data from multiple data providers, then provide the
> data to multiple service providers?

It could do, yes.

The reason for writing OAIA was to alleviate the problem of DP9 overloading
data providers (which is especially troublesome because it is based on
GetRecord, rather than ListRecords requests).
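To illustrate the difference in load, here is a minimal sketch of the two request styles. GetRecord and ListRecords are the protocol's own verbs; the base URL, the function names and the oai_dc default are assumptions made up for the example, not anything taken from DP9 or OAIA.

```python
from urllib.parse import urlencode

# Hypothetical data provider base URL, used only for illustration.
BASE_URL = "http://example.org/cgi-bin/oai"


def get_record_urls(identifiers, metadata_prefix="oai_dc"):
    """One GetRecord request per record: N identifiers cost N requests."""
    return [
        BASE_URL + "?" + urlencode({
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        })
        for identifier in identifiers
    ]


def list_records_url(metadata_prefix="oai_dc", from_date=None):
    """One ListRecords request returns a batch of records, optionally
    restricted to those changed since from_date."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return BASE_URL + "?" + urlencode(params)
```

A service working through 1000 records one at a time issues 1000 requests with the first function, whereas an aggregator using the second can ask once for everything changed since its last harvest (plus any follow-up requests the data provider asks for when it splits a large response).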

OAIA could also be used to build unified collections specific to a community
or perhaps a geographical region. This alleviates the maintenance problems
that global SPs will have, as they will only need to harvest half a dozen
DPs, compared to potentially 1000s.

> If this is the case, should more work be done in terms of mapping out
> the relationship between different OAI repositories and copies? As
> a new person to this list, I just looked at the list of available sites
> and said "great, I will crawl them all!". But a recent mail I got
> indicated that one of the repositories was a copy (or included all
> of) another repository. This would seem to occur even more often with
> OAIA-like packages becoming available.

I don't see there being a big problem with harvesting the same records from
multiple sources, as long as:
a) Datestamps are always updated to the day of harvest, or the day the
record was changed
b) Harvesters are discerning about what they harvest

(I have built an OAI export for web-logs but would you want to harvest it,
even if it is original?)

> (1) Improve the sophistication of the global XML document listing
>     various OAI repositories, showing how they inter-relate.

Sounds too complex. OAI should (eventually) cluster around communities,
which will solve this problem to a large extent. At the moment the coverage
is too fragmented to become self-organising - with the notable exception of
OLAC.

> (2) Extend the XML of the Identify return to (optionally) include
>     details such as 'I have local data', and 'I also have data
>     crawled from this other site using this query (set name)'.
>     The default assumption would be it's local data.
>
> Putting it into the Identify command would avoid registration complexity.

Done (kind of):
http://citebase.eprints.org/cgi-bin/oai?verb=Identify

> Does the aggregator keep the original identifiers for metadata?

Yes. So you could compare the repositoryName (as returned by Identify) to
the record identifiers it's returning, to work out which records are local,
and which are re-exported.
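A rough sketch of that comparison follows. It assumes identifiers follow an "oai:<repository-id>:<local-id>" convention and that the repository id has already been worked out from the Identify response; the function name and the sample identifiers are made up for the example.

```python
def split_local_and_reexported(identifiers, repository_id):
    """Partition identifiers into those minted by this repository and
    those carried over unchanged from other repositories.

    Assumes the "oai:<repository-id>:<local-id>" identifier convention.
    """
    prefix = "oai:" + repository_id + ":"
    local = [i for i in identifiers if i.startswith(prefix)]
    reexported = [i for i in identifiers if not i.startswith(prefix)]
    return local, reexported


# Hypothetical identifiers as an aggregator might return them.
local, reexported = split_local_and_reexported(
    ["oai:myaggregator:42", "oai:arXiv:hep-th/0001001"],
    "myaggregator",
)
```

A repository whose identifiers do not follow the convention would need some other test.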

> Does an instance of OAIA get registered as a new repository?

It could do (of the existing aggregators, citebase is registered and arc
isn't - but then citebase is also a hidden-augmentor ...).

> Would this imply a site can return metadata with
> an identifier from a different site? Would this in turn mean that
> harvesters need to be careful - if they harvest from 2 OAIA sites,
> which both harvest from the same original site, where one OAIA site
> is more up to date than the other, then you may get old metadata back.

Assuming caveat a) above is adhered to, you should just compare datestamps
and take the newer one. Things get complex if one aggregator is changing the
metadata, while another one isn't - an issue that the technical folks in OAI
2.0 were thinking about. The idea was proposed that the identifier be
changed if a harvester alters the metadata, then re-exports - then the
problem is how to resolve multiple near-duplicate records.
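For the simple case (identifiers kept intact, datestamps maintained as in caveat a), the "take the newer one" rule can be sketched as below. The record layout (a dict with "identifier" and "datestamp" keys) and the function name are assumptions for the example, and datestamps are assumed to be YYYY-MM-DD strings so that plain string comparison orders them correctly. It does nothing about the near-duplicate problem.

```python
def merge_newest(*harvests):
    """Merge record lists harvested from several sources, keeping for
    each identifier only the copy with the most recent datestamp."""
    merged = {}
    for records in harvests:
        for record in records:
            current = merged.get(record["identifier"])
            if current is None or record["datestamp"] > current["datestamp"]:
                merged[record["identifier"]] = record
    return merged


# Two hypothetical aggregators holding the same original record,
# one of them lagging behind the other.
from_site_a = [{"identifier": "oai:example.org:1",
                "datestamp": "2002-01-10", "title": "Old title"}]
from_site_b = [{"identifier": "oai:example.org:1",
                "datestamp": "2002-02-12", "title": "New title"}]
merged = merge_newest(from_site_a, from_site_b)
```

The result does not depend on the order in which the two sites are harvested.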

> Makes deletes a bit more tricky too.
> It's not safe just to delete the local copy. I really need to cache
> the delete notification to be able to compare date/time stamps.

I don't treat a status=deleted as an order to delete the record. arXiv.org
and EPrints.org both treat a deletion as a flag, so that should a user come
across a deleted record they don't get a 404, but a notification that what
they were looking for has been withdrawn, and why.
If you store the deletion as a metadata field, it will be handled by the
same datestamp test as the rest of the metadata.
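As a sketch of what that looks like in a harvester's local store: the deletion is kept as an ordinary record with a "deleted" field, so a stale live copy arriving later from a slower aggregator cannot resurrect it. The field names, the in-memory dict store and the function names are assumptions for the example, not how arXiv.org or EPrints.org actually implement it.

```python
def apply_update(store, record):
    """Store the record only if it is newer than the copy already held.
    Deletions go through the same test as any other change."""
    current = store.get(record["identifier"])
    if current is None or record["datestamp"] > current["datestamp"]:
        store[record["identifier"]] = record


def lookup(store, identifier):
    """Return the title, or a withdrawal notice instead of a bare
    'not found' when the record has been flagged as deleted."""
    record = store.get(identifier)
    if record is None:
        return "not found"
    if record.get("deleted"):
        return "withdrawn: " + record.get("reason", "no reason given")
    return record["title"]


store = {}
apply_update(store, {"identifier": "oai:example.org:7",
                     "datestamp": "2002-02-10", "deleted": True,
                     "reason": "withdrawn by author"})
# An older live copy arriving afterwards is ignored.
apply_update(store, {"identifier": "oai:example.org:7",
                     "datestamp": "2002-01-05", "title": "Some paper"})
```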

All the best,
Tim.