[OAI-implementers] OAI -- some issues

Tim Brody tim@tim.brody.btinternet.co.uk
Mon, 17 Dec 2001 10:28:58 -0000


You raise some interesting questions - I hope my answers provide some help.

----- Original Message -----
From: "K.M. KU" <kmku@hkusua.hku.hk>
To: <oai-implementers@oaisrv.nsdl.cornell.edu>
Sent: Monday, December 17, 2001 12:58 AM
Subject: [OAI-implementers] OAI -- some issues

> - data harvesting (the scheme of harvesting data periodically, or by
>   other ways). The data harvesting mechanism is working in a passive mode
>   (right?) It seems that if the number of data providers goes big, the
>   bandwidth and time for updating metadata may be an issue to consider,
>   correct?

If you presume that every research article produced were placed in an OAI
repository, that would mean millions of new articles (and hence metadata
records) every year. Considering that Google has managed to index 3 billion
web pages (each of which holds considerably more text than an OAI record), I
don't think we will hit the limits of the underlying technology any time soon.
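
As a rough sanity check on the scale argument above, here is a
back-of-envelope sketch. The record count and the per-record size are
illustrative assumptions only (the message says "millions" and gives no
sizes), so treat the result as an order of magnitude, not a measurement.

```python
# Back-of-envelope estimate. Every constant below is an assumed,
# illustrative figure, not a number taken from the discussion.
RECORDS_PER_YEAR = 2_000_000   # assumed: "millions of new articles" per year
BYTES_PER_RECORD = 2_000       # assumed: roughly 2 KB of XML metadata per record
SECONDS_PER_YEAR = 365 * 24 * 3600


def yearly_metadata_bytes(records=RECORDS_PER_YEAR, record_size=BYTES_PER_RECORD):
    """Total metadata volume a harvester would have to fetch and store per year."""
    return records * record_size


def average_bytes_per_second(total_bytes, seconds=SECONDS_PER_YEAR):
    """Average transfer rate if the yearly volume is spread evenly over the year."""
    return total_bytes / seconds


total = yearly_metadata_bytes()
print(f"{total / 1e9:.1f} GB of metadata per year")
print(f"{average_bytes_per_second(total):.0f} bytes/s on average")
```

Under these assumptions a full copy of a year's metadata is a few gigabytes,
which is small next to a web-scale index.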

> - a globalized metadata repository model for service provider. If the
>   number of service providers goes big, that means every service provider
>   keeps a copy of all metadata of data provider and is not quite storage
>   'efficient'. Moreover, the process of data harvesting may overload data
>   provider.

To answer your second point first, I believe there will be many aggregator
services, each providing a single point of entry to many source data
providers. This is a pretty simple mechanism for service providers to
implement, since they are already downloading and storing OAI responses. ARC
and CiteBase are both current aggregators (i.e. they have some kind of OAI
export).
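
To make the load argument concrete, the sketch below counts harvesting
relationships with and without aggregators. The function names and the
example figures (100 data providers, 20 service providers) are hypothetical,
and the model assumes each service provider harvests from exactly one
aggregator.

```python
def direct_harvests(data_providers: int, service_providers: int) -> int:
    """Every service provider harvests every data provider directly."""
    return data_providers * service_providers


def aggregated_harvests(data_providers: int, service_providers: int,
                        aggregators: int = 1) -> int:
    """Each aggregator harvests every data provider; each service provider
    then harvests a single aggregator (an assumption of this sketch)."""
    return aggregators * data_providers + service_providers


def harvesters_per_data_provider(service_providers: int, aggregators: int = 0) -> int:
    """Number of harvesters that contact any one data provider."""
    return aggregators if aggregators > 0 else service_providers


print(direct_harvests(100, 20))        # relationships without an aggregator
print(aggregated_harvests(100, 20))    # relationships with one aggregator
```

In this model a data provider is contacted once per aggregator instead of
once per service provider, which is where the relief comes from.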

The only alternative (?) to storing a copy of all the metadata is
distributed search, which is complex, has scaling problems, and places the
emphasis on processing (expensive) over storage (cheap).

> - if a record has been deleted permanently from the repository, data
>   needs to keep a 'copy' of deleted data so it can return an attribute --
>   'delete' (right?). How can a data provider know if ALL service providers
>   updated the metadata?

I don't think it can (consider Google's cache, and the Internet Archive).
When storage is cheap, and information is precious, we shouldn't delete
anything, but just tag things as "deleted" (hopefully with a note to say
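
A minimal sketch of the "tag, don't delete" idea follows. The class and
method names (`TombstoneStore`, `changed_since`) and the `oai:example:`
identifiers are hypothetical and are not part of the OAI protocol. The point
is that a record kept as a deleted marker still shows up in any later
incremental harvest, even though the data provider cannot tell who has
harvested it.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Record:
    identifier: str
    datestamp: datetime          # time of the last change, including deletion
    metadata: Optional[dict]     # None once the record has been deleted
    deleted: bool = False


class TombstoneStore:
    """Hypothetical in-memory store that tags records as deleted
    instead of removing them."""

    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}

    def put(self, identifier: str, metadata: dict, when: datetime) -> None:
        self._records[identifier] = Record(identifier, when, metadata)

    def delete(self, identifier: str, when: datetime) -> None:
        # Keep the entry, drop the metadata, and bump the datestamp so the
        # deletion is visible to harvesters asking for recent changes.
        record = self._records[identifier]
        record.metadata = None
        record.deleted = True
        record.datestamp = when

    def changed_since(self, since: datetime) -> List[Record]:
        """Records created, updated, or deleted on or after `since`."""
        changed = [r for r in self._records.values() if r.datestamp >= since]
        return sorted(changed, key=lambda r: r.datestamp)


store = TombstoneStore()
store.put("oai:example:1", {"title": "First paper"}, datetime(2001, 12, 1))
store.put("oai:example:2", {"title": "Second paper"}, datetime(2001, 12, 2))
store.delete("oai:example:1", datetime(2001, 12, 10))

# A harvester that last synchronised on 5 December sees only the deletion.
for record in store.changed_since(datetime(2001, 12, 5)):
    print(record.identifier, "deleted" if record.deleted else "active")
```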

All the best,
Tim Brody