[OAI-implementers] Records getting out of sets and repository/harvester implementation

Diogo Mena Reis diogo.menareis at ist.utl.pt
Fri Oct 15 16:13:23 EDT 2010

Dear Samuele,

Wow, that is a real borderline scenario.

It seems your problem is a semantic one, and my understanding is that the protocol does not address it, except perhaps implicitly. From the spec http://www.openarchives.org/OAI/openarchivesprotocol.html#deletion you only have the deleted status of a record, not the set the deletion is associated with.

On 14/Oct 2:06 PM, Samuele Kaplun wrote:
> Now suppose an item belongs to both setA and setB at some point in time.
> But then because of a change in the definition of setA (or because of an
> update of the record metadata) the record does no longer belong to setA,
> but continues to belong to setB.
> How can this information be given to harvester?

So, consider the case where a harvester:
1. Harvests set B, where the record exists
2. Harvests set A, where the record was deleted

The record should end up deleted from set A on the harvester side if the harvester kept the same identifier for both sets (which it doesn't have to, so you can't tell). I think your best option in this case is to keep returning the record in set B as updated for some time (weeks or months), even if it has not changed, to ensure that the following harvest retrieves the record again. If you only return the record as belonging to set B, the harvester will never know it was removed from A.
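To make the suggestion concrete, here is a minimal sketch of how a repository could do this. It is only an illustration under my own assumptions, not something the spec prescribes: the `Record` class, `leave_set`, `list_headers` and the 60-day `GRACE` window are all hypothetical names and values.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Assumed re-announcement window ("weeks or months"); pick your own value.
GRACE = timedelta(days=60)

@dataclass
class Record:
    identifier: str
    datestamp: datetime
    sets: set
    # setSpec -> time at which the record left that set (hypothetical bookkeeping)
    left: dict = field(default_factory=dict)

def leave_set(rec, set_spec, when):
    """Remove rec from set_spec and remember when that happened."""
    rec.sets.discard(set_spec)
    rec.left[set_spec] = when
    rec.datestamp = when  # treat the membership change as an update

def list_headers(records, set_spec, from_, now):
    """Return (identifier, setSpecs, status) tuples for a selective harvest."""
    out = []
    for rec in records:
        recently_left = any(now - t <= GRACE for t in rec.left.values())
        if set_spec in rec.sets:
            # Keep re-announcing the record during the grace window,
            # even if its datestamp is older than the requested 'from'.
            if rec.datestamp >= from_ or recently_left:
                out.append((rec.identifier, sorted(rec.sets), "active"))
        elif set_spec in rec.left and rec.left[set_spec] >= from_:
            # The record left this set: report it as deleted for this set only.
            out.append((rec.identifier, [set_spec], "deleted"))
    return out
```

With this, a harvester that asks for set A sees the record as deleted, and any harvest of set B during the grace window gets the record again, whatever `from` date it sends.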

> Another possibility is that the record is instead returned as such both
> when harvested via setB but also via setA, but the <setSpec> in both
> harvesting session, for the given record, should only mention setB (as
> the record now belongs only to setB and no longer to setA).

That is a little subtle, I think. I doubt anyone implements that. Getting a record that does not say it was deleted and checking whether the requested set is missing from its setSpecs is a little far-fetched. My first guess would be that there was a bug in the server's implementation =) And I didn't find any reference to that in the spec.

> A smart harvester would then do the right thing, i.e. delete the record
> if it knows it is not harvesting setB or ignoring the record (as it will
> receive it anyway when it will harvest setB later)

Like I said, I think it's not in the spec, so a smart harvester would send you an email telling you that you're returning records from the wrong set.
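For what it's worth, the check such a harvester would run is simple. The sketch below is my own illustration (the function name `records_outside_set` is made up); it only assumes the standard OAI-PMH 2.0 namespace and the usual colon-separated set hierarchy.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def records_outside_set(response_xml, requested_set):
    """Return identifiers of non-deleted records whose setSpecs do not
    include the requested set (or one of its sub-sets)."""
    root = ET.fromstring(response_xml)
    suspicious = []
    for header in root.iter(OAI + "header"):
        if header.get("status") == "deleted":
            continue  # deleted records are expected, nothing to check
        specs = [el.text or "" for el in header.findall(OAI + "setSpec")]
        in_set = any(
            s == requested_set or s.startswith(requested_set + ":")
            for s in specs
        )
        if not in_set:
            suspicious.append(header.findtext(OAI + "identifier"))
    return suspicious
```

Whether a non-empty result means "remove this record from the set" or "report a server bug" is exactly the question here; the code cannot decide that for you.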

Actually, I think there is an important gap in the OAI-PMH spec for "dirty" sets. Imagine that you must either (a) change all record identifiers (by mistake or because of a change in id policy) or (b) change one field in all the records. There should be a way to say: this set is dirty, harvest from scratch. What you must do instead (according to the spec), if you support persistent deleted records, is: for (a), list ALL the records as deleted and list all the records again with their new ids; for (b), list all the records as updated from the change date onward. If you have 10 million records in that set, that is a big change log you must keep. Forever.
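To show the size of the problem for the identifier change, here is a rough sketch. The `reidentify` function and the dictionary layout are hypothetical, just a way to count entries, not how any particular repository stores its change log.

```python
from datetime import datetime

def reidentify(records, rename, when):
    """records: dict mapping identifier -> datestamp.
    rename: function giving the new identifier for an old one.
    Returns the change log the repository has to keep afterwards."""
    changelog = {}
    for old_id in records:
        # One persistent tombstone for the old identifier...
        changelog[old_id] = {"datestamp": when, "status": "deleted"}
        # ...plus one fresh entry for the same record under its new identifier.
        changelog[rename(old_id)] = {"datestamp": when, "status": "active"}
    return changelog
```

Every record turns into two entries, so 10 million records give 20 million entries, half of which are tombstones that a repository with persistent deletions cannot drop.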


