[OAI-implementers] issues with OAI-PMH specifications for OAI-Provider implementations using a cache

Michael Nelson mln at cs.odu.edu
Tue Jun 2 11:15:13 EDT 2009


unless I'm misunderstanding the problem, I think you *have* to use the 
"workaround" mentioned below -- otherwise the repository is not really 
being honest about its updates.  if the cache updates occur only at T0 and 
T3, the repository can't claim any datestamps of T1 or T2.  The records 
may have entered the repo sometime around T2, but they did not surface to 
OAI-PMH until T3.

the harvester shouldn't have to care about how the repo is managing its
data structures (caches, real-time accesses, etc.)

I think the distinction is in separating repo datestamps (e.g., ingest) 
and OAI-PMH datestamps.
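
A minimal sketch of that separation, assuming a hypothetical in-memory
cache (the names OaiCache, CachedRecord, and the oai:example identifiers
are illustrative and are not taken from the Fedora or Escidoc code). Each
record keeps its repo datestamp, but selective harvesting and
earliestDatestamp use only the time the record surfaced in the cache:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass
class CachedRecord:
    identifier: str
    repo_datestamp: datetime  # when the record changed in the underlying repo
    oai_datestamp: datetime   # when the change surfaced to OAI-PMH (cache update)


@dataclass
class OaiCache:
    records: Dict[str, CachedRecord] = field(default_factory=dict)
    first_update: Optional[datetime] = None

    def update(self, changes, now):
        """Load (identifier, repo_datestamp) pairs and stamp each with `now`."""
        if self.first_update is None:
            self.first_update = now
        for identifier, repo_datestamp in changes:
            self.records[identifier] = CachedRecord(identifier, repo_datestamp, now)

    def earliest_datestamp(self):
        # Value for Identify: the first cache update, not the oldest repo datestamp.
        return self.first_update

    def list_identifiers(self, from_, until):
        # Filter on the OAI-PMH datestamp only; both bounds are inclusive.
        return sorted(r.identifier for r in self.records.values()
                      if from_ <= r.oai_datestamp <= until)


# Hypothetical timeline: cache updates run only at T0 and T3.
T0, T1, T2, T3 = (datetime(2009, 6, 2, h, tzinfo=timezone.utc) for h in (0, 6, 12, 18))
cache = OaiCache()
cache.update([("oai:example:1", T0)], now=T0)
cache.update([("oai:example:2", T2)], now=T3)  # changed at T2, surfaced at T3

first_harvest = cache.list_identifiers(T0, T2)   # harvester asks for T0..T2
second_harvest = cache.list_identifiers(T2, T3)  # next incremental harvest
print(first_harvest, second_harvest)
```

In this sketch the record that changed at T2 is picked up by the second
harvest because it carries the datestamp T3, so the harvester never needs
to know when the cache was refreshed.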

regards,

Michael

On Tue, 2 Jun 2009, Fridman, Rozita wrote:

> Hello Simeon,
>
> thanks a lot for your quick response.
>
>> The notion of including an explicit start-next-incremental-harvest-from date
>> in the response is something I have thought about too. It would solve the
>> cache problem you describe. Not sure how much support there would be for such
>> a change, what do others think?
>
> Hopefully we will get support from other OAI developers for extending the schema of the OAI-PMH response.
>
>
>> One way to solve this using the current protocol without modification is to
>> use day granularity and to make sure that the cache is updated at least once
>> within each day (and that the update does not span a day boundary in UTC).
>> That way, in your example, T1=T2 always.
>
> It is a good solution until we get a protocol enhancement. But the problem is that when a cache update has not run for a day (for example, because the underlying repository was not available), a harvester will miss the records for that day.
>
> For now we use the same workaround that the Fedora-OAI-Provider uses: we deliver records based on their update time in the cache and not on the original update time of the records in the underlying repository. But this approach requires us to change the earliestDatestamp entry contained in the OAI-PMH Identify response. It has to be set to the time of the first cache update and not to the original earliest timestamp in the underlying repository. Otherwise a harvester may miss changes in the time range between the earliest timestamp in the underlying repository and the first cache update time.
>
>> If you opted to follow the 503 route then you could issue a second/multiple
>> 503's if the harvester comes back before the update is complete. This is
>> really the only good approach if the cache is in an inconsistent state such
>> that the idempotency requirements of the protocol are not met.
>>
>
> Yes, it is an option.
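
A rough sketch of the repeated-503 approach, assuming a hypothetical
respond() helper and an assumed default estimate of 300 seconds (neither
comes from the thread). Because the update duration is not known in
advance, the provider simply answers every request that arrives during an
update with another 503 and a fresh Retry-After value:

```python
from http import HTTPStatus


def respond(cache_updating: bool, retry_after_seconds: int = 300):
    """Return (status, headers) for an incoming OAI-PMH request.

    While the cache is being rebuilt, answer 503 with a Retry-After
    estimate in seconds; the estimate may be wrong, in which case the
    harvester just receives another 503 on its next attempt.
    """
    if cache_updating:
        return HTTPStatus.SERVICE_UNAVAILABLE, {"Retry-After": str(retry_after_seconds)}
    return HTTPStatus.OK, {"Content-Type": "text/xml"}


# The harvester comes back twice before the update has finished.
for updating in (True, True, False):
    status, headers = respond(updating)
    print(int(status), headers)
```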
>
> Best regards,
> Rozita
>
>> Cheers.
>> Simeon
>>
>>
>>
>> Fridman, Rozita wrote:
>>> Hello all,
>>>
>>> we developed an OAI-Provider for Escidoc repositories.
>>> Escidoc-OAI-Provider is based on the Fedora-OAI-Provider, which uses a
>>> cache to reduce response time. Escidoc repositories are intended to contain
>>> multiple millions of objects. The Escidoc-Core framework only requires
>>> that object metadata stored in an Escidoc repository be well-formed
>>> XML structures. Therefore, using a cache in the Escidoc-OAI-Provider
>>> is essential to ensure the validity of metadata in OAI-PMH responses and an
>>> acceptable response time.
>>>
>>> But the current OAI-PMH protocol specification doesn't account for some
>>> issues caused by the use of a cache.
>>>
>>> The main problem is the time lag between a harvester request and the last
>>> cache update:
>>> A harvester asks the OAI-Provider for all records that have changed
>>> between T0 and T2 in the underlying repository. The last cache update
>>> was at T1. The harvester gets records that have changed between T0 and
>>> T1, but assumes that it got all changes between T0 and T2. Therefore, in
>>> the next request it asks for records that have changed between T2 and T3
>>> and misses all changes between T1 and T2. If the cache update interval
>>> is long and the next cache update takes place after T3, the harvester
>>> also misses all changes between T2 and T3, and so on.
>>>
>>> One proposal would be to put the datestamp of the last cache update into
>>> the OAI-PMH response, in order to inform a harvester about possibly
>>> missed records.
>>>
>>> Does anybody face the same problem? What do you think about it? Maybe
>>> there are better solutions for this problem?
>>>
>>> The other issue is that, depending on the OAI-Provider implementation, a
>>> cache may be in an inconsistent state while a cache update process is
>>> running. Are there means in the OAI-PMH protocol to respond to harvester
>>> requests during a cache update? A possible solution would be to respond
>>> with the HTTP status code 503 Service Unavailable (section 3.1.2.2 of the
>>> specification), but the problem is specifying the Retry-After period. The
>>> duration of a cache update is not constant; it depends on the changes
>>> in the repository.
>>>
>>> Thanks a lot,
>>> Rozita
>>>
>>> -------------------------------------------------------
>>>
>>> Fachinformationszentrum Karlsruhe, Gesellschaft für wissenschaftlich-technische Information mbH.
>>> Registered office: Eggenstein-Leopoldshafen; commercial register: Amtsgericht Mannheim, HRB 101892.
>>> Managing Director: Sabine Brünger-Weilandt.
>>> Chairman of the Supervisory Board: MinR Hermann Riehl.
>>> _______________________________________________
>>> OAI-implementers mailing list
>>> List information, archives, preferences and to unsubscribe:
>>> http://www.openarchives.org/mailman/listinfo/oai-implementers
>>>

----
Michael L. Nelson mln at cs.odu.edu http://www.cs.odu.edu/~mln/
Dept of Computer Science, Old Dominion University, Norfolk VA 23529
+1 757 683 6393 +1 757 683 4900 (f)

