[OAI-implementers] issues with OAI-PMH specifications for OAI-Provider implementations using a cache

Dr R. Sanderson azaroth at liverpool.ac.uk
Tue Jun 2 11:20:33 EDT 2009



Agreed.

One could extend this situation ad absurdum with layers and layers of 
caches, each of which would need to add its own timestamps.  If a 
harvester wants to be certain that it has everything, it could start each 
harvest at the most recent datestamp it has in its database.
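
A rough sketch of that harvester-side idea in Python follows. The function
names, the in-memory dict standing in for the harvester's database, and the
oai_dc default are assumptions for illustration only, not any particular
harvester's code. It relies on datestamps being UTC strings of one
granularity, so that string order equals time order, and on the "from"
argument being inclusive, so re-delivered records are simply overwritten.

```python
from urllib.parse import urlencode


def next_harvest_url(base_url, stored, metadata_prefix="oai_dc"):
    """Build a ListRecords request that resumes from the newest stored datestamp.

    `stored` is a hypothetical local store: identifier -> datestamp string,
    e.g. "2009-06-02T11:20:33Z". All datestamps must share one granularity.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if stored:
        # Lexicographic max equals the latest time for same-granularity UTC strings.
        params["from"] = max(stored.values())
    return base_url + "?" + urlencode(params)


def merge(stored, harvested):
    """Upsert (identifier, datestamp) pairs; repeats from the overlap are harmless."""
    for identifier, datestamp in harvested:
        stored[identifier] = datestamp
    return stored
```

The cost is that records carrying the boundary datestamp are fetched twice,
which seems a fair price for not depending on the harvester's own clock.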

Rob

On Tue, 2 Jun 2009, Michael Nelson wrote:

>
> unless I'm misunderstanding the problem, I think you *have* to use the
> "workaround" mentioned below -- otherwise the repository is not really
> being honest about its updates.  if the cache updates occur only at T0 and
> T3, the repository can't claim any datestamps of T1 or T2.  The records
> may have entered the repo somewhere at T2, but they did not surface to
> OAI-PMH until T3.
>
> the harvester shouldn't have to care about how the repo is managing its
> data structures (caches, real-time accesses, etc.)
>
> I think the distinction is in separating repo datestamps (e.g., ingest)
> and OAI-PMH datestamps.
>
> regards,
>
> Michael
>
> On Tue, 2 Jun 2009, Fridman, Rozita wrote:
>
>> Hello Simeon,
>>
>> thanks a lot for your quick response.
>>
>>> The notion of including an explicit start-next-incremental-harvest-from date
>>> in the response is something I have thought about too. It would solve the
>>> cache problem you describe. Not sure how much support there would be for such
>>> a change, what do others think?
>>
>> Hopefully we will get support from other OAI developers for extending the schema of the OAI-PMH response.
>>
>>
>>> One way to solve this using the current protocol without modification is to
>>> use days granularity and to make sure that the cache is updated at least once
>>> within each day (and that the update does not span a day boundary in UTC).
>>> That way, using your example, T1=T2 always.
>>
>> It is a good solution until we get a protocol enhancement. But the problem is that when a cache update has not run for one day (for example, because the underlying repository was not available), a harvester will miss the records for that day.
>>
>> We now use the same workaround as the Fedora-OAI-Provider: we deliver records based on their update time in the cache and not on the original update time of the records in the underlying repository. But this approach requires us to change the earliestDatestamp entry contained in the OAI-PMH Identify response. It has to be set to the time of the first cache update and not to the original earliest time stamp in the underlying repository. Otherwise a harvester may miss changes in the time range between the earliest time stamp in the underlying repository and the first cache update time.
>>
>>> If you opted to follow the 503 route then you could issue a second/multiple
>>> 503's if the harvester comes back before the update is complete. This is
>>> really the only good approach if the cache is in an inconsistent state such
>>> that the idempotency requirements of the protocol are not met.
>>>
>>
>> Yes, it is an option.
>>
>> Best regards,
>> Rozita
>>
>>> Cheers.
>>> Simeon
>>>
>>>
>>>
>>> Fridman, Rozita wrote:
>>>> Hello all,
>>>>
>>>> we developed an OAI-Provider for Escidoc repositories.
>>>> Escidoc-OAI-Provider is based on the Fedora-OAI-Provider, which uses a
>>>> cache to reduce response time. Escidoc repositories are intended to contain
>>>> multiple millions of objects. The Escidoc-Core framework only requires
>>>> that object metadata stored in an Escidoc repository be well-formed
>>>> XML structures. Therefore, using a cache in the Escidoc-OAI-Provider
>>>> is essential to ensure the validity of metadata in OAI-PMH responses and an
>>>> acceptable response time.
>>>>
>>>> But the current OAI-PMH protocol specification doesn't account for some
>>>> issues, caused by the employment of a cache.
>>>>
>>>> The main problem is the time lag between a harvester request and the last
>>>> cache update:
>>>> A harvester asks the OAI-Provider for all records that have changed
>>>> between T0 and T2 in the underlying repository. The last cache update
>>>> was at T1. The harvester gets the records that changed between T0 and
>>>> T1, but assumes that it got all changes between T0 and T2. Therefore, in
>>>> the next request it asks for records that have changed between T2 and T3
>>>> and misses all changes between T1 and T2. If the cache update interval
>>>> is long and the next cache update takes place after T3, the harvester
>>>> also misses all changes between T2 and T3, and so on.
>>>>
>>>> One proposal would be to put a date stamp of the last cache update into
>>>> the OAI-PMH response, in order to inform a harvester about possibly
>>>> missed records.
>>>>
>>>> Does anybody face the same problem? What do you think about it? Maybe
>>>> there are better solutions for this problem?
>>>>
>>>> The other issue is that, depending on the OAI-Provider implementation, a
>>>> cache may be in an inconsistent state while a cache update process is
>>>> running. Are there means in the OAI-PMH protocol to respond to harvester
>>>> requests during a cache update? A possible solution would be to respond
>>>> with an HTTP status code 503 (Service Unavailable; section 3.1.2.2 of the
>>>> specification), but the problem is how to specify the Retry-After period.
>>>> The duration of a cache update is not constant; it depends on the changes
>>>> in the repository.
>>>>
>>>> Thanks a lot,
>>>> Rozita
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>> -------------------------------------------------------
>>>>
>>>> Fachinformationszentrum Karlsruhe, Gesellschaft für wissenschaftlich-technische Information mbH.
>>>> Registered office: Eggenstein-Leopoldshafen, Amtsgericht Mannheim (district court) HRB 101892.
>>>> Managing Director: Sabine Brünger-Weilandt.
>>>> Chairman of the Supervisory Board: MinR Hermann Riehl.
>>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>> _______________________________________________
>>>> OAI-implementers mailing list
>>>> List information, archives, preferences and to unsubscribe:
>>>> http://www.openarchives.org/mailman/listinfo/oai-implementers
>>>>
>>
>>
>
> ----
> Michael L. Nelson mln at cs.odu.edu http://www.cs.odu.edu/~mln/
> Dept of Computer Science, Old Dominion University, Norfolk VA 23529
> +1 757 683 6393 +1 757 683 4900 (f)
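
For the provider side, the workaround discussed above (OAI-PMH datestamps
taken from the cache update rather than from repository ingest,
earliestDatestamp set to the first cache update, and a 503 while an update
is running) might be sketched in Python roughly as below. The class, its
method names, and the Retry-After value are assumptions for illustration;
this is not the Fedora or Escidoc provider code. Datestamps are assumed to
be UTC strings of one granularity.

```python
RETRY_AFTER_SECONDS = 600  # assumed guess; another 503 can follow if the update is still running


class CachedProvider:
    """Toy provider whose OAI-PMH datestamps are cache-update times."""

    def __init__(self):
        self.cache = {}           # identifier -> datestamp of the update that surfaced it
        self.first_update = None  # reported as earliestDatestamp in Identify
        self.updating = False

    def begin_update(self):
        # While this flag is set the cache may be inconsistent.
        self.updating = True

    def finish_update(self, changed_ids, now):
        # Every record picked up by this update gets the update time `now`,
        # regardless of when it actually changed in the underlying repository.
        for identifier in changed_ids:
            self.cache[identifier] = now
        if self.first_update is None:
            self.first_update = now
        self.updating = False

    def earliest_datestamp(self):
        return self.first_update

    def list_identifiers(self, from_=None, until=None):
        """Return (HTTP status, headers, identifiers); from/until are inclusive."""
        if self.updating:
            return 503, {"Retry-After": str(RETRY_AFTER_SECONDS)}, []
        hits = sorted(
            identifier
            for identifier, stamp in self.cache.items()
            if (from_ is None or stamp >= from_)
            and (until is None or stamp <= until)
        )
        return 200, {}, hits
```

With cache updates at T0 and T3 only, a record that entered the repository
at T2 carries the datestamp T3 here, so a harvest from T2 to T3 still picks
it up.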

