[OAI-implementers] Using dates other than metadata record creation date for data provider "from" and "until" searches

Kat Hagedorn khage at umich.edu
Fri Apr 11 15:59:29 EDT 2008


Well, I would never consider myself an optimist, actually.  ;)

But, while I do understand your approach, in my experience there have been
very, very few unrecoverable errors when harvesting large repositories. If
there is an error of this type, it's generally because the data repository
server has failed, and consequently has nothing to do with rTs, persistence
or anything of that ilk. And that will be a problem whether we pick the
first scenario or the second one.

I suppose it comes down to how you've built your software-- I had never
thought of the date-bounded approach, which is an interesting and
potentially less risky method. (Or more risky because certain datestamps may
be missed...to come back full circle.)
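
To make the date-bounded approach concrete, here is a minimal sketch of splitting a datestamp range into consecutive windows, one ListRecords request per window. It assumes day granularity and that the earliest datestamp has already been read from the Identify response; date_windows and its parameters are hypothetical names, not part of any toolkit mentioned in this thread. Because "from" and "until" are both inclusive in OAI-PMH, each window starts the day after the previous one ends, so no datestamp falls between two windows or is requested twice.

```python
from datetime import date, timedelta


def date_windows(earliest, latest, count):
    """Split [earliest, latest] into at most `count` contiguous windows.

    Both ends of every window are inclusive, so consecutive windows
    never share a day and never leave a gap.
    """
    total_days = (latest - earliest).days + 1
    size = -(-total_days // count)  # ceiling division: days per window
    windows = []
    start = earliest
    while start <= latest:
        end = min(start + timedelta(days=size - 1), latest)
        windows.append((start.isoformat(), end.isoformat()))
        start = end + timedelta(days=1)  # next window begins the following day
    return windows
```

Each (from, until) pair is then harvested as its own complete request, so a failure only costs a re-harvest of the window in progress.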

 -Kat


On 4/11/08 2:29 PM, "Rob Tice" <rob.tice at k-int.com> wrote:

> Hi Kat
> 
> I am not advocating a different harvesting methodology; I am just approaching
> it from a different perspective. In my 'OAI world' (and based on my experiences)
> not all targets are perfect and not all errors are recoverable. If all OAI
> targets were created equal and performed faultlessly then there would not be a
> strategy to discuss.
> 
> My experience is that they aren't and that the devil (as with everything) is
> in the implementation.
> 
> So as I see it the difference between 50 x 4000 and 1 x 200,000 is:
> 
> In the first scenario, you have an entirely complete OAI request after every
> 4000 records and any unrecoverable failure (of either the harvester or target)
> results in a re-harvest of 4000 records - worst case rollback 3999 records
> (even after 199,999 have been harvested)
> 
> In the second scenario,  any unrecoverable failure results in a re-harvest of
> 200,000 records -  worst case rollback of 199,999 records. In addition the
> repository has to deal with the burden of persistence and resumption for a
> great deal longer which arguably increases both its workload and the chance of
> failure.
> 
> In our imperfect world and given that I am not a betting man I would always
> choose the first approach if it were available to me.
> 
> Does this make me a pessimist :) ?
> 
> 
> Cheers
> 
> Rob
> 
>> -----Original Message-----
>> From: Kat Hagedorn [mailto:khage at umich.edu]
>> Sent: 11 April 2008 17:29
>> To: Rob Tice
>> Cc: oai-implementers at openarchives.org
>> Subject: Re: [OAI-implementers] Using dates other than metadata record
>> creation date for data provider "from" and "until" searches
>> 
>> Hi Rob,
>> 
>> But your harvester handles resumptionTokens, right? So, it's not 1 x 200,000
>> requests if those 200,000 are broken into rTs of 500 each, say. I'm
>> interested in how your harvesting system is set up that makes it easier to
>> harvest by multiple bounded date requests rather than one single request.
>> 
>> We use a homegrown solution (http://sourceforge.net/umoaitoolkit) which
>> we've tweaked as time goes on to handle server outages and such, but we
>> don't have trouble with most of our large harvests, e.g., HighWire has over
>> a million records and as a rule takes about 8 hours to run from scratch.
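
For readers unfamiliar with resumptionTokens, the following is a rough sketch of how a harvester might follow them through one large ListRecords request. It is not the code of the toolkit mentioned above. The fetch callable is a hypothetical stand-in for whatever performs the HTTP request and returns the response body as an XML string; only the element names and namespace come from the OAI-PMH schema.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest(fetch, first_params):
    """Collect record identifiers across every page of one ListRecords request.

    `fetch` (hypothetical) takes a dict of request parameters and returns
    the response body as an XML string.
    """
    identifiers = []
    params = dict(first_params)
    while True:
        root = ET.fromstring(fetch(params))
        for header in root.iter(OAI_NS + "header"):
            identifiers.append(header.findtext(OAI_NS + "identifier"))
        token = root.find(".//" + OAI_NS + "resumptionToken")
        if token is None or not (token.text or "").strip():
            return identifiers  # no token, or an empty one, ends the list
        # A follow-up request carries only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

If any page fails beyond recovery, everything collected under that request has to be fetched again, which is the trade-off being discussed in this thread.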
>> 
>> Regards,
>>  -Kat
>> 
>> 
>> On 4/11/08 12:20 PM, "Rob Tice" <rob.tice at k-int.com> wrote:
>> 
>>> Hi Kat
>>> 
>>> I am not saying that harvesting software shouldn't be robust or be able to
>>> recover from errors (where possible). However, I still reckon that
>>> performing some (pretty rudimentary) data analysis and adhering to a few
>>> basic principles when exposing your data, can make harvesting easier.
>>>
>>> In my experience as a harvester (when dealing with large collections), I
>>> would much rather set up an instruction to sequentially issue (for example)
>>> 50 date bounded requests (starting with the earliest date stamp from the
>>> identify response). 50 x 4000 is much better than 1 x 200,000 IMHO :)
>>> 
>>> Cheers
>>> 
>>> Rob
>>> 
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: Kat Hagedorn [mailto:khage at umich.edu]
>>>> Sent: 11 April 2008 15:01
>>>> To: Rob Tice; lisa at issuelab.com
>>>> Cc: oai-implementers at openarchives.org
>>>> Subject: Re: [OAI-implementers] Using dates other than metadata record
>>>> creation date for data provider "from" and "until" searches
>>>> 
>>>> I'm not sure I agree. Any robust harvester software should be able to
>>>> *initially* harvest regardless of datestamp.
>>>>
>>>> Problems such as network connectivity, etc. should be addressed in the
>>>> harvesting software to allow as close to seamless harvesting as possible.
>>>> For example, software should have a timeout feature that provides for
>>>> network issues (e.g., waits 1 minute before trying again). If robust
>>>> software is not able to handle these problems, it's most likely an issue on
>>>> the data provider side, in my experience.
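
As an illustration of the wait-and-retry behaviour described above, here is a minimal sketch. The function name, the number of attempts and the injectable sleep argument are assumptions made for the example; only the one-minute wait comes from the message. The fetch callable is a hypothetical stand-in for a single HTTP request.

```python
import time


def fetch_with_retry(fetch, attempts=5, wait_seconds=60, sleep=time.sleep):
    """Call `fetch()` until it succeeds or `attempts` tries are used up."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError:  # covers socket timeouts and connection errors
            if attempt == attempts:
                raise  # out of retries: the caller treats this as unrecoverable
            sleep(wait_seconds)  # wait before trying again
```

Passing sleep as an argument keeps the sketch easy to exercise without actually waiting a minute between attempts.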
>>>> 
>>>> Regards,
>>>>  -Kat
>>>> 
>>>> 
>>>> On 4/11/08 5:53 AM, "Rob Tice" <rob.tice at k-int.com> wrote:
>>>> 
>>>>> Hi Lisa
>>>>> 
>>>>> I find that it is always helpful to put yourself in the shoes of someone who
>>>>> wants to harvest your data.
>>>>>
>>>>> If the smallest record count that can be obtained from your system as the
>>>>> result of an initial OAI ListRecords request (including dates and/or sets) is
>>>>> very large, it can be quite difficult for harvesting systems to successfully
>>>>> complete an 'initial population' from your data without other influences
>>>>> (network connectivity, target response time, resumption token lifetime etc.)
>>>>> having an increasing bearing on the successful outcome of the harvest.
>>>>>
>>>>> For example, having a repository containing 200,000 records, all dated the
>>>>> same day and not supporting a request granularity of less than 1 day makes
>>>>> initial population more difficult for any harvesting system :).
>>>>>
>>>>> I do not know how many records you have so this may not be an issue for you
>>>>> but I think it is worth bearing in mind.
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> Rob
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> From: oai-implementers-bounces at openarchives.org
>>>>> [mailto:oai-implementers-bounces at openarchives.org] On Behalf Of
>>>>> Frederic MERCEUR
>>>>> Sent: 11 April 2008 07:41
>>>>> To: lisa at issuelab.com
>>>>> Cc: oai-implementers at openarchives.org
>>>>> Subject: Re: [OAI-implementers] Using dates other than metadata record
>>>>> creation date for data provider "from" and "until" searches
>>>>> 
>>>>> Hello,
>>>>> As far as I understand the OAI protocol, I would rather say that the
>>>>> datestamp is about the last time that your record was updated (so it must
>>>>> reflect a "create", "update" or "delete").
>>>>> When you first register your archive with a harvester, I guess the
>>>>> harvester will first get all the records available. To do so, it will query
>>>>> your archive without the "from" and "until" parameters.
>>>>> Then, most harvesters will regularly run an incremental harvest to get
>>>>> the records modified, deleted or added since the previous harvest. To do so
>>>>> they will run the query with the "from" parameter.
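
The two kinds of request described above can be sketched as follows. The helper name and the example base URL are made up for illustration; the verb and the metadataPrefix, from and until arguments are the standard OAI-PMH ones.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, until_date=None):
    """Build a ListRecords request URL, adding date bounds only when given."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    return base_url + "?" + urlencode(params)


# Initial harvest: no date bounds, so every record is returned.
initial = list_records_url("http://example.org/oai")

# Incremental harvest: only records whose datestamp is on or after the last run.
incremental = list_records_url("http://example.org/oai", from_date="2008-04-11")
```

This is also why the datestamp needs to track when the record last changed: an incremental request with "from" would never pick up a newly added record that was stamped with a publication date from the 1980s.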
>>>>> Kind regards,
>>>>> Fred
>>>>> 
>>>>> 
>>>>> Lisa M. Brooks wrote:
>>>>> Hello - We're very close to launching our data provider. Before we do I have a
>>>>> question about date-stamps.
>>>>>
>>>>> I understand that the "from" and "until" dates used to request metadata
>>>>> records refer to the date that the metadata record was created. We are an
>>>>> archive of research works that date back to the 1980s (we will definitely get
>>>>> even older works into our archive as we move forward). To my mind it would be
>>>>> more helpful to folks if our record date-stamps reflect the date the research
>>>>> work in question was first published.
>>>>>
>>>>> My concern is that we introduce our repository and harvesters don't get the
>>>>> gist of the temporal scope of our collection because everything is
>>>>> date-stamped en masse with the date that we generate our metadata records
>>>>> (which, with luck, will be this Saturday).
>>>>>
>>>>> I hope I'm making sense! Just want to know if this is a big no-no, or if there
>>>>> are things to consider before doing something like this. Appreciate the
>>>>> insight of list participants.
>>>>> 
>>>>> Thanks for reading -
>>>>> ~Lisa
>>>>> 
>>>>> Lisa M. Brooks
>>>>> IssueLab - bringing nonprofit research into focus
>>>>> lisa at issuelab.org
>>>>> 773-649-1790
>>>>> http://www.issuelab.org
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> OAI-implementers mailing list
>>>>> List information, archives, preferences and to unsubscribe:
>>>>> http://www.openarchives.org/mailman/listinfo/oai-implementers
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> -------------------
>>>> Kat Hagedorn
>>>> OAIster/Metadata Harvesting Librarian
>>>> DLXS Bibliographic Class Coordinator
>>>> Digital Library Production Service
>>>> University of Michigan
>>>> 
>>>> http://www.oaister.org/
>>>> http://www.dlxs.org/
>>>> email: khage at umich.edu
>>>> phone: 734-615-7618
>>>> 
>>> 
>> 
> 

-------------------
Kat Hagedorn 
OAIster/Metadata Harvesting Librarian
DLXS Bibliographic Class Coordinator
Digital Library Production Service
University of Michigan

http://www.oaister.org/
http://www.dlxs.org/
email: khage at umich.edu
phone: 734-615-7618



