[OAI-implementers] Using dates other than metadata record creation date for data provider "from" and "until" searches

Kat Hagedorn khage at umich.edu
Fri Apr 11 12:29:08 EDT 2008


Hi Rob,

But your harvester handles resumptionTokens, right? So, it's not 1 x 200,000
requests if those 200,000 are broken into rTs of 500 each, say. I'm
interested in how your harvesting system is set up that makes it easier to
harvest by multiple bounded date requests rather than one single request.
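
For anyone comparing the two approaches, here is a minimal Python sketch of the date-bounded style Rob describes. It rests on some assumptions: the base URL is a placeholder, the helper names `date_windows` and `list_records_url` are invented for illustration, and the repository is taken to support day granularity with an inclusive "from" and "until". Each window's response can still carry a resumptionToken that has to be followed, and windowing does not help if every record shares one datestamp.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def date_windows(earliest, latest, count):
    """Split [earliest, latest] into at most `count` contiguous day-level windows."""
    total_days = (latest - earliest).days + 1
    step = max(1, -(-total_days // count))  # ceiling division
    start = earliest
    while start <= latest:
        end = min(start + timedelta(days=step - 1), latest)
        yield start, end
        start = end + timedelta(days=1)  # next window starts the following day


def list_records_url(base_url, start, end, prefix="oai_dc"):
    """Build one date-bounded ListRecords request URL (hypothetical helper)."""
    query = urlencode({
        "verb": "ListRecords",
        "metadataPrefix": prefix,
        "from": start.isoformat(),
        "until": end.isoformat(),
    })
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # Placeholder endpoint; the earliest date would come from the Identify response.
    for start, end in date_windows(date(2008, 1, 1), date(2008, 4, 11), 50):
        print(list_records_url("http://example.org/oai", start, end))
```

The single-request approach skips the windowing and relies on resumptionTokens alone, so the practical difference is where a failed harvest can restart from.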

We use a homegrown solution (http://sourceforge.net/umoaitoolkit) which
we've tweaked as time goes on to handle server outages and such, but we
don't have trouble with most of our large harvests, e.g., HighWire has over
a million and as a rule takes about 8 hours to run from scratch.
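
As a rough illustration of the kind of outage handling meant here (a sketch under assumptions, not the actual toolkit code), a harvester can wrap each request in a wait-and-retry loop. The function name `fetch_with_retry`, the three attempts, and the 60-second wait are all illustrative choices; `fetch` stands for any callable that performs one HTTP request.

```python
import time


def fetch_with_retry(fetch, attempts=3, wait_seconds=60, sleep=time.sleep):
    """Call `fetch()` and retry after a pause when a network-level error occurs.

    `fetch` is any zero-argument callable that performs one request.
    OSError covers socket timeouts and urllib.error.URLError.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError as err:
            last_error = err
            if attempt < attempts:
                sleep(wait_seconds)  # give the server time to recover
    raise last_error
```

Passing `sleep` as a parameter keeps the waiting behaviour easy to replace in tests or to swap for a longer back-off.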

Regards,
 -Kat


On 4/11/08 12:20 PM, "Rob Tice" <rob.tice at k-int.com> wrote:

> Hi Kat
> 
> I am not saying that harvesting software shouldn't be robust or be able to
> recover from errors (where possible). However, I still reckon that
> performing some (pretty rudimentary) data analysis and adhering to a few
> basic principles when exposing your data, can make harvesting easier.
> 
> In my experience as a harvester (when dealing with large collections), I
> would much rather set up an instruction to sequentially issue (for example)
> 50 date bounded requests (starting with the earliest date stamp from the
> identify response). 50 x 4000 is much better than 1 x 200,000 IMHO :)
> 
> Cheers
> 
> Rob
> 
> 
> 
>> -----Original Message-----
>> From: Kat Hagedorn [mailto:khage at umich.edu]
>> Sent: 11 April 2008 15:01
>> To: Rob Tice; lisa at issuelab.com
>> Cc: oai-implementers at openarchives.org
>> Subject: Re: [OAI-implementers] Using dates other than metadata record
>> creation date for data provider "from" and "until" searches
>> 
>> I'm not sure I agree. Any robust harvester software should be able to
>> *initially* harvest regardless of datestamp.
>> 
>> Problems such as network connectivity, etc. should be addressed in the
>> harvesting software to allow as close to seamless harvesting as possible.
>> For example, software should have a timeout feature that provides for
>> network issues (e.g., waits 1 minute before trying again). If robust
>> software is not able to handle these problems, it's most likely an issue on
>> the data provider side, in my experience.
>> 
>> Regards,
>>  -Kat
>> 
>> 
>> On 4/11/08 5:53 AM, "Rob Tice" <rob.tice at k-int.com> wrote:
>> 
>>> Hi Lisa
>>> 
>>> I find that it is always helpful to put yourself in the shoes of someone
>>> who wants to harvest your data.
>>> 
>>> If the smallest record count that can be obtained from your system as the
>>> result of an initial OAI ListRecords request (including dates and/or sets)
>>> is very large, it can be quite difficult for harvesting systems to
>>> successfully complete an 'initial population' from your data without other
>>> influences (network connectivity, target response time, resumption token
>>> lifetime etc.) having an increasing bearing on the successful outcome of
>>> the harvest.
>>> 
>>> For example, having a repository containing 200,000 records, all dated the
>>> same day and not supporting a request granularity of less than 1 day makes
>>> initial population more difficult for any harvesting system :)
>>> 
>>> I do not know how many records you have so this may not be an issue for
>>> you but I think it is worth bearing in mind.
>>> 
>>> Cheers
>>> 
>>> Rob
>>> 
>>> 
>>> 
>>> 
>>> From: oai-implementers-bounces at openarchives.org
>>> [mailto:oai-implementers-bounces at openarchives.org] On Behalf Of Frederic
>>> MERCEUR
>>> Sent: 11 April 2008 07:41
>>> To: lisa at issuelab.com
>>> Cc: oai-implementers at openarchives.org
>>> Subject: Re: [OAI-implementers] Using dates other than metadata record
>>> creation date for data provider "from" and "until" searches
>>> 
>>> Hello,
>>> As far as I understand the OAI protocol, I would rather say that the
>>> datestamp is about the last time that your record was updated (so it must
>>> reflect "create", "update" or "delete").
>>> When you first register your archive with a harvester, I guess the
>>> harvester will first get all records available. To do so, it will query
>>> your archive without the "from" and "until" parameters.
>>> Then, most harvesters will regularly run incremental harvests to get the
>>> records modified, deleted or added since the previous harvest. To do so
>>> they will run the query with the "from" parameter.
>>> Kind regards,
>>> Fred
>>> 
>>> 
>>> Lisa M. Brooks wrote:
>>> Hello - We're very close to launching our data provider. Before we do I
>>> have a question about date-stamps.
>>> 
>>> I understand that the "from" and "until" dates used to request metadata
>>> records refer to the date that the metadata record was created. We are an
>>> archive of research works that date back to the 1980s (we will definitely
>>> get even older works into our archive as we move forward). To my mind it
>>> would be more helpful to folks if our record date-stamps reflect the date
>>> the research work in question was first published.
>>> 
>>> My concern is that we introduce our repository and harvesters don't get
>>> the gist of the temporal scope of our collection because everything is
>>> date-stamped en masse with the date that we generate our metadata records
>>> (which, with luck, will be this Saturday).
>>> 
>>> I hope I'm making sense! Just want to know if this is a big no-no, or if
>>> there are things to consider before doing something like this. Appreciate
>>> the insight of list participants.
>>> 
>>> Thanks for reading -
>>> ~Lisa
>>> 
>>> Lisa M. Brooks
>>> IssueLab - bringing nonprofit research into focus
>>> lisa at issuelab.org
>>> 773-649-1790
>>> http://www.issuelab.org
>>> 
>> 
>> 
>> -------------------
>> Kat Hagedorn
>> OAIster/Metadata Harvesting Librarian
>> DLXS Bibliographic Class Coordinator
>> Digital Library Production Service
>> University of Michigan
>> 
>> http://www.oaister.org/
>> http://www.dlxs.org/
>> email: khage at umich.edu
>> phone: 734-615-7618
>> 
>> 
> 
>  
> 

-------------------
Kat Hagedorn 
OAIster/Metadata Harvesting Librarian
DLXS Bibliographic Class Coordinator
Digital Library Production Service
University of Michigan

http://www.oaister.org/
http://www.dlxs.org/
email: khage at umich.edu
phone: 734-615-7618



