[OAI-implementers] RE: [OAI-general] OAI interop site back up - 76 of 126 sites failing

Thomas G. Habing thabing@uiuc.edu
Mon, 11 Nov 2002 15:50:53 -0600


Hi all,

I haven't tried harvesting XTCat, but the strategy we have employed in our
harvester to deal with miscellaneous failures is to retry each failed
request, resubmitting the previous resumptionToken and exponentially
increasing the wait time between retries (after the fifth retry the wait
is approx. 25 minutes). We keep retrying until either the request is
handled successfully or our retry limit (5) has been reached, at which
point we throw an error and give up.
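
A minimal Python sketch of the retry strategy described above, not the
actual harvester code. The names (request_with_retry, retry_delays,
HarvestError) and the base_delay and factor values are assumptions made
for illustration; the message does not say what starting delay or growth
factor was used. The fetch argument is a hypothetical stand-in for
whatever function issues the OAI request for a given resumptionToken.

```python
import time


class HarvestError(Exception):
    """Raised when a request still fails after the retry limit."""


def retry_delays(base_delay, factor, max_retries):
    """Return the wait in seconds before each retry, growing exponentially."""
    return [base_delay * factor ** i for i in range(max_retries)]


def request_with_retry(fetch, resumption_token, max_retries=5,
                       base_delay=1.0, factor=2.0, sleep=time.sleep):
    """Call fetch(resumption_token), resubmitting the same token on failure.

    fetch is any callable that raises an exception when the request fails
    (assumed interface, not a real library call).
    """
    delays = retry_delays(base_delay, factor, max_retries)
    attempt = 0
    while True:
        try:
            return fetch(resumption_token)
        except Exception as exc:
            if attempt >= max_retries:
                # Retry limit reached: throw an error and give up.
                raise HarvestError(
                    "giving up after %d retries" % max_retries) from exc
            sleep(delays[attempt])  # wait longer before each retry
            attempt += 1
```

Because the same resumptionToken is resubmitted on every retry, a
transient failure does not force the harvest to restart from the
beginning. Passing sleep as a parameter is only a convenience so the
waits can be replaced when testing.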

We have found this to work pretty well, especially when harvesting very
large repositories.  For some repositories this seems to be the only way to
successfully harvest them in their entirety.

Regards,
	Tom

-- 
Thomas Habing
Research Programmer, Digital Library Projects
University of Illinois at Urbana-Champaign
155 Grainger Engineering Library Information Center, MC-274
thabing@uiuc.edu, (217) 244-4425
http://oai.grainger.uiuc.edu

"Young,Jeff" wrote:
> 
> I hope Alan doesn't mind, but I'm redirecting this to the oai-implementers
> group.
> 
> I see that Alan hasn't been able to do a complete harvest of our XTCat
> repository yet. I also know that he isn't alone in this failure. One
> possible reason for this might be that the XTCat repository itself is flaky.
> Another possibility, though, is that people are writing harvesters that
> ignore the potential for unavoidable problems when harvesting large
> repositories. With 4+ million records, XTCat is probably the largest
> repository out there by far. My concern, then, is that OAI seems to be fine
> for all the "toy" ;-) repositories out there now, but people planning large
> repositories should be skeptical about its efficacy.
> 
> I wrote some suggestions on the XTCat home page
> (http://alcme.oclc.org/xtcat/index.html) that I think can help, but I doubt
> that anyone has adopted any of these suggestions. These suggestions are to:
> 1) retry failed requests a few times before giving up, 2) retry later with
> the last resumptionToken rather than restart from scratch, and 3) write the
> responses to a file for later processing rather than perform synchronous
> updates to a database.
> 
> Even if XTCat is occasionally flaky, which may or may not be the case, these
> actions can greatly mitigate any problems that occur.
> 
> Jeff
> 
> > -----Original Message-----
> > From: Alan Kent [mailto:ajk@mds.rmit.edu.au]
> > Sent: Sunday, November 10, 2002 8:38 PM
> > To: OAI-general@oaisrv.nsdl.cornell.edu
> > Subject: [OAI-general] OAI interop site back up - 76 of 126 sites
> > failing
> >
> >
> > Hi all,
> >
> > I tried to be clever when upgrading the test collection I build,
> > and of course managed to mess it up completely. Rather than
> > restore from backup, I decided to start afresh to clean up the
> > logs etc too.
> >
> > So I now have a fresh new database harvested from 126
> > other sites, with log messages of things that went wrong for
> > 76 of the sites. The database now has around 1.8 million records.
> >
> > If you are a data provider and want to check to see if your
> > collection worked, have a look at
> >
> >     http://www.teratext.com.au:8123/public/log;collection=OAI
> >
> > There is a drop down list of all the data provider names. I am afraid
> > I fiddled with many of the names to group them more nicely in my list.
> > In hindsight, this was probably a mistake. They are not repository names,
> > and they are not the names listed on the www.openarchives.org site.
> >
> > If you want to go straight to your site, put your name from the list
> > below at the end of the following URL (remove the '...')
> >
> >
> > http://www.teratext.com.au:8123/public/log;collection=OAI;dataSource=...
> >
> > (Note: you can navigate from the home page
> > http://www.teratext.com.au:8123/
> > to the above URLs - just trying to save people time.)
> >
> > Here is a list of sites that I think are working:
> >
> >     CCSDthesis
> >     CSTC.org
> >     NUIM
> >     RIACS
> >     UBC.ca
> >     UUdiva
> >     anu.edu.au
> >     arXiv.org
> >     archiveSIC.ccsd.cnrs.fr
> >     archives.anlc.uaf.edu
> >     biomedcentral.com-bmc
> >     caltechEERL.library.caltech.edu
> >     caltechETD
> >     caltechcstr.library.caltech.edu
> >     cav2001.library.caltech.edu
> >     cdlib.org-CDLCIAS
> >     cdlib.org-CDLDERM
> >     cdlib.org-CDLTC
> >     cdlib.org-cdlib1
> >     citebase.eprints.org
> >     diglib.lib.auburn.edu
> >     eldorado.uni-dortmund.de
> >     enc.org
> >     epub.wu-wein.ac.at
> >     ethnologue.com-sil
> >     ethnologue.com
> >     formations2.ulst.ac.uk
> >     hofprints.hofstra.edu
> >     hray.com-ackarch
> >     infomotions.com
> >     jeanNicod.ccsd.cnrs.fr
> >     lacito.archivage.vjf.cnrs.fr
> >     linguistics.berkeley.edu-cbold
> >     mpi.nl
> >     nottingham.ac.uk
> >     numismatics.org-ans
> >     pastel.paristech.org
> >     perseus.tufts.edu
> >     physdoc
> >     rdn.ac.uk
> >     sammelpunkt.philo.at
> >     ston.jsc.nasa.gov
> >     tkn
> >     uni-duisburg.de-DUETT
> >     unimelb.edu.au-UMER
> >     upenn.edu-celebration
> >     upenn.edu-sceti
> >     usu.edu-GenericEPrints
> >     vt.edu-JCDLPix
> >     vt.edu-ncstrlh
> >
> > Here is a list of sites I am having trouble with (could be my bug).
> > These include network timeout problems and sites that might not exist
> > any more. (Eg: vt.edu-ENUMERATE says connection refused.) Please let me
> > know if your site should be removed from the list.
> >
> >     CEIAT
> >     CULeuclid-test
> >     CULeuclid
> >     ELibBSU
> >     HUBerlin.de
> >     LSUETD
> >     MONARCH
> >     UDLAthesis
> >     USF.edu
> >     aim25.ac.uk
> >     aisri.indiana.edu
> >     alcme.oclc.org-etdcat
> >     arizona.edu-GROW
> >     asterix.lib.hku.hk-HKUTO
> >     bis.uni-oldenburg.de
> >     chemweb.com-CPS
> >     cimi.org
> >     cogprints.soton.ac.uk
> >     conoze.com
> >     cornell.edu-NSDL-DEV-CU
> >     davidrumsey.com:8080
> >     dispute.library.uu.nl
> >     diss-epsilon.slu.se
> >     edu.ioffe.ru
> >     elib.suub.uni-bremen.de
> >     emerge-dev.ncsa.uiuc.edu:80
> >     eprints-dev.osti.gov
> >     eprints.ecs.soton.ac.uk
> >     glasgow-eprints
> >     hbllmedia.lib.byu.edu
> >     hsss.slub-dresden.de
> >     ibiblio.org
> >     in2p3.fr
> >     indiana.edu-DLCommons
> >     infsearch.cs.cmu.edu
> >     language-archives.org-AlanTest
> >     language-archives.org-EarlyMandarin
> >     language-archives.org-Formosan
> >     language-archives.org-SinicaCorpus
> >     language-archives.org-applebytest
> >     language-archives.org-cogdata
> >     language-archives.org-scoil
> >     language-archives.org-stevenbird
> >     language-archives.org-talkbank
> >     lcoa1.loc.gov
> >     lib.umich.edu
> >     ltrs.larc.nasa.gov
> >     mathpreprints.com
> >     naca.larc.nasa.gov
> >     open-video.org
> >     ota.ahds.ac.uk
> >     pkp.ubc.ca
> >     repec.openlib.org
> >     scout.cs.wisc.edu
> >     sunsite.utk.edu
> >     techreports.larc.nasa.gov
> >     theses.mit.edu
> >     torc9.cs.utk.edu
> >     ub.rug.nl
> >     uiuc.edu
> >     ukoln.ac.uk
> >     umn.edu-UMIMAGES
> >     uni-tuebingen.de
> >     univ-lyon2.fr-ArchiveLyon2
> >     univ-lyon2.fr-CyberTheses
> >     upenn.edu-ATILF
> >     upenn.edu-LDC
> >     upenn.edu-aps
> >     upenn.edu-dfki
> >     upenn.edu-elra
> >     vt.edu-ENUMERATE
> >     vt.edu-ETDIndividuals
> >     vt.edu-UKETD
> >     vt.edu-VTETD
> >     vt.edu-ndltdpapers
> >     xtcat.oclc.org
> >
> > Alan
> >
> > ps: There is a query interface if anyone is interested, but there are
> > other collections around so I am guessing probably not. (A query
> > for 'the' across some of the fields found 690,000 records in
> > 0.42 secs locally - no network delay.)