[OAI-implementers] RE: [OAI-general] OAI interop site back up - 76 of 126 sites failing

Young,Jeff jyoung@oclc.org
Mon, 11 Nov 2002 10:28:33 -0500


I hope Alan doesn't mind, but I'm redirecting this to the oai-implementers
group.

I see that Alan hasn't been able to do a complete harvest of our XTCat
repository yet. I also know that he isn't alone in this failure. One
possible reason for this might be that the XTCat repository itself is flaky.
Another possibility, though, is that people are writing harvesters that
ignore the potential for unavoidable problems when harvesting large
repositories. With 4+ million records, XTCat is probably the largest
repository out there by far. My concern, then, is that OAI seems to be fine
for all the "toy" ;-) repositories out there now, but people planning large
repositories should be skeptical about its efficacy.

I wrote some suggestions on the XTCat home page
(http://alcme.oclc.org/xtcat/index.html) that I think can help, but I doubt
that anyone has adopted any of these suggestions. These suggestions are to:
1) retry failed requests a few times before giving up, 2) retry later with
the last resumptionToken rather than restart from scratch, and 3) write the
responses to a file for later processing rather than perform synchronous
updates to a database.
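
As a rough illustration of the three suggestions together, here is a minimal
Python sketch. It is not the code behind XTCat or any existing harvester. It
assumes an OAI-PMH 2.0 repository and the oai_dc metadata prefix, and the names
harvest, fetch_with_retry, next_token, and the last_token.txt state file are
all made up for this example.

```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumption: the repository speaks OAI-PMH 2.0 and uses this namespace.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def http_get(url):
    """Fetch one OAI response body as bytes."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        return resp.read()


def fetch_with_retry(url, fetch=http_get, attempts=3, delay=5.0):
    """Suggestion 1: retry a failed request a few times before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except OSError:  # urllib's URLError/HTTPError and socket timeouts are OSErrors
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)  # wait a little longer after each failure


def next_token(body):
    """Return the resumptionToken in a response, or None when the list is complete."""
    elem = ET.fromstring(body).find(f".//{OAI_NS}resumptionToken")
    if elem is None or not (elem.text or "").strip():
        return None
    return elem.text.strip()


def harvest(base_url, out_dir, fetch=http_get, attempts=3, delay=5.0):
    """Harvest ListRecords pages into out_dir and return the number of pages on disk."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    state = out / "last_token.txt"  # hypothetical state file for this sketch
    # Suggestion 2: if an earlier run died, continue from its last token
    # instead of restarting from scratch.
    token = state.read_text().strip() if state.exists() else None
    page = len(list(out.glob("page-*.xml")))
    while True:
        if token:
            query = {"verb": "ListRecords", "resumptionToken": token}
        else:
            query = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        url = base_url + "?" + urllib.parse.urlencode(query)
        body = fetch_with_retry(url, fetch, attempts, delay)
        token = next_token(body)  # parse first, so a garbled page is never saved
        # Suggestion 3: write the raw response to a file for later processing
        # rather than updating a database synchronously.
        (out / f"page-{page:06d}.xml").write_bytes(body)
        page += 1
        if token is None:
            state.unlink(missing_ok=True)  # finished, nothing left to resume
            return page
        state.write_text(token)
```

Calling harvest(base_url, "some-directory") again after a crash picks up from
the saved token, provided the repository still honours it. Loading the saved
pages into a database then becomes a separate step that can be rerun without
touching the network.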

Even if XTCat is occasionally flaky, which may or may not be the case, these
actions can greatly mitigate any problems that occur.

Jeff

> -----Original Message-----
> From: Alan Kent [mailto:ajk@mds.rmit.edu.au]
> Sent: Sunday, November 10, 2002 8:38 PM
> To: OAI-general@oaisrv.nsdl.cornell.edu
> Subject: [OAI-general] OAI interop site back up - 76 of 126 sites
> failing
> 
> 
> Hi all,
> 
> I tried to be clever when upgrading the test collection I build,
> and of course managed to mess it up completely. Rather than
> restore from backup, I decided to start afresh to clean up the
> logs etc too.
> 
> So I now have a fresh new database harvested from 126
> other sites, with log messages of things that went wrong for
> 76 of the sites. The database now has around 1.8 million records.
> 
> If you are a data provider and want to check to see if your
> collection worked, have a look at
> 
>     http://www.teratext.com.au:8123/public/log;collection=OAI
> 
> There is a drop down list of all the data provider names. I am afraid
> I fiddled with many of the names to group them more nicely in my list.
> In hindsight, this was probably a mistake. They are not repository names,
> and they are not the names listed on the www.openarchives.org site.
> 
> If you want to go straight to your site, put your name from the list
> below at the end of the following URL (remove the '...')
> 
>     http://www.teratext.com.au:8123/public/log;collection=OAI;dataSource=...
> 
> (Note: you can navigate from the home page http://www.teratext.com.au:8123/
> to the above URLs - just trying to save people time.)
> 
> Here is a list of sites that I think are working:
> 
>     CCSDthesis
>     CSTC.org
>     NUIM
>     RIACS
>     UBC.ca
>     UUdiva
>     anu.edu.au
>     arXiv.org
>     archiveSIC.ccsd.cnrs.fr
>     archives.anlc.uaf.edu
>     biomedcentral.com-bmc
>     caltechEERL.library.caltech.edu
>     caltechETD
>     caltechcstr.library.caltech.edu
>     cav2001.library.caltech.edu
>     cdlib.org-CDLCIAS
>     cdlib.org-CDLDERM
>     cdlib.org-CDLTC
>     cdlib.org-cdlib1
>     citebase.eprints.org
>     diglib.lib.auburn.edu
>     eldorado.uni-dortmund.de
>     enc.org
>     epub.wu-wein.ac.at
>     ethnologue.com-sil
>     ethnologue.com
>     formations2.ulst.ac.uk
>     hofprints.hofstra.edu
>     hray.com-ackarch
>     infomotions.com
>     jeanNicod.ccsd.cnrs.fr
>     lacito.archivage.vjf.cnrs.fr
>     linguistics.berkeley.edu-cbold
>     mpi.nl
>     nottingham.ac.uk
>     numismatics.org-ans
>     pastel.paristech.org
>     perseus.tufts.edu
>     physdoc
>     rdn.ac.uk
>     sammelpunkt.philo.at
>     ston.jsc.nasa.gov
>     tkn
>     uni-duisburg.de-DUETT
>     unimelb.edu.au-UMER
>     upenn.edu-celebration
>     upenn.edu-sceti
>     usu.edu-GenericEPrints
>     vt.edu-JCDLPix
>     vt.edu-ncstrlh
> 
> Here is a list of sites I am having trouble with (could be my bug). These
> include network timeout problems and sites that might not exist any more.
> (E.g., vt.edu-ENUMERATE says connection refused.) Please let me know if your
> site should be removed from the list.
> 
>     CEIAT
>     CULeuclid-test
>     CULeuclid
>     ELibBSU
>     HUBerlin.de
>     LSUETD
>     MONARCH
>     UDLAthesis
>     USF.edu
>     aim25.ac.uk
>     aisri.indiana.edu
>     alcme.oclc.org-etdcat
>     arizona.edu-GROW
>     asterix.lib.hku.hk-HKUTO
>     bis.uni-oldenburg.de
>     chemweb.com-CPS
>     cimi.org
>     cogprints.soton.ac.uk
>     conoze.com
>     cornell.edu-NSDL-DEV-CU
>     davidrumsey.com:8080
>     dispute.library.uu.nl
>     diss-epsilon.slu.se
>     edu.ioffe.ru
>     elib.suub.uni-bremen.de
>     emerge-dev.ncsa.uiuc.edu:80
>     eprints-dev.osti.gov
>     eprints.ecs.soton.ac.uk
>     glasgow-eprints
>     hbllmedia.lib.byu.edu
>     hsss.slub-dresden.de
>     ibiblio.org
>     in2p3.fr
>     indiana.edu-DLCommons
>     infsearch.cs.cmu.edu
>     language-archives.org-AlanTest
>     language-archives.org-EarlyMandarin
>     language-archives.org-Formosan
>     language-archives.org-SinicaCorpus
>     language-archives.org-applebytest
>     language-archives.org-cogdata
>     language-archives.org-scoil
>     language-archives.org-stevenbird
>     language-archives.org-talkbank
>     lcoa1.loc.gov
>     lib.umich.edu
>     ltrs.larc.nasa.gov
>     mathpreprints.com
>     naca.larc.nasa.gov
>     open-video.org
>     ota.ahds.ac.uk
>     pkp.ubc.ca
>     repec.openlib.org
>     scout.cs.wisc.edu
>     sunsite.utk.edu
>     techreports.larc.nasa.gov
>     theses.mit.edu
>     torc9.cs.utk.edu
>     ub.rug.nl
>     uiuc.edu
>     ukoln.ac.uk
>     umn.edu-UMIMAGES
>     uni-tuebingen.de
>     univ-lyon2.fr-ArchiveLyon2
>     univ-lyon2.fr-CyberTheses
>     upenn.edu-ATILF
>     upenn.edu-LDC
>     upenn.edu-aps
>     upenn.edu-dfki
>     upenn.edu-elra
>     vt.edu-ENUMERATE
>     vt.edu-ETDIndividuals
>     vt.edu-UKETD
>     vt.edu-VTETD
>     vt.edu-ndltdpapers
>     xtcat.oclc.org
> 
> Alan
> 
> ps: There is a query interface if anyone is interested, but there are
> other collections around so I am guessing probably not. (A query
> for 'the' across some of the fields found 690,000 records in
> 0.42 secs locally - no network delay.)
> 