[OAI-implementers] A 'one-page' harvester in Python

Hickey,Thom hickey@oclc.org
Fri, 6 Jun 2003 16:56:04 -0400


This message is in MIME format. Since your mail reader does not understand
this format, some or all of this message may not be legible.

------_=_NextPart_000_01C32C6E.0B0B1C34
Content-Type: text/plain;
	charset="iso-8859-1"

I've attached a one page (or at least it was one page until I put our legal
text into it!) Python script that will pull records from a repository and
dump them to a file.  In spite of its length it:

  o Handles resumption tokens
  o Notices OAI errors
  o Supports compression
  o Respects 503 Retry-After's

It doesn't know much about XML, though, so the file created is just a
collection of the downloaded XML responses, and the only metadata format it
asks for is oai_dc, even though it does ask the repository for the metadata
formats supported.  Sets are ignored, but would be fairly easy to add.

I tested it using Python 2.2.2 under Windows 2000 against several
repositories.

It is invoked by:
  python harverst.py [repository-address outputfile]
e.g.:
  python harvest.py alcme.oclc.org/ndltd/servlet/OAIHandler ndltd.out

If you just run the script without parameters it defaults the NDLTD
repository (around 39,000 digital thesis and dissertation records).

Anyway, I thought it was interesting to see how much could be done in less
than 60 lines.

--Th


------_=_NextPart_000_01C32C6E.0B0B1C34
Content-Type: application/octet-stream;
	name="harvest.py"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;
	filename="harvest.py"

import sys, urllib2, zlib, time, re=0A=
## Copyright (c) 2000-2003 OCLC Online Computer Library Center, Inc. =
and other=0A=
## contributors.  All rights reserved.  The contents of this file, as =
updated=0A=
## from time to time by the OCLC Office of Research are subject to =
OCLC=0A=
## Research Public License Version 2.0 (the "License"); you may not use =
this=0A=
## file except in compliance with the License.  You may obtain a =
current copy=0A=
## of the License at http://purl.oclc.org/oclc/research/ORPL/.  =
Software=0A=
## distributed under the License is distributed on an "AS IS" basis, =
WITHOUT=0A=
## WARRANTY OF ANY KIND, either express or implied.  See the License =
for the=0A=
## specific language governing rights and limitations under the =
License.  This=0A=
## software consists of voluntary contributions made by many =
individuals on=0A=
## behalf of OCLC Research.  For more information on OCLC Research, =
please see=0A=
## http://www.oclc.org/research/.  This is the Original Code.  The =
Initial=0A=
## Developer of the Original Code is Thomas Hickey =
(mailto:hickey@oclc.org).=0A=
## Portions created by OCLC are Copyright (C) 2003.  All Rights =
Reserved.=0A=
=0A=
def getResumptionToken(data):=0A=
    mo =3D re.search('<resumptionToken[^>]*>(.*)</resumptionToken>', =
data)=0A=
    if mo: return mo.group(1)=0A=
def getFile(serverString, command, verbose=3D1):=0A=
    remoteAddr =3D serverString+'?verb=3D%s'%command=0A=
    if verbose: print "getFile '%s'"%remoteAddr=0A=
    headers =3D {'User-Agent': 'OAIHarvester/2.0',=0A=
               'Accept': 'text/html',=0A=
               'Accept-Encoding': 'compress, deflate'}=0A=
    try:=0A=
        req =3D urllib2.Request(remoteAddr, None, headers)=0A=
        remoteFile =3D urllib2.urlopen(req)=0A=
        remoteData =3D remoteFile.read()=0A=
        remoteFile.close()=0A=
    except urllib2.HTTPError, exValue:=0A=
        if exValue.code=3D=3D503:=0A=
            retryWait =3D int(exValue.hdrs.get("Retry-After", "-1"))=0A=
            if retryWait<0: return None=0A=
            print 'Waiting %d seconds'%retryWait=0A=
            time.sleep(retryWait)=0A=
            return getFile(serverString, command, 0)=0A=
        print exValue=0A=
        return None=0A=
    try:=0A=
        remoteData =3D zlib.decompressobj().decompress(remoteData)=0A=
    except:=0A=
        pass=0A=
    mo =3D re.search('<error *code=3D\"([^"]*)">(.*)</error>', =
remoteData)=0A=
    if mo:=0A=
        print >>sys.stderr,"OAIERROR: code=3D%s '%s'"%(mo.group(1), =
mo.group(2))=0A=
        sys.exit(1)=0A=
    return remoteData=0A=
def writeWithLF(ofile, data):=0A=
    if not data: return=0A=
    ofile.write(data)=0A=
    if data[-1]!=3D'\n': ofile.write('\n')=0A=
def writeRecords(outFile, serverString, mdformat, sDate=3DNone, =
eDate=3DNone):=0A=
    if not sDate and not eDate:=0A=
        verb=3D'ListRecords&metadataPrefix=3D%s'%(mdformat)=0A=
    else:=0A=
        =
verb=3D'ListRecords&metadataPrefix=3D%s&from=3D%s&until=3D%s'%(mdformat,=
 sDate, eDate)=0A=
    data =3D getFile(serverString, verb)=0A=
    while data:=0A=
        writeWithLF(outFile, data)=0A=
        reTok =3D getResumptionToken(data)=0A=
        if not reTok: break=0A=
        data =3D getFile(serverString, =
"ListRecords&resumptionToken=3D%s"%reTok)=0A=
if __name__=3D=3D"__main__":=0A=
    try:    serverName, outName =3D sys.argv[1:]=0A=
    except: serverName, outName =3D =
'alcme.oclc.org/ndltd/servlet/OAIHandler', 'harvest.out'=0A=
    serverString =3D 'http://%s'%serverName=0A=
    print "Writing to file %s from archive at %s"%(outName, =
serverName)=0A=
    outFile =3D file(outName, 'wb')=0A=
    writeWithLF(outFile, getFile(serverString, 'Identify'))=0A=
    writeWithLF(outFile, getFile(serverString, =
'ListMetadataFormats'))=0A=
    writeRecords(outFile, serverString, 'oai_dc')=0A=

------_=_NextPart_000_01C32C6E.0B0B1C34--