[OAI-implementers] Experience with large-scale harvesting

Hickey, Thom <hickey@oclc.org>
Fri, 13 Jun 2003 15:59:59 -0400


Since creating a one-page Python OAI-PMH harvester (see an improved, even
shorter, version at http://purl.oclc.org/net/hickey/oai/harvest.py), I've
been seeing how our OAI repositories perform on full harvests.
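
For readers who want the shape of the loop without following the link, here
is a minimal sketch of a ListRecords harvester along the same lines.  It is
not the harvest.py above; the endpoint path and output file name are
illustrative assumptions, and the regex parsing is only to keep the sketch
short.

    import re
    import urllib.parse
    import urllib.request

    def harvest(base, out_path="records.xml"):
        # Start a full ListRecords harvest in the oai_dc format.
        url = base + "?verb=ListRecords&metadataPrefix=oai_dc"
        with open(out_path, "w", encoding="utf-8") as out:
            while url:
                page = urllib.request.urlopen(url).read().decode("utf-8")
                out.write(page)
                # The repository signals more data with a resumptionToken;
                # an absent or empty token means the harvest is complete.
                m = re.search(r"<resumptionToken[^>]*>(.*?)</resumptionToken>",
                              page)
                token = m.group(1).strip() if m else ""
                url = (base + "?verb=ListRecords&resumptionToken="
                       + urllib.parse.quote(token)) if token else None

    if __name__ == "__main__":
        harvest("http://alcme.oclc.org/xtcat/OAIHandler")  # assumed endpoint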
 
OCLC Research runs two main repositories of metadata about theses and
dissertations:
 
XTCat (http://alcme.oclc.org/xtcat/) with some 4.3 million bibliographic
records
NDLTD (http://alcme.oclc.org/ndltd/) which has around 38,000 records.
 
My workstation can harvest XTCat in around 90 minutes if compression is used
(over a 10 megabit line).  Without compression it takes at least half again
as long, and my machine is much busier.  I was slightly surprised by the
compression ratios in bytes received: 8:1 for the larger database and 7:1
for the smaller.
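
In case it is useful, here is a sketch of how a harvester can ask for that
compression through ordinary HTTP content negotiation.  fetch() is an
illustrative helper name, not something from harvest.py.

    import gzip
    import urllib.request

    def fetch(url):
        # Advertise that we can accept gzip-compressed responses.
        req = urllib.request.Request(url,
                                     headers={"Accept-Encoding": "gzip"})
        with urllib.request.urlopen(req) as resp:
            data = resp.read()
            # Decompress only if the server actually chose to compress.
            if resp.headers.get("Content-Encoding") == "gzip":
                data = gzip.decompress(data)
        return data.decode("utf-8")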
 
Harvesting at home via a cable modem takes slightly less than 4 hours for
the same 4.3 million records.  That is about 300 records/second.  Each
record is about 1,000 bytes (uncompressed).
 
The 90 minute harvest works out to 800 records/second (800,000
bytes/second).  The best time observed for two simultaneous harvests was 120
minutes, which is 1,200 records/second aggregated across both.  The highest
rate observed was slightly more than 1,400 records/second when running four
simultaneous harvests, probably close to the maximum rate the repository can
support.
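
A sketch of how such a timing run might look, assuming the harvest()
function from the earlier sketch (the thread count and output file names
here are illustrative):

    import threading
    import time

    def timed_parallel_harvests(base, n):
        # Run n full harvests at once, each writing to its own file.
        threads = [threading.Thread(target=harvest,
                                    args=(base, "out%d.xml" % i))
                   for i in range(n)]
        start = time.time()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        print("%d harvests in %.1f minutes"
              % (n, (time.time() - start) / 60))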
 
Running multiple harvests simultaneously did expose a weakness in the
repository code, which would occasionally run out of memory.  We seem to
have that fixed now, but I expect that error recovery is important for
reliably accomplishing large harvests.
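
One simple form of that recovery is to retry a failed page from the last
resumptionToken rather than restarting the whole harvest.  A sketch, using
the fetch() helper from the compression example; the attempt count and
backoff are illustrative assumptions.

    import time

    def fetch_with_retry(url, attempts=5, backoff=30):
        for i in range(attempts):
            try:
                return fetch(url)
            except Exception:
                if i == attempts - 1:
                    raise
                # The resumptionToken embedded in `url` lets the harvest
                # resume where it left off once the server recovers.
                time.sleep(backoff * (i + 1))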
 
--Th
