[OAI-implementers] Simple XSLT OAI harvester

Young,Jeff jyoung@oclc.org
Sun, 16 Nov 2003 13:58:06 -0500


------_=_NextPart_000_01C3AC73.91F6DF3A
Content-Type: text/plain;
	charset="iso-8859-1"

It occurred to me on Friday that it would be possible to write an extremely
simple OAI harvester using XSLT, so I skipped my morning break and wrote one
that I am now making available as open source. The result is the
"xoaiharvester.xsl" stylesheet (attached), which is only 80 lines long
(excluding the disclaimer). It supports only OAI-PMH v2.0 repositories for
now; let me know if you want to use it with other versions.

It is driven by a "harvest.pl" Perl script (attached) that is about 25
lines long (excluding the disclaimer). This script is also responsible
for managing the from/until dates needed for incremental harvesting. If
you don't like Perl, you could rewrite it in any other scripting language
pretty easily. To run it the way I have it set up, you will need to have
Java installed and the xalan.jar file in your classpath. If you don't like
Java or Xalan, you should be able to make minor changes to the Perl script
to invoke the XSLT transformation in some other way.
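
For anyone who would rather not use Perl, the driver logic described above
might look roughly like this in Python. This is only a sketch of the same
idea, not the attached script: the function names read_last_harvested and
build_xalan_command are made up for illustration, and the argument list
simply mirrors the Xalan invocation that harvest.pl uses.

```python
import os


def read_last_harvested(repo):
    """Return the date saved by the previous run, or '' on a first harvest."""
    path = repo + ".lastHarvested"  # same file name harvest.pl uses
    if not os.path.exists(path):
        return ""
    with open(path) as f:
        return f.readline().strip()


def build_xalan_command(repo, from_date, until_date):
    """Build the Xalan command line as an argument list (hypothetical helper)."""
    cmd = [
        "java", "org.apache.xalan.xslt.Process",
        "-IN", repo + ".xml",
        "-XSL", "xoaiharvester.xsl",
        "-OUT", "%s.%s.xml" % (repo, until_date),
    ]
    if from_date:  # omit 'from' on the first (full) harvest
        cmd += ["-PARAM", "from", from_date]
    cmd += ["-PARAM", "until", until_date]
    return cmd
```

A caller would pass the returned list to something like subprocess.call and
then write until_date back to the ".lastHarvested" file, as harvest.pl does.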

To operate it, you need a configuration file for each repository to be
harvested. Attached is an "xtcat.oclc.org.xml" file to use as an example.
The "baseURL" and "metadataPrefix" elements are required, but the "set"
element is optional.
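
To make the role of the three configuration elements concrete, here is a
rough Python sketch of how they could map onto an OAI-PMH v2.0 request URL.
It is not taken from the stylesheet: the function name list_records_url is
hypothetical, and the sketch assumes the harvest is issued with the
ListRecords verb. The namespace and values come from the attached example
file.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace taken from the attached example configuration file.
XOAIH = "{http://errol.oclc.org/xmlregistry.oclc.org/xoai/xoaiharvester}"


def list_records_url(config_xml, from_date=None, until_date=None):
    """Build an OAI-PMH v2.0 ListRecords URL from a repository config string."""
    root = ET.fromstring(config_xml)
    params = [
        ("verb", "ListRecords"),  # assumed verb for the harvest
        ("metadataPrefix", root.findtext(XOAIH + "metadataPrefix")),
    ]
    set_spec = root.findtext(XOAIH + "set")  # optional element
    if set_spec:
        params.append(("set", set_spec))
    if from_date:
        params.append(("from", from_date))
    if until_date:
        params.append(("until", until_date))
    return root.findtext(XOAIH + "baseURL") + "?" + urlencode(params)


EXAMPLE = """<xoaih:repository
    xmlns:xoaih="http://errol.oclc.org/xmlregistry.oclc.org/xoai/xoaiharvester">
  <xoaih:baseURL>http://alcme.oclc.org/xtcat/servlet/OAIHandler</xoaih:baseURL>
  <xoaih:metadataPrefix>oai_etdms</xoaih:metadataPrefix>
  <xoaih:set>ETD</xoaih:set>
</xoaih:repository>"""

print(list_records_url(EXAMPLE, until_date="2003-11-16"))
```

Leaving out the "set" element simply drops the set argument from the
request, which is why it can be optional.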

The command to run it could then be placed in a cron job to perform the
incremental harvest:

perl harvest.pl xtcat.oclc.org

The result is two files: "xtcat.oclc.org.YYYY-MM-DD.xml" (containing the
harvest results) and "xtcat.oclc.org.lastHarvested" (containing the date to
use for the next incremental harvest). Doing something with the results file
is left as an exercise for the user. :-)

Now, I'm thinking I could create an OAI repository using XSLT that is almost
this simple. My boss, Thom Hickey, wrote an OAI repository with only 2 pages
of (rather dense) Python code. I'm thinking maybe I can beat this and even
have some white space to spare. :-) I'll post it if I get some spare time to
try it.

Jeff

 <<xoaiharvester.xsl>>  <<xtcat.oclc.org.xml>>  <<harvest.pl>> 

---
Jeffrey A. Young
Software Architect
Office of Research, Mail Code 710
OCLC Online Computer Library Center, Inc.
6565 Frantz Road
Dublin, OH   43017-3395
www.oclc.org

Voice:	614-764-4342
Voice:	800-848-5878, ext. 4342
Fax:	614-718-7477
Email:	jyoung@oclc.org



------_=_NextPart_000_01C3AC73.91F6DF3A
Content-Type: application/octet-stream;
	name="xoaiharvester.xsl"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
	filename="xoaiharvester.xsl"

<?xml version="1.0" encoding="utf-8"?>

<!--
     Copyright (c) 2000-2003 OCLC Online Computer Library Center,
     Inc. and other contributors. All rights reserved. The contents of
     this file, as updated from time to time by the OCLC Office of
     Research are subject to OCLC Research Public License Version 2.0
     (the "License"); you may not use this file except in compliance
     with the License.  You may obtain a current copy of the License
     at http://purl.org/oclc/research/ORPL/. Software distributed
     under the License is distributed on an "AS IS" basis, WITHOUT
     WARRANTY OF ANY KIND, either express or implied.  See the License
     for the specific language governing rights and limitations under
     the License.  This software consists of voluntary contributions
     made by many individuals on behalf of OCLC Research.  For more
     information on OCLC Research, please see
     http://www.oclc.org/research/.
     
     This is the Original Code.
     The Initial Developers of the Original Code is Jeffrey Young (mailto:jyoung@oclc.org).
     Portions created by OCLC are Copyright (C) 2003.  All Rights Reserved.
     (version: 2003 November 14)
-->

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- [The archived copy is missing the text between the xsl namespace
       declaration and the body of the template below, presumably the
       stylesheet parameters and the template that issues the first
       request. The opening tag that follows is inferred from context.] -->

  <xsl:template match="oai20:OAI-PMH"
    xmlns:oai20="http://www.openarchives.org/OAI/2.0/">
    <xsl:variable name="resumptionToken"
      select="oai20:*/oai20:resumptionToken" />
    
    <xsl:copy-of select="." />

    <xsl:apply-templates select="oai20:error" />
    
    <xsl:if test="$resumptionToken">
      <xsl:message>
        <xsl:value-of select="$resumptionToken" />
      </xsl:message>
      <xsl:apply-templates select="document(concat(oai20:request,
                                   '?verb=',
                                   oai20:request/@verb,
                                   '&amp;resumptionToken=',
                                   $resumptionToken))" />
    </xsl:if>
  </xsl:template>

  <!-- report problems -->
  
  <xsl:template match="oai20:error[not(../oai20:request/@verb='ListSets')]" xmlns:oai20="http://www.openarchives.org/OAI/2.0/">
    <xsl:message>
      <xsl:value-of select="@code" />
      <xsl:text> : </xsl:text>
      <xsl:value-of select="." />
    </xsl:message>
  </xsl:template>
    
  <!-- strip out stylesheet references -->
  
  <xsl:template match="processing-instruction('xml-stylesheet')" />
  
</xsl:stylesheet>

------_=_NextPart_000_01C3AC73.91F6DF3A
Content-Type: application/octet-stream;
	name="xtcat.oclc.org.xml"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
	filename="xtcat.oclc.org.xml"

<?xml version="1.0" encoding="UTF-8"?>
<xoaih:repository xmlns:xoaih="http://errol.oclc.org/xmlregistry.oclc.org/xoai/xoaiharvester">
  <xoaih:baseURL>http://alcme.oclc.org/xtcat/servlet/OAIHandler</xoaih:baseURL>
  <xoaih:metadataPrefix>oai_etdms</xoaih:metadataPrefix>
  <xoaih:set>ETD</xoaih:set>
</xoaih:repository>

------_=_NextPart_000_01C3AC73.91F6DF3A
Content-Type: application/octet-stream;
	name="harvest.pl"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
	filename="harvest.pl"

#! /usr/bin/perl
#     Copyright (c) 2000-2003 OCLC Online Computer Library Center,
#     Inc. and other contributors. All rights reserved. The contents of
#     this file, as updated from time to time by the OCLC Office of
#     Research are subject to OCLC Research Public License Version 2.0
#     (the "License"); you may not use this file except in compliance
#     with the License.  You may obtain a current copy of the License
#     at http://purl.org/oclc/research/ORPL/. Software distributed
#     under the License is distributed on an "AS IS" basis, WITHOUT
#     WARRANTY OF ANY KIND, either express or implied.  See the License
#     for the specific language governing rights and limitations under
#     the License.  This software consists of voluntary contributions
#     made by many individuals on behalf of OCLC Research.  For more
#     information on OCLC Research, please see
#     http://www.oclc.org/research/.
#     
#     This is the Original Code.
#     The Initial Developers of the Original Code is Jeffrey Young (mailto:jyoung@oclc.org).
#     Portions created by OCLC are Copyright (C) 2003.  All Rights Reserved.
#     (version: 2003 November 14)
  
# Read the date of the previous harvest, if any; it becomes the "from" date.
$lastHarvestedFileName = $ARGV[0].".lastHarvested";
open (LASTHARVESTED, $lastHarvestedFileName);
$from = <LASTHARVESTED>;
chomp $from;
print $ARGV[0].": Incremental harvest from ".$from."\n";
if (length($from) > 0) {
  $from = " -PARAM from '".$from."'";
}
close(LASTHARVESTED);

# Today's date (UTC) becomes the "until" date.
$untilDate = `date -u '+%Y-%m-%d'`;
chomp $untilDate;
$until = " -PARAM until '".$untilDate."'";

$outFileName = $ARGV[0].".".$untilDate.".xml";

# Refuse to overwrite the results of an earlier harvest made today.
if (-e $outFileName) {
  die("ERROR! Output File Exists. Delete '".$outFileName."' before proceeding.\n");
}

# Run the stylesheet against the repository configuration file.
system("java org.apache.xalan.xslt.Process -IN ".$ARGV[0].".xml -XSL xoaiharvester.xsl -OUT ".$outFileName.$from.$until);

# Remember the "until" date as the starting point for the next harvest.
open(LASTHARVESTED, ">".$lastHarvestedFileName);
print LASTHARVESTED $untilDate;
close(LASTHARVESTED);


------_=_NextPart_000_01C3AC73.91F6DF3A--