[OAI-implementers] XML encoding problems with DSpace at MIT

Hussein Suleman hussein@vt.edu
Sat, 15 Feb 2003 17:31:36 +0200


hi

i think Tim poses a very relevant question: do we deal with the
so-called "real-world" encoding problems or do we try to encourage
people to fix their implementations? (of course, for research purposes,
we may end up doing both :))

personally, the code i distribute to others does quite a lot of XML
cleaning in the data provider, but none at all in the harvester. i think
the basic philosophy i'm following is: clean data as close to the source
as possible. also, i believe one of the reasons the adminEmail field in
Identify responses is required is so that a service provider can contact
the administrator if there are problems with the data.
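
to make "clean at the source" a bit more concrete, here is a minimal
sketch in Python of the kind of check a data provider could run on each
metadata value before writing it into a response. this is only an
illustration, not the code i distribute: the names clean_xml_text and
to_xml_text are made up for the example, and substituting "?" simply
mirrors what the utf8conditioner log quoted below reports (0x3F).

```python
import re
from xml.sax.saxutils import escape

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_xml_text(text, replacement="?"):
    """Replace characters XML 1.0 forbids.

    Returns (cleaned_text, problems), where problems is a list of
    (character_offset, code_point) pairs for every substitution made.
    """
    problems = []

    def _substitute(match):
        # Record where the bad character was and what it was.
        problems.append((match.start(), ord(match.group())))
        return replacement

    cleaned = _INVALID_XML_CHARS.sub(_substitute, text)
    return cleaned, problems


def to_xml_text(text):
    """Clean a metadata value, then escape &, < and > for element content."""
    cleaned, _ = clean_xml_text(text)
    return escape(cleaned)


if __name__ == "__main__":
    # Hypothetical title containing a form feed (0x0C) and a vertical tab (0x0B).
    cleaned, problems = clean_xml_text("Title\x0c with\x0b breaks")
    print(cleaned)
    for offset, code in problems:
        print("char %d: code not allowed in XML: 0x%04X" % (offset, code))
```

a repository that logs the problems list also knows which records need
fixing in its own database, which a harvester can never do.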

and now that the hype about OAI2 is dying down, i wonder how much (if
any) more testing we need. i have some ideas to enhance, complement and
possibly even replace the repository explorer in the next year ... it
all depends on finding time and/or students/colleagues with time :)

ttfn,
----hussein


Tim Brody wrote:
 > http://celestial.eprints.org/cgi-bin/status?action=repository;metadataFormat=17
 >
 >
 > Harvested 752 records - I've also implemented some
 > character-substitution to fix encoding errors, although this is probably
 > not as proficient as Simeon's!
 >
 > The question is, the more harvesters implement fixes the less pressure
 > there is on repositories to fix their output, so should harvesters
 > accept bad-XML?
 > (once that question is answered, harvesters have to decide how much
 > normalisation of metadata they do :-)
 >
 > All the best,
 > Tim.
 >
 > Simeon Warner wrote:
 >
 >> In my recent post to oai-general
 >> http://www.openarchives.org/pipermail/oai-general/2003-February/000258.html
 >>
 >> I said I'd post a note about the current output of DSpace at MIT to this
 >> list (which seems a more appropriate forum). I just ran a harvest and
 >> got the log shown below, I've added comments in [].
 >>
 >> Cheers,
 >> Simeon.
 >>
 >>
 >>
 >> simeon@ice 14Feb03>more log
 >> oaiharvest.pl: Harvest from http://hpds1.mit.edu/oai/ using POST
 >> OAIGet: Doing POST to http://hpds1.mit.edu/oai/ args: verb=Identify
 >> OAIGet: Note - Got Content-Encoding 'gzip', decoded with 'gunzip -c'
 >> OAIGet: Got 200 OK (627bytes decoded to 1328bytes)
 >>
 >> [nice, DSpace implements gzip content coding]
 >>
 >> oaiharvest.pl: Identify reports OAI-PMH version 2.0
 >> oaiharvest.pl: Doing complete harvest.
 >> OAIGet: Doing POST to http://hpds1.mit.edu/oai/ args:
 >> verb=ListMetadataFormats
 >> OAIGet: Note - Got Content-Encoding 'gzip', decoded with 'gunzip -c'
 >> OAIGet: Got 200 OK (307bytes decoded to 643bytes)
 >> OAIGet: Doing POST to http://hpds1.mit.edu/oai/ args: verb=ListSets
 >> OAIGet: Note - Got Content-Encoding 'gzip', decoded with 'gunzip -c'
 >> OAIGet: Got 200 OK (715bytes decoded to 2140bytes)
 >> OAIGet: Doing POST to http://hpds1.mit.edu/oai/ args:
 >> metadataPrefix=oai_dc&verb=ListRecords
 >> OAIGet: Note - Got Content-Encoding 'gzip', decoded with 'gunzip -c'
 >> OAIGet: Got 200 OK (254096bytes decoded to 1392720bytes)
 >> oaiharvest.pl: UTF-8/XML errors in ListRecords.1:
 >>
 >> [oops, expat parser fails on response
 >>  my harvester now attempts to do replacement on bad XML/UTF8 bytes/chars
 >>  using my utf8conditioner, details at
 >>  http://www.cs.cornell.edu/people/simeon/software/utf8conditioner/
 >>  Unless the response can be parsed we can't even know if there is a
 >>  resumptionToken...]
 >> utf8conditioner: Line 320, char 81453, byte 81491: code not allowed in
 >> XML: 0x000C, substituted 0x3F
 >> Line 324, char 81713, byte 81751: code not allowed in XML: 0x000B,
 >> substituted 0x3F
 >> Line 1839, char 559834, byte 559890: code not allowed in XML: 0x000B,
 >> substituted 0x3F
 >> Line 1840, char 559919, byte 559975: code not allowed in XML: 0x000C,
 >> substituted 0x3F
 >> Line 1843, char 560213, byte 560269: code not allowed in XML: 0x000B,
 >> substituted 0x3F
 >> Line 1846, char 560475, byte 560531: code not allowed in XML: 0x000B,
 >> substituted 0x3F
 >> Line 1850, char 560807, byte 560863: code not allowed in XML: 0x000C,
 >> substituted 0x3F
 >> Line 1851, char 560911, byte 560967: code not allowed in XML: 0x000B,
 >> substituted 0x3F
 >> Line 2249, char 658132, byte 658188: code not allowed in XML: 0x000B,
 >> substituted 0x3F
 >> Line 2250, char 658230, byte 658286: code not allowed in XML: 0x000B,
 >> substituted 0x3F
 >> Line 2253, char 658449, byte 658505: code not allowed in XML: 0x000B,
 >> substituted 0x3F
 >> Line 2271, char 662207, byte 662263: code not allowed in XML: 0x000B,
 >> substituted 0x3F
 >> Line 2274, char 662411, byte 662467: code not allowed in XML: 0x000E,
 >> substituted 0x3F
 >> Line 2287, char 663373, byte 663429: code not allowed in XML: 0x000B,
 >> substituted 0x3F
 >>
 >> [utf8conditioner detected and did replacements for a number of
 >> characters]
 >> oaiharvest.pl: Got 752 records (running total: 752)
 >> oaiharvest.pl: No resumptionToken, end of complete list.
 >>
 >> [expat could then parse response extracting 752 records, no
 >> resumptionToken]
 >>
 >> oaiharvest.pl: Done.
 >> simeon@ice 14Feb03>
 >>
 >>
 >> [doing the same tests with Xerces...]
 >>
 >> simeon@ice 14Feb03>xercesCountElements lr
 >> [Fatal Error] lr:320:76: An invalid XML character (Unicode: 0xc) was
 >> found in the element content of the document.
 >>
 >> simeon@ice 14Feb03>cat lr | utf8conditioner -x > lrc
 >> Line 320, char 81453, byte 81491: code not allowed in XML: 0x000C,
 >> substituted 0x3F
 >> [..etc, same output as above...]
 >> Line 2287, char 663373, byte 663429: code not allowed in XML: 0x000B,
 >> substituted 0x3F
 >>
 >> simeon@ice 14Feb03>xercesCountElements lrc
 >> lrc: 4665;519;10 ms (18104 elems, 3013 attrs, 0 spaces, 740197 chars)
 >>
 >>
 >> _______________________________________________
 >> OAI-implementers mailing list
 >> OAI-implementers@oaisrv.nsdl.cornell.edu
 >> http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers
 >>
 >


-- 
=====================================================================
hussein suleman ~ hussein@cs.uct.ac.za ~ http://www.husseinsspace.com
=====================================================================