[OAI-implementers] XML encoding problems with DSpace at MIT

Simeon Warner simeon@cs.cornell.edu
Fri, 14 Feb 2003 17:07:46 -0500 (EST)


In my recent post to oai-general
http://www.openarchives.org/pipermail/oai-general/2003-February/000258.html
I said I'd post a note about the current output of DSpace at MIT to this
list (which seems a more appropriate forum). I just ran a harvest and got
the log shown below; I've added comments in [].

Cheers,
Simeon.



simeon@ice 14Feb03>more log 
oaiharvest.pl: Harvest from http://hpds1.mit.edu/oai/ using POST
OAIGet: Doing POST to http://hpds1.mit.edu/oai/ args: verb=Identify
OAIGet: Note - Got Content-Encoding 'gzip', decoded with 'gunzip -c'
OAIGet: Got 200 OK (627bytes decoded to 1328bytes)

[nice, DSpace implements gzip content coding]
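
[A minimal Python sketch of the client side of gzip content coding, for
 illustration only. It is not the code in oaiharvest.pl, which pipes the
 body through 'gunzip -c'. The names decode_body and oai_post are
 hypothetical.]

```python
import gzip
import urllib.parse
import urllib.request
from typing import Optional


def decode_body(body: bytes, content_encoding: Optional[str]) -> bytes:
    """Undo gzip content coding if the server applied it, else pass through."""
    if content_encoding and content_encoding.strip().lower() in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    return body


def oai_post(base_url: str, **args: str) -> bytes:
    """POST an OAI-PMH request (hypothetical helper) and return the decoded body."""
    data = urllib.parse.urlencode(args).encode("ascii")
    # Advertise that we can accept gzip; the server may or may not use it.
    req = urllib.request.Request(
        base_url, data=data, headers={"Accept-Encoding": "gzip"}
    )
    with urllib.request.urlopen(req) as resp:
        return decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```

[Usage would be something like oai_post(base_url, verb="Identify"); the
 decoding step is kept separate so it can be checked without a network
 connection.]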

oaiharvest.pl: Identify reports OAI-PMH version 2.0
oaiharvest.pl: Doing complete harvest.
OAIGet: Doing POST to http://hpds1.mit.edu/oai/ args: verb=ListMetadataFormats
OAIGet: Note - Got Content-Encoding 'gzip', decoded with 'gunzip -c'
OAIGet: Got 200 OK (307bytes decoded to 643bytes)
OAIGet: Doing POST to http://hpds1.mit.edu/oai/ args: verb=ListSets
OAIGet: Note - Got Content-Encoding 'gzip', decoded with 'gunzip -c'
OAIGet: Got 200 OK (715bytes decoded to 2140bytes)
OAIGet: Doing POST to http://hpds1.mit.edu/oai/ args: metadataPrefix=oai_dc&verb=ListRecords
OAIGet: Note - Got Content-Encoding 'gzip', decoded with 'gunzip -c'
OAIGet: Got 200 OK (254096bytes decoded to 1392720bytes)
oaiharvest.pl: UTF-8/XML errors in ListRecords.1:

[oops, the expat parser fails on the response.
 My harvester now attempts to replace bad XML/UTF-8 bytes/chars
 using my utf8conditioner, details at 
 http://www.cs.cornell.edu/people/simeon/software/utf8conditioner/ 
 Unless the response can be parsed we can't even know whether there is a
 resumptionToken...] 

utf8conditioner: 
Line 320, char 81453, byte 81491: code not allowed in XML: 0x000C, substituted 0x3F
Line 324, char 81713, byte 81751: code not allowed in XML: 0x000B, substituted 0x3F
Line 1839, char 559834, byte 559890: code not allowed in XML: 0x000B, substituted 0x3F
Line 1840, char 559919, byte 559975: code not allowed in XML: 0x000C, substituted 0x3F
Line 1843, char 560213, byte 560269: code not allowed in XML: 0x000B, substituted 0x3F
Line 1846, char 560475, byte 560531: code not allowed in XML: 0x000B, substituted 0x3F
Line 1850, char 560807, byte 560863: code not allowed in XML: 0x000C, substituted 0x3F
Line 1851, char 560911, byte 560967: code not allowed in XML: 0x000B, substituted 0x3F
Line 2249, char 658132, byte 658188: code not allowed in XML: 0x000B, substituted 0x3F
Line 2250, char 658230, byte 658286: code not allowed in XML: 0x000B, substituted 0x3F
Line 2253, char 658449, byte 658505: code not allowed in XML: 0x000B, substituted 0x3F
Line 2271, char 662207, byte 662263: code not allowed in XML: 0x000B, substituted 0x3F
Line 2274, char 662411, byte 662467: code not allowed in XML: 0x000E, substituted 0x3F
Line 2287, char 663373, byte 663429: code not allowed in XML: 0x000B, substituted 0x3F

[utf8conditioner detected and did replacements for a number of characters] 
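
[The kind of substitution reported above can be sketched in Python as
 follows. This is an illustration under stated assumptions, not the
 utf8conditioner source: it replaces every code point outside the XML 1.0
 Char production with '?' (0x3F) and records the line and code. Unlike the
 real tool it does not report char/byte offsets, and it turns invalid UTF-8
 bytes into U+FFFD rather than reporting them. The name condition is
 hypothetical.]

```python
import re

# Matches any character NOT allowed by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NOT_XML_CHAR = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def condition(data: bytes, substitute: str = "?"):
    """Return (cleaned_text, problems) for a UTF-8 encoded XML response.

    problems is a list of (line_number, code_point) for each substitution.
    """
    # Assumption: malformed UTF-8 sequences are simply mapped to U+FFFD here.
    text = data.decode("utf-8", errors="replace")
    problems = []

    def _substitute(match):
        line = text.count("\n", 0, match.start()) + 1
        problems.append((line, ord(match.group())))
        return substitute

    cleaned = _NOT_XML_CHAR.sub(_substitute, text)
    return cleaned, problems


if __name__ == "__main__":
    sample = b"<a>one\x0ctwo</a>\n<b>x\x0by</b>\n"
    cleaned, problems = condition(sample)
    for line, code in problems:
        print("Line %d: code not allowed in XML: 0x%04X, substituted 0x3F"
              % (line, code))
```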

oaiharvest.pl: Got 752 records (running total: 752)
oaiharvest.pl: No resumptionToken, end of complete list.

[expat could then parse response extracting 752 records, no resumptionToken]
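
[Once the response parses, counting records and checking for a
 resumptionToken is straightforward. A minimal Python sketch, assuming an
 OAI-PMH 2.0 response in the standard namespace; records_and_token is a
 hypothetical name and this is not how oaiharvest.pl does it. In OAI-PMH 2.0
 an empty resumptionToken element marks the end of a list, so it is treated
 the same as a missing one.]

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def records_and_token(xml_text):
    """Return (record_count, resumption_token_or_None) for a ListRecords response."""
    root = ET.fromstring(xml_text)
    count = sum(1 for _ in root.iter(OAI_NS + "record"))
    token_elem = root.find(".//" + OAI_NS + "resumptionToken")
    token = None
    if token_elem is not None and token_elem.text and token_elem.text.strip():
        token = token_elem.text.strip()
    return count, token
```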

oaiharvest.pl: Done.
simeon@ice 14Feb03>


[doing the same tests with Xerces...]

simeon@ice 14Feb03>xercesCountElements lr
[Fatal Error] lr:320:76: An invalid XML character (Unicode: 0xc) was found in the element content of the document.

simeon@ice 14Feb03>cat lr | utf8conditioner -x > lrc
Line 320, char 81453, byte 81491: code not allowed in XML: 0x000C, substituted 0x3F
[..etc, same output as above...]
Line 2287, char 663373, byte 663429: code not allowed in XML: 0x000B, substituted 0x3F

simeon@ice 14Feb03>xercesCountElements lrc
lrc: 4665;519;10 ms (18104 elems, 3013 attrs, 0 spaces, 740197 chars)
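
[For anyone without Xerces to hand, a similar well-formedness check and
 count can be sketched with Python's expat binding. This is only an
 illustration: count_elements is a hypothetical name, and the totals need
 not match the Xerces figures above exactly, since the two tools may count
 whitespace and namespace declarations differently.]

```python
import xml.parsers.expat


def count_elements(data: bytes):
    """Parse data with expat and count elements, attributes and characters.

    Raises xml.parsers.expat.ExpatError if the input is not well-formed,
    for example when it contains a character that XML does not allow.
    """
    counts = {"elems": 0, "attrs": 0, "chars": 0}

    def start_element(name, attrs):
        counts["elems"] += 1
        counts["attrs"] += len(attrs)

    def character_data(text):
        counts["chars"] += len(text)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.CharacterDataHandler = character_data
    parser.Parse(data, True)  # True marks this as the final chunk
    return counts
```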