[OAI-implementers] XML encoding problems with DSpace at MIT

Hussein Suleman hussein@cs.uct.ac.za
Sat, 15 Feb 2003 17:23:46 +0200


hi

i think Tim poses a very relevant question: do we deal with the 
so-called "real-world" encoding problems or do we try to encourage 
people to fix their implementations? (of course, for research purposes, 
we may end up doing both :))

personally, the code i distribute to others does quite a lot of XML 
cleaning in the data provider, but none at all in the harvester. i think 
the basic philosophy i'm following is: clean data as close to the source 
as possible. also, i believe one of the reasons the adminEmail field in 
Identify responses is required is so that a service provider can contact 
the administrator if there are problems with the data.
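
to make that concrete, here is a minimal sketch (in python, and not the code i 
actually distribute) of the kind of cleaning a data provider could do just 
before it writes out a record. the names clean_for_xml and dc_element are made 
up for illustration; the character ranges are the Char production from the 
XML 1.0 spec, and the "?" replacement simply mirrors the 0x3F substitution 
that shows up in Simeon's log below.

```python
import re
from xml.sax.saxutils import escape

# XML 1.0 only allows these characters (the "Char" production):
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# anything outside those ranges (e.g. 0x0B, 0x0C, 0x0E) makes a parser fail.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_for_xml(text, replacement="?"):
    """Replace every character that XML 1.0 forbids with `replacement`."""
    return _INVALID_XML_CHARS.sub(replacement, text)


def dc_element(name, value):
    """Build one Dublin Core element (hypothetical helper, for illustration).

    The value is cleaned first, then &, < and > are escaped.
    """
    return "<dc:%s>%s</dc:%s>" % (name, escape(clean_for_xml(value)), name)


if __name__ == "__main__":
    # a title containing a stray form feed (0x0C), as in the log below
    print(dc_element("title", "Fish & Chips\x0c"))
```

if the metadata goes through something like this on the way out of the 
repository, every harvester gets well-formed XML without having to guess 
what the bad bytes were supposed to be.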

and now that the hype about OAI2 is dying down, i wonder how much (if 
any) more testing we need. i have some ideas to enhance, complement and 
possibly even replace the repository explorer in the next year ... it 
all depends on finding time and/or students/colleagues with time :)

ttfn,
----hussein


Tim Brody wrote:
> http://celestial.eprints.org/cgi-bin/status?action=repository;metadataFormat=17 
> 
> 
> Harvested 752 records - I've also implemented some 
> character-substitution to fix encoding errors, although this is probably 
> not as proficient as Simeon's!
> 
> The question is, the more harvesters implement fixes the less pressure 
> there is on repositories to fix their output, so should harvesters 
> accept bad-XML?
> (once that question is answered, harvesters have to decide how much 
> normalisation of metadata they do :-)
> 
> All the best,
> Tim.
> 
> Simeon Warner wrote:
> 
>> In my recent post to oai-general
>> http://www.openarchives.org/pipermail/oai-general/2003-February/000258.html 
>>
>> I said I'd post a note about the current output of DSpace at MIT to this
>> list (which seems a more appropriate forum). I just ran a harvest and got
>> the log shown below, I've added comments in [].
>>
>> Cheers,
>> Simeon.
>>
>>
>>
>> simeon@ice 14Feb03>more log
>> oaiharvest.pl: Harvest from http://hpds1.mit.edu/oai/ using POST
>> OAIGet: Doing POST to http://hpds1.mit.edu/oai/ args: verb=Identify
>> OAIGet: Note - Got Content-Encoding 'gzip', decoded with 'gunzip -c'
>> OAIGet: Got 200 OK (627bytes decoded to 1328bytes)
>>
>> [nice, DSpace implements gzip content coding]
>>
>> oaiharvest.pl: Identify reports OAI-PMH version 2.0
>> oaiharvest.pl: Doing complete harvest.
>> OAIGet: Doing POST to http://hpds1.mit.edu/oai/ args: 
>> verb=ListMetadataFormats
>> OAIGet: Note - Got Content-Encoding 'gzip', decoded with 'gunzip -c'
>> OAIGet: Got 200 OK (307bytes decoded to 643bytes)
>> OAIGet: Doing POST to http://hpds1.mit.edu/oai/ args: verb=ListSets
>> OAIGet: Note - Got Content-Encoding 'gzip', decoded with 'gunzip -c'
>> OAIGet: Got 200 OK (715bytes decoded to 2140bytes)
>> OAIGet: Doing POST to http://hpds1.mit.edu/oai/ args: 
>> metadataPrefix=oai_dc&verb=ListRecords
>> OAIGet: Note - Got Content-Encoding 'gzip', decoded with 'gunzip -c'
>> OAIGet: Got 200 OK (254096bytes decoded to 1392720bytes)
>> oaiharvest.pl: UTF-8/XML errors in ListRecords.1:
>>
>> [oops, expat parser fails on response
>>  my harvester now attempts to do replacement on bad XML/UTF8 bytes/chars
>>  using my utf8conditioner, details at 
>>  http://www.cs.cornell.edu/people/simeon/software/utf8conditioner/ 
>>  Unless the response can be parsed we can't even know if there is a
>>  resumptionToken...]
>> utf8conditioner: Line 320, char 81453, byte 81491: code not allowed in 
>> XML: 0x000C, substituted 0x3F
>> Line 324, char 81713, byte 81751: code not allowed in XML: 0x000B, 
>> substituted 0x3F
>> Line 1839, char 559834, byte 559890: code not allowed in XML: 0x000B, 
>> substituted 0x3F
>> Line 1840, char 559919, byte 559975: code not allowed in XML: 0x000C, 
>> substituted 0x3F
>> Line 1843, char 560213, byte 560269: code not allowed in XML: 0x000B, 
>> substituted 0x3F
>> Line 1846, char 560475, byte 560531: code not allowed in XML: 0x000B, 
>> substituted 0x3F
>> Line 1850, char 560807, byte 560863: code not allowed in XML: 0x000C, 
>> substituted 0x3F
>> Line 1851, char 560911, byte 560967: code not allowed in XML: 0x000B, 
>> substituted 0x3F
>> Line 2249, char 658132, byte 658188: code not allowed in XML: 0x000B, 
>> substituted 0x3F
>> Line 2250, char 658230, byte 658286: code not allowed in XML: 0x000B, 
>> substituted 0x3F
>> Line 2253, char 658449, byte 658505: code not allowed in XML: 0x000B, 
>> substituted 0x3F
>> Line 2271, char 662207, byte 662263: code not allowed in XML: 0x000B, 
>> substituted 0x3F
>> Line 2274, char 662411, byte 662467: code not allowed in XML: 0x000E, 
>> substituted 0x3F
>> Line 2287, char 663373, byte 663429: code not allowed in XML: 0x000B, 
>> substituted 0x3F
>>
>> [utf8conditioner detected and did replacements for a number of 
>> characters]
>> oaiharvest.pl: Got 752 records (running total: 752)
>> oaiharvest.pl: No resumptionToken, end of complete list.
>>
>> [expat could then parse response extracting 752 records, no 
>> resumptionToken]
>>
>> oaiharvest.pl: Done.
>> simeon@ice 14Feb03>
>>
>>
>> [doing the same tests with Xerces...]
>>
>> simeon@ice 14Feb03>xercesCountElements lr
>> [Fatal Error] lr:320:76: An invalid XML character (Unicode: 0xc) was 
>> found in the element content of the document.
>>
>> simeon@ice 14Feb03>cat lr | utf8conditioner -x > lrc
>> Line 320, char 81453, byte 81491: code not allowed in XML: 0x000C, 
>> substituted 0x3F
>> [..etc, same output as above...]
>> Line 2287, char 663373, byte 663429: code not allowed in XML: 0x000B, 
>> substituted 0x3F
>>
>> simeon@ice 14Feb03>xercesCountElements lrc
>> lrc: 4665;519;10 ms (18104 elems, 3013 attrs, 0 spaces, 740197 chars)
>>
>>
>> _______________________________________________
>> OAI-implementers mailing list
>> OAI-implementers@oaisrv.nsdl.cornell.edu
>> http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers
>>
> 


-- 
=====================================================================
hussein suleman ~ hussein@cs.uct.ac.za ~ http://www.husseinsspace.com
=====================================================================