[OAI-implementers] XML encoding problems with DSpace at MIT

Kat Hagedorn khage@umich.edu
Mon, 17 Feb 2003 17:23:03 -0500


I'm also with Hussein and Caroline. As a service provider, we do our
best to notify data providers when we run into errors that do not seem
to be problems with our harvester. We can always clean the data at our
end (and as close to the source as possible), but it's far easier in the
long run for us to notify the data provider. As a result, we don't waste
time on our end and other service providers will harvest cleaner data.
The communication ends up benefiting everyone.

- Kat

On Saturday, Feb 15, 2003, at 11:25 America/Detroit, Caroline Arms wrote:

>
> As a data provider, LC would like to know if it is generating invalid
> characters.  The gradual migration to UNICODE is going to give us all
> problems, in part BECAUSE some systems work so hard to recognize
> different character encodings and adjust.  I'm with Hussein.  Notify
> data providers of problems (even if you do adjust) so that the problem
> can be fixed as close to home as possible.
>
> As a related aside, if anyone has a suggestion for an efficient way
> (preferably unix-based) to check that the metadata in a PDF file is
> stored in UTF-8 encoding (or consistently in any other UNICODE
> encoding), I'd be interested.
>
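
A minimal sketch of the UTF-8 half of such a check, assuming the
metadata bytes have already been pulled out of the PDF by some other
tool (the extraction step is not shown, and the function name
first_invalid_utf8 is made up for illustration). It only reports whether
the bytes decode as strict UTF-8 and where the first failure is; pure
ASCII passes too, so it cannot prove that UTF-8 was the intended
encoding.

```python
from typing import Optional


def first_invalid_utf8(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    Hypothetical helper: `data` is assumed to be metadata bytes that were
    already extracted from the PDF by another tool.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded.
        return exc.start
    return None


if __name__ == "__main__":
    print(first_invalid_utf8("café".encode("utf-8")))    # valid UTF-8
    print(first_invalid_utf8("café".encode("latin-1")))  # Latin-1 bytes
```
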
> Caroline Arms
> Office of Strategic Initiatives
> Library of Congress
>
> On Sat, 15 Feb 2003, Hussein Suleman wrote:
>
>> hi
>>
>> i think Tim poses a very relevant question: do we deal with the
>> so-called "real-world" encoding problems or do we try to encourage
>> people to fix their implementations? (of course, for research
>> purposes, we may end up doing both :))
>>
>> personally, the code i distribute to others does quite a lot of XML
>> cleaning in the data provider, but none at all in the harvester. i
>> think the basic philosophy i'm following is: clean data as close to
>> the source as possible. also, i believe one of the reasons the
>> adminEmail field in Identify responses is required is so that a
>> service provider can contact the administrator if there are problems
>> with the data.
>>
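
For anyone wanting to automate that notification, here is a rough sketch
of pulling the adminEmail values out of an Identify response. It assumes
an OAI-PMH 2.0 response that has already been fetched and parses as XML;
the function name admin_emails, the SAMPLE document, and the address in
it are invented for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical, heavily trimmed Identify response used only as sample input.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <Identify>
    <repositoryName>Example Repository</repositoryName>
    <protocolVersion>2.0</protocolVersion>
    <adminEmail>admin@example.org</adminEmail>
  </Identify>
</OAI-PMH>"""


def admin_emails(identify_xml: bytes) -> List[str]:
    """Return every adminEmail listed in an Identify response."""
    root = ET.fromstring(identify_xml)
    found = root.findall("./oai:Identify/oai:adminEmail", OAI_NS)
    return [e.text.strip() for e in found if e.text]


if __name__ == "__main__":
    print(admin_emails(SAMPLE))
```
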
>> and now that the hype about OAI2 is dying down, i wonder how much (if
>> any) more testing we need. i have some ideas to enhance, complement
>> and possibly even replace the repository explorer in the next year ...
>> it all depends on finding time and/or students/colleagues with time :)
>>
>> ttfn,
>> ----hussein
>>
>>
>> Tim Brody wrote:
>>>
>>> http://celestial.eprints.org/cgi-bin/status?action=repository;metadataFormat=17
>>>
>>> Harvested 752 records - I've also implemented some
>>> character-substitution to fix encoding errors, although this is
>>> probably not as proficient as Simeon's!
>>>
>>> The question is, the more harvesters implement fixes the less
>>> pressure there is on repositories to fix their output, so should
>>> harvesters accept bad-XML?
>>> (once that question is answered, harvesters have to decide how much
>>> normalisation of metadata they do :-)
>>>
>>> All the best,
>>> Tim.
>>>
>>> Simeon Warner wrote:
>>>
>>>> In my recent post to oai-general
>>>>
>>>> http://www.openarchives.org/pipermail/oai-general/2003-February/000258.html
>>>>
>>>> I said I'd post a note about the current output of DSpace at MIT to
>>>> this list (which seems a more appropriate forum). I just ran a
>>>> harvest and got the log shown below, I've added comments in [].
>>>>
>>>> Cheers,
>>>> Simeon.
>>>>
>>>>
>>>>
>>>> simeon@ice 14Feb03>more log
>>>> oaiharvest.pl: Harvest from http://hpds1.mit.edu/oai/ using POST
>>>> OAIGet: Doing POST to http://hpds1.mit.edu/oai/ args: verb=Identify
>>>> OAIGet: Note - Got Content-Encoding 'gzip', decoded with 'gunzip -c'
>>>> OAIGet: Got 200 OK (627bytes decoded to 1328bytes)
>>>>
>>>> [nice, DSpace implements gzip content coding]
>>>>
>>>> oaiharvest.pl: Identify reports OAI-PMH version 2.0
>>>> oaiharvest.pl: Doing complete harvest.
>>>> OAIGet: Doing POST to http://hpds1.mit.edu/oai/ args: verb=ListMetadataFormats
>>>> OAIGet: Note - Got Content-Encoding 'gzip', decoded with 'gunzip -c'
>>>> OAIGet: Got 200 OK (307bytes decoded to 643bytes)
>>>> OAIGet: Doing POST to http://hpds1.mit.edu/oai/ args: verb=ListSets
>>>> OAIGet: Note - Got Content-Encoding 'gzip', decoded with 'gunzip -c'
>>>> OAIGet: Got 200 OK (715bytes decoded to 2140bytes)
>>>> OAIGet: Doing POST to http://hpds1.mit.edu/oai/ args: metadataPrefix=oai_dc&verb=ListRecords
>>>> OAIGet: Note - Got Content-Encoding 'gzip', decoded with 'gunzip -c'
>>>> OAIGet: Got 200 OK (254096bytes decoded to 1392720bytes)
>>>> oaiharvest.pl: UTF-8/XML errors in ListRecords.1:
>>>>
>>>> [oops, expat parser fails on response
>>>>  my harvester now attempts to do replacement on bad XML/UTF8 bytes/chars
>>>>  using my utf8conditioner, details at
>>>>  http://www.cs.cornell.edu/people/simeon/software/utf8conditioner/
>>>>  Unless the response can be parsed we can't even know if there is a
>>>>  resumptionToken...]
>>>> utf8conditioner: Line 320, char 81453, byte 81491: code not allowed in XML: 0x000C, substituted 0x3F
>>>> Line 324, char 81713, byte 81751: code not allowed in XML: 0x000B, substituted 0x3F
>>>> Line 1839, char 559834, byte 559890: code not allowed in XML: 0x000B, substituted 0x3F
>>>> Line 1840, char 559919, byte 559975: code not allowed in XML: 0x000C, substituted 0x3F
>>>> Line 1843, char 560213, byte 560269: code not allowed in XML: 0x000B, substituted 0x3F
>>>> Line 1846, char 560475, byte 560531: code not allowed in XML: 0x000B, substituted 0x3F
>>>> Line 1850, char 560807, byte 560863: code not allowed in XML: 0x000C, substituted 0x3F
>>>> Line 1851, char 560911, byte 560967: code not allowed in XML: 0x000B, substituted 0x3F
>>>> Line 2249, char 658132, byte 658188: code not allowed in XML: 0x000B, substituted 0x3F
>>>> Line 2250, char 658230, byte 658286: code not allowed in XML: 0x000B, substituted 0x3F
>>>> Line 2253, char 658449, byte 658505: code not allowed in XML: 0x000B, substituted 0x3F
>>>> Line 2271, char 662207, byte 662263: code not allowed in XML: 0x000B, substituted 0x3F
>>>> Line 2274, char 662411, byte 662467: code not allowed in XML: 0x000E, substituted 0x3F
>>>> Line 2287, char 663373, byte 663429: code not allowed in XML: 0x000B, substituted 0x3F
>>>>
>>>> [utf8conditioner detected and did replacements for a number of
>>>> characters]
>>>> oaiharvest.pl: Got 752 records (running total: 752)
>>>> oaiharvest.pl: No resumptionToken, end of complete list.
>>>>
>>>> [expat could then parse response extracting 752 records, no
>>>> resumptionToken]
>>>>
>>>> oaiharvest.pl: Done.
>>>> simeon@ice 14Feb03>
>>>>
>>>>
>>>> [doing the same tests with Xerces...]
>>>>
>>>> simeon@ice 14Feb03>xercesCountElements lr
>>>> [Fatal Error] lr:320:76: An invalid XML character (Unicode: 0xc) was found in the element content of the document.
>>>>
>>>> simeon@ice 14Feb03>cat lr | utf8conditioner -x > lrc
>>>> Line 320, char 81453, byte 81491: code not allowed in XML: 0x000C, substituted 0x3F
>>>> [..etc, same output as above...]
>>>> Line 2287, char 663373, byte 663429: code not allowed in XML: 0x000B, substituted 0x3F
>>>>
>>>> simeon@ice 14Feb03>xercesCountElements lrc
>>>> lrc: 4665;519;10 ms (18104 elems, 3013 attrs, 0 spaces, 740197 chars)
>>>>
>>>>
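
The kind of substitution shown in the log (codes such as 0x000B, 0x000C
and 0x000E replaced by 0x3F) can be approximated in a few lines. This is
only a sketch of the general idea, not utf8conditioner itself: the
function name condition_xml_bytes is made up, the allowed set is the
Char production of XML 1.0, and malformed UTF-8 byte sequences are
simply turned into U+FFFD by the decoder rather than reported.

```python
import re
from typing import List, Tuple

# Anything outside the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NOT_XML_CHAR = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def condition_xml_bytes(raw: bytes, substitute: str = "?") -> Tuple[bytes, List[Tuple[int, int]]]:
    """Replace code points not allowed in XML 1.0 with `substitute`.

    Returns the cleaned UTF-8 bytes and a list of (character offset,
    code point) pairs, one per substitution made.
    """
    # Malformed byte sequences become U+FFFD, which XML permits.
    text = raw.decode("utf-8", errors="replace")
    problems: List[Tuple[int, int]] = []

    def _replace(match: "re.Match[str]") -> str:
        problems.append((match.start(), ord(match.group())))
        return substitute

    cleaned = _NOT_XML_CHAR.sub(_replace, text)
    return cleaned.encode("utf-8"), problems


if __name__ == "__main__":
    cleaned, problems = condition_xml_bytes(b"<a>ok\x0cbad\x0b</a>")
    for offset, code in problems:
        print(f"char {offset}: code not allowed in XML: 0x{code:04X}")
    print(cleaned)
```

Running this in the harvester only keeps the harvest going; the offsets
it reports are what is worth sending on to the data provider.
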
>>>> _______________________________________________
>>>> OAI-implementers mailing list
>>>> OAI-implementers@oaisrv.nsdl.cornell.edu
>>>> http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers
>>>>
>>>
>>
>>
>> -- 
>> =====================================================================
>> hussein suleman ~ hussein@cs.uct.ac.za ~ http://www.husseinsspace.com
>> =====================================================================
>>
>>
>>
>
>
-------------------
Kat Hagedorn
OAIster/Metadata Harvesting Librarian
Digital Library Production Service
University of Michigan

http://www.oaister.org/
khage@umich.edu
734-615-7618