[OAI-implementers] Trouble parsing records with apache commons digester : UTF8 and xerces UTFDataFormatException

Simeon Warner simeon@cs.cornell.edu
Wed, 28 Jan 2004 16:52:45 -0500 (EST)


Thomas,

Sounds like the XML you are trying to parse is simply broken and you 
should therefore contact the data provider to get them to fix the problem.
If the XML declaration starts by declaring UTF-8 encoding: 
  <?xml ...  encoding="UTF-8"?>
then the data must be correctly encoded as UTF-8.
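
One way to confirm this on the harvester side, before the data reaches
Xerces, is to run the raw bytes through a strict UTF-8 decoder. The sketch
below is only an illustration: the class name Utf8Check and the method
isValidUtf8 are made up for this example, and it uses the java.nio.charset
API, which requires JDK 1.4 or later (StandardCharsets requires Java 7).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Digester or Xerces.
class Utf8Check {

    // Returns true only if every byte sequence in data is well-formed UTF-8.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                // Fail on bad input instead of silently substituting U+FFFD.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String word = "Virtualit\u00e4t"; // contains a-umlaut
        byte[] asUtf8 = word.getBytes(StandardCharsets.UTF_8);
        byte[] asLatin1 = word.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(isValidUtf8(asUtf8));   // prints "true"
        System.out.println(isValidUtf8(asLatin1)); // prints "false"
    }
}
```

If a record fails a check like this while its XML declaration says UTF-8,
the fault lies with the data, not with the parser.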

The hint you quote refers to editing XML files and does not apply to OAI
data providers, as the OAI protocol mandates UTF-8 (see: 
http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#XMLResponse).

I just tried harvesting from the registered baseURL for Sammelpunkt
(http://sammelpunkt.philo.at:8080/perl/oai2) and got a 500 error so there
seem to be some problems.
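
For anyone who wants to repeat that check, here is a rough sketch that
sends an OAI-PMH Identify request and prints the HTTP status code. The
class name OaiProbe and its two methods are hypothetical names for this
example, and the timeouts are arbitrary; it assumes the baseURL has no
query string of its own.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical probe, for illustration only.
class OaiProbe {

    // Builds an Identify request from a baseURL that has no query string.
    static String identifyUrl(String baseUrl) {
        return baseUrl + "?verb=Identify";
    }

    // Issues a GET request and returns the HTTP status code.
    static int statusOf(String requestUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(requestUrl).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(10000); // milliseconds, arbitrary choice
        conn.setReadTimeout(10000);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String baseUrl = args.length > 0
                ? args[0]
                : "http://sammelpunkt.philo.at:8080/perl/oai2";
        String url = identifyUrl(baseUrl);
        System.out.println(url + " -> HTTP " + statusOf(url));
    }
}
```

A status in the 5xx range points to a server-side fault at the data
provider rather than anything in the harvester.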

Cheers,
Simeon


On Thu, 15 Jan 2004, Thomas Krämer wrote:
> Hello,
> 
> I am trying to parse records with the Commons Digester, which works fine as long as the records
> do not contain special characters such as German umlauts, French accents, etc.
> 
> I found a hint at:
> 
> http://www.mail-archive.com/oxf-users@orbeon.com/msg00297.html which is not suitable for harvester
> applications.
> 
> shouldn't the providers be aware of the right character encoding?
> and: does anyone know how to handle this?
> 
> I am not sure whether I am making wrong assumptions or whether the handling of character encoding
> is not standardized yet.
> 
> an example:
> 
> i try to parse metadata records with the apache commons digester, which uses xerces.
> 
> unfortunately, all that metadata is declared as UTF-8, which causes a
> 
> 
> java.io.UTFDataFormatException: Invalid byte 2 of 2-byte UTF-8 sequence.
>      at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
> 
> 
> when i try to read an xml file such as the one attached below.
> 
> 
> Any suggestions?
> 
> 
> 
> <?xml version="1.0" encoding="utf-8"?>
> <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
> xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
> http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
> <dc:title>Medienphilosophie(n)</dc:title>         <dc:creator>Hartmann, Dr.
> Frank</dc:creator>         <dc:subject>Medienphilosophie, Theorie der
> Virtualität, Cyberphilosophie</dc:subject>         <dc:description>Die Frage, ob
> 
> ...
> 
> wird, auflösen wird lassen. Eine Rekonstruktion relevanter
> Positionen.</dc:description>         <dc:date>2002-01-01</dc:date>
> <dc:type>Book Chapter</dc:type>
> <dc:identifier>http://sammelpunkt.philo.at:8080/archive/00000103/</dc:identifier> <dc:format>html 
> http://sammelpunkt.philo.at:8080/archive/00000103/01/medienphilosophie.html</dc:format></oai_dc:dc>
> 
> 
> 
> 
> kind regards
> 
> thomas
> 