[OAI-implementers] valid character encoding

Simeon Warner simeon@cs.cornell.edu
Wed, 13 Aug 2003 11:27:55 -0400 (EDT)


On Wed, 13 Aug 2003, Todd White wrote:
> On Wed, 13 Aug 2003, Thomas G. Habing wrote:
> 
> > The OAI spec mandates that all XML responses must be encoded as UTF-8.
> 
> here's an example of a record that has a special character.  i'm not sure
> if i'm handling it correctly.  can anyone confirm?
> 
> http://michiganteacher.net/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:michiganteacher.net:120

You have "mus\'ee" in the title, and the e acute is not UTF-8 encoded.
Instead, your response contains the single byte 0xE9:

0xE9	0x00E9	#LATIN SMALL LETTER E WITH ACUTE

You might find my little utf8conditioner code helpful for checking your
UTF-8 output
(http://www.cs.cornell.edu/people/simeon/software/utf8conditioner/):

simeon@ice ~>cat oai.xml | ~/src/utf8/utf8conditioner -c
Line 22, char 1181, byte 1181: byte 2 isn't continuation: 0xE9 0x65, restart at 0x65, substituted 0x3F
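
If you don't have utf8conditioner to hand, the same kind of check can be
sketched in a few lines of Python. This is only an illustration, not how
utf8conditioner works internally: the function name first_utf8_error and
the sample title bytes are made up for the example, and it relies on
Python's built-in strict UTF-8 decoder to find the first bad byte.

```python
def first_utf8_error(data: bytes):
    """Return (offset, byte) of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # strict decoding raises on any invalid sequence
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


# Hypothetical sample: the e acute sent as the single byte 0xE9.
bad = b"<dc:title>mus\xe9e</dc:title>"
# The same title properly encoded as UTF-8.
good = "<dc:title>mus\u00e9e</dc:title>".encode("utf-8")

print(first_utf8_error(bad))
print(first_utf8_error(good))
```

For the bad sample this reports the offset of the 0xE9 byte, which is
then followed by 0x65 ("e") rather than a continuation byte, the same
problem utf8conditioner flags above.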

The correct UTF-8 encoding for character code E9 (U+00E9) is the two-byte
sequence C3 A9.
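
To see where C3 A9 comes from, here is a small Python sketch of the
two-byte UTF-8 form (110xxxxx 10xxxxxx), which covers code points 0x80
through 0x7FF. The helper name encode_two_byte is invented for this
example, and the last line assumes your source data is ISO-8859-1, which
I can't tell for certain from the response alone.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range 0x80..0x7FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the two-byte UTF-8 range")
    lead = 0xC0 | (code_point >> 6)    # 110xxxxx: the top five bits
    cont = 0x80 | (code_point & 0x3F)  # 10xxxxxx: the low six bits
    return bytes([lead, cont])


encoded = encode_two_byte(0xE9)
print(encoded.hex())

# Assumption: the stored title is ISO-8859-1. Decode it first, then
# re-encode as UTF-8 before writing it into the OAI response.
fixed = b"mus\xe9e".decode("iso-8859-1").encode("utf-8")
print(fixed)
```

Python's own "\u00e9".encode("utf-8") gives the same two bytes, so in
practice you would let the language's encoder do this for you.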

Cheers,
Simeon