[OAI-implementers] valid character encoding

Simeon Warner simeon@cs.cornell.edu
Wed, 13 Aug 2003 22:46:48 -0400 (EDT)


On Thu, 14 Aug 2003, Steve Thomas wrote:
> While we're on the topic, I have records with the name Niccolò in them (as in 
> Machiavelli) -- there's a grave accent over the final "o". But this doesn't 
> seem to be part of UTF-8, or your conditioner doesn't recognise it. (Although 
> it displays correctly everywhere.)
> 
> Is this invalid in UTF-8, or ... what?
>
> When I dump it in Unix, the character is \xf2, apparently.

A lone 0xF2 byte is NOT a valid UTF-8 sequence.

No single byte in the range 0x80-0xFF is a valid UTF-8 sequence on its
own. 0xF2 is the Latin 1 and CP1252 code, and the Unicode code point
(U+00F2), for o grave. In UTF-8 it is represented as the two-byte
sequence 0xC3 0xB2.
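
The point about 0xF2 can be checked by handing both byte strings to a
UTF-8 decoder. This is only a small sketch in Python, chosen here for
illustration; the helper name is_valid_utf8 is made up for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Latin 1 bytes: o grave stored as the single byte 0xF2
lone_byte = b"Niccol\xf2"
# UTF-8 bytes: o grave stored as the two bytes 0xC3 0xB2
two_bytes = b"Niccol\xc3\xb2"

print(is_valid_utf8(lone_byte))   # False
print(is_valid_utf8(two_bytes))   # True
print("\u00f2".encode("utf-8"))   # b'\xc3\xb2'
```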

If you have data in Latin 1, it is trivial to convert it to UTF-8, but you
must do the conversion before writing XML records for OAI use!
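
As a rough sketch of "convert first, then write", the Python below
decodes Latin 1 bytes, escapes the XML special characters, and only then
produces UTF-8 bytes. The function names and the use of a dc:creator
element are assumptions made for illustration; they are not taken from
any particular repository's code.

```python
from xml.sax.saxutils import escape


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode Latin 1 bytes as UTF-8 bytes (hypothetical helper)."""
    return data.decode("iso-8859-1").encode("utf-8")


def dc_creator_element(latin1_name: bytes) -> bytes:
    """Build a UTF-8 encoded <dc:creator> element from a Latin 1 name.

    The element name is only an example of a metadata field.
    """
    text = latin1_name.decode("iso-8859-1")   # bytes -> characters
    xml = "<dc:creator>" + escape(text) + "</dc:creator>"
    return xml.encode("utf-8")                # characters -> UTF-8 bytes


record = dc_creator_element(b"Niccol\xf2 Machiavelli")
print(record)  # b'<dc:creator>Niccol\xc3\xb2 Machiavelli</dc:creator>'
```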


There seems to be some confusion about these issues so I'll attempt to
summarize a few key points:

o UTF-8 is a particular ENCODING of Unicode (UCS, ISO 10646). Individual
characters are represented by a sequence of between 1 and 6 bytes (per
RFC 2279). Any byte >= 0x80 is part of a multi-byte sequence.

o The ASCII characters (0x00-0x7F) have the same codes in Latin 1 (aka ISO
8859-1) and Unicode. They are also represented by single bytes with the
same values in a UTF-8 stream.

o The upper-half Latin 1 characters (0xA0-0xFF) have the same codes in
Unicode. In UTF-8 streams they are encoded as two-byte sequences. (Direct
inclusion of the raw Latin 1 bytes in a UTF-8 stream will likely produce
invalid UTF-8 sequences and will certainly not be interpreted correctly.)

o Almost every other character set can be mapped to Unicode, though the
mapping may require lookup tables.

o There are libraries and tools to do character set conversion and
encoding in most common languages. For example, Perl permits quite general
conversion; say Latin 1 to UTF-8:
  # see http://search.cpan.org/author/JHI/perl-5.8.0/ext/Encode/Encode.pm
  use Encode;
  # decode the Latin 1 bytes to characters, then encode them as UTF-8 bytes
  my $utf8data = encode("utf8", decode("iso-8859-1", $latin1data));
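
To make the one-byte and two-byte cases above concrete, here is a
hand-rolled encoder for code points below 0x800, written in Python purely
as an illustration. The function name encode_code_point is made up for
this sketch. Real code should use a library conversion like the one
above, not this.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one code point below 0x800 as UTF-8 by hand (sketch only)."""
    if cp < 0x80:
        # ASCII range: a single byte with the same value
        return bytes([cp])
    if cp < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        # (top 5 bits go in the lead byte, low 6 bits in the trail byte)
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    raise ValueError("code points >= 0x800 are not handled in this sketch")


for cp in (0x41, 0xF2):
    by_hand = encode_code_point(cp)
    # Cross-check against the built-in encoder
    assert by_hand == chr(cp).encode("utf-8")
    print("U+%04X -> %s" % (cp, " ".join("%02X" % b for b in by_hand)))
# U+0041 -> 41
# U+00F2 -> C3 B2
```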

For more details see:
  http://www.cl.cam.ac.uk/~mgk25/unicode.html (FAQ)
  http://www.ietf.org/rfc/rfc2279.txt (UTF-8)
  http://www.unicode.org/standard/standard.html (Unicode)


I hope this helps.

Cheers,
Simeon.