[OAI-implementers] valid character encoding
Wed, 13 Aug 2003 22:46:48 -0400 (EDT)
On Thu, 14 Aug 2003, Steve Thomas wrote:
> While we're on the topic, I have records with the name Niccolò in them (as in
> Machiavelli) -- there's a grave accent over the final "o". But this doesn't
> seem to be part of UTF-8, or your conditioner doesn't recognise it. (Although
> it displays correctly everywhere.)
> Is this invalid in UTF-8, or ... what?
> When I dump it in Unix, the character is \xf2, apparently.
0xF2 is NOT a valid UTF-8 sequence.
No single byte in the range 0x80--0xFF is a valid UTF-8 sequence. 0xF2 is
the Latin 1, CP1252 and Unicode code for o grave and is represented as a
two-byte sequence in UTF-8 (0xC3 0xB2).
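As a quick check of the point above, here is a minimal sketch in Python
(my choice for illustration only; the helper name is_valid_utf8 is
hypothetical). It asks a strict UTF-8 decoder whether a lone 0xF2 byte is
acceptable, and compares that with the two-byte form 0xC3 0xB2.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

lone_byte = b"Niccol\xf2"       # Latin 1 bytes: 0xF2 on its own
two_bytes = b"Niccol\xc3\xb2"   # UTF-8 bytes: 0xC3 0xB2 for o grave

print(is_valid_utf8(lone_byte))   # False
print(is_valid_utf8(two_bytes))   # True
print(two_bytes.decode("utf-8") == "Niccol\u00f2")  # True
```

A check like this only tells you the bytes are well-formed UTF-8, not that
they mean what you intended.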
If you have data in Latin 1 it is trivial to convert that to UTF-8 but you
must do the conversion before writing XML records for OAI use!
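One way to do the conversion before the XML is written might look like the
sketch below. It assumes Python and its standard xml.etree.ElementTree
module; the function name latin1_record_to_utf8_xml and the element name
"creator" are hypothetical and only there for illustration.

```python
import xml.etree.ElementTree as ET

def latin1_record_to_utf8_xml(tag: str, latin1_data: bytes) -> bytes:
    """Decode Latin 1 bytes first, then serialize the element as UTF-8."""
    text = latin1_data.decode("iso-8859-1")      # bytes -> Unicode characters
    elem = ET.Element(tag)
    elem.text = text
    return ET.tostring(elem, encoding="utf-8")   # Unicode -> UTF-8 bytes

xml_bytes = latin1_record_to_utf8_xml("creator", b"Niccol\xf2 Machiavelli")
print(b"Niccol\xc3\xb2 Machiavelli" in xml_bytes)  # True
```

The key design point is the order: decode from the source character set
into characters, and only then encode to UTF-8 on output.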
There seems to be some confusion about these issues so I'll attempt to
summarize a few key points:
o UTF-8 is a particular ENCODING of Unicode (UCS, ISO 10646). Individual
characters are represented by a sequence of between 1 and 4 bytes (the
original definition allowed up to 6). Any byte >= 0x80 is part of a
multi-byte sequence.
o The ASCII characters (0x00-0x7F) have the same codes in Latin 1 (aka ISO
8859-1) and Unicode. They are also represented by single bytes with the
same values in a UTF-8 stream.
o The non-ASCII Latin 1 characters (0xA0-0xFF) have the same codes in
Unicode. In UTF-8 streams they are encoded as two-byte sequences. (Direct
inclusion of these single bytes in UTF-8 will likely result in invalid
UTF-8 sequences and will certainly not be correctly interpreted.)
o Almost every other character set can be mapped to Unicode but may
o There are libraries and tools to do character set conversion and
encoding in most common languages. For example, Perl (with the Encode
module) permits quite general conversion; say Latin 1 to UTF-8:
    use Encode qw(encode decode);
    $utf8data = encode("utf8", decode("iso-8859-1", $latin1data));
For more details see:
I hope this helps.
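To make the summary points above concrete, the following sketch works
through the byte arithmetic by hand. It assumes Python purely for
illustration, and the function names encode_latin1_code and latin1_to_utf8
are hypothetical. Codes below 0x80 pass through as a single byte; codes
0x80-0xFF become two bytes of the form 110xxxxx 10xxxxxx.

```python
def encode_latin1_code(code: int) -> bytes:
    """Encode one Latin 1 code (0x00-0xFF) as UTF-8 by hand."""
    if not 0 <= code <= 0xFF:
        raise ValueError("not a Latin 1 code")
    if code < 0x80:
        # ASCII range: same single byte in Latin 1, Unicode and UTF-8.
        return bytes([code])
    # Two-byte form: the lead byte carries the top bits of the code,
    # the continuation byte carries the low six bits.
    return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])

def latin1_to_utf8(data: bytes) -> bytes:
    """Convert a whole Latin 1 byte string using the helper above."""
    return b"".join(encode_latin1_code(b) for b in data)

print(encode_latin1_code(0xF2).hex())   # c3b2
print(latin1_to_utf8(b"Niccol\xf2") ==
      b"Niccol\xf2".decode("iso-8859-1").encode("utf-8"))  # True
```

In practice you would use a library conversion rather than hand-rolled bit
operations; the point here is only to show where 0xC3 0xB2 comes from.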