[OAI-implementers] character vs entity references

Thomas G. Habing thabing@uiuc.edu
Fri, 07 Nov 2003 12:05:47 -0600

Thomas G. Habing wrote:

> You need to be careful of characters in the x7F-x9F range.  In Unicode 
> these are all control characters and are forbidden in XML 1.0.  But in 
> many charsets these points are occupied by printable characters, such as 
> in the Windows:Western charset where, for example, x8A is the S with 
> caron, but in Unicode this needs to be converted to x160.  If you just 
> took this character and turned it into entity &#x8A; the resulting XML 
> would not be valid.

Hi all,

I need to amend this slightly.  Characters in the range x7F-x9F are legal in 
XML 1.0 and a compliant parser shouldn't complain about them (although I am 
pretty certain that some earlier XML parsers did complain about characters 
in this range).  In any case, you still need to be careful with these 
characters if you are converting from one of the Windows character sets.  A 
good description of the issue can be found at 
http://www.w3.org/International/questions/qa-controls.html.  Note that XML 
1.1 treats control characters somewhat differently from 1.0: it allows every 
control character except x0, but apart from tab, LF, CR, and x85 they can 
only be represented as numeric character references.
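
To make the conversion issue concrete, here is a minimal Python sketch.  The 
function names are made up for this illustration, and it assumes the input 
bytes really are Windows-1252; it also does not escape &, <, or >, which a 
real converter would have to handle.  The naive version copies the byte value 
straight into a numeric character reference, while the second version decodes 
to Unicode first.

```python
def naive_ncr(raw: bytes) -> str:
    """Wrong for Windows-1252 input: treats each byte value as if it were
    a Unicode code point, so 0x8A becomes a reference to a C1 control."""
    return "".join(chr(b) if b < 0x80 else f"&#x{b:X};" for b in raw)


def cp1252_to_ncr(raw: bytes) -> str:
    """Decode the bytes as Windows-1252 first, then write every non-ASCII
    character as a hexadecimal numeric character reference."""
    text = raw.decode("cp1252")  # maps byte 0x8A to U+0160
    return "".join(ch if ord(ch) < 0x80 else f"&#x{ord(ch):X};" for ch in text)


sample = b"\x8akoda"  # S with caron followed by "koda", in Windows-1252

print(naive_ncr(sample))      # &#x8A;koda   (refers to a control character)
print(cp1252_to_ncr(sample))  # &#x160;koda  (refers to S with caron)
```

Note that Python's cp1252 codec raises UnicodeDecodeError for the few bytes 
that Windows-1252 leaves undefined (for example x81), so real input may need 
an explicit error-handling choice.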

Thomas Habing
Research Programmer, Digital Library Projects
University of Illinois at Urbana-Champaign
155 Grainger Engineering Library Information Center, MC-274
thabing@uiuc.edu, (217) 244-4425