[OAI-implementers] character vs entity references

Thomas G. Habing thabing@uiuc.edu
Wed, 05 Nov 2003 12:30:52 -0600


Heinrich Stamerjohanns wrote:

> On Wed, 5 Nov 2003, Tim Brody wrote:
> 
> 
>>AFAIK a character reference is a reference into the Unicode character set,
>>so it's invalid whether it's in &#xx; form, UTF-8, UTF-16 or whatever.
>>
> 
> I do not know exactly what you mean by that, but "&#241;" is certainly a
> correct character reference. The byte representation of characters
> above 127 is just different (ISO-8859-1: 1 byte, UTF-8: more bytes),
> but the character reference &#241; represents the same character in
> XML(ISO-8859-1) and XML(UTF-8).
> 
> 
>>You should either remove the characters or convert the character to its
>>nearest equivalent in Unicode (for control characters there probably isn't
>>one).
> 
> 
> I remove invalid characters with this (PHP code with a Perl-compatible regex):
> // just remove invalid control characters (everything below x20
> // except tab, line feed and carriage return)
> $pattern = '/[\x00-\x08\x0b-\x0c\x0e-\x1f]/';
> $string = preg_replace($pattern, '', $string);
> 
> Heinrich
> 
> 

You need to be careful with characters in the x7F-x9F range.  In Unicode
these are all control characters; XML 1.0 does not strictly forbid them,
but it discourages them, and they are almost never what was intended.  In
many charsets these code points are occupied by printable characters.  In
the Windows Western charset (windows-1252), for example, x8A is the S with
caron, which in Unicode is U+0160.  If you just took this character and
turned it into the character reference &#x8A;, the resulting XML would
refer to a control character rather than the S with caron; the reference
you want is &#x160;.
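
As a rough sketch of the conversion described above (written in Python
rather than the PHP quoted earlier): the function name
cp1252_bytes_to_xml_text is made up for illustration, and the code assumes
the input really is windows-1252 and that you want ASCII-only output with
numeric character references.  Treat it as one possible approach, not a
complete sanitizer.

```python
import re

# C0 controls that XML 1.0 does not allow, plus DEL and the C1 range
# (x7F-x9F), which are removed here as well because they are almost
# certainly unintended.
UNWANTED_CONTROLS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")

# Markup characters that must be escaped in XML character data.
MARKUP_ESCAPES = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}


def cp1252_bytes_to_xml_text(raw: bytes) -> str:
    """Decode windows-1252 bytes and return ASCII-only XML character data.

    Hypothetical helper for illustration only.
    """
    # Decoding first maps byte x8A to U+0160 (S with caron) instead of
    # leaving it as the control character U+008A.  Bytes that are
    # undefined in windows-1252 become U+FFFD.
    text = raw.decode("cp1252", errors="replace")
    text = UNWANTED_CONTROLS.sub("", text)

    out = []
    for ch in text:
        if ch in MARKUP_ESCAPES:
            out.append(MARKUP_ESCAPES[ch])
        elif ord(ch) > 126:
            # Non-ASCII: emit a hexadecimal character reference using the
            # Unicode code point, not the original byte value.
            out.append("&#x{:X};".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)
```

With this sketch, cp1252_bytes_to_xml_text(b"\x8a") gives "&#x160;" and
not "&#x8A;", because the reference is built from the decoded code point.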

-- 
Thomas Habing
Research Programmer, Digital Library Projects
University of Illinois at Urbana-Champaign
155 Grainger Engineering Library Information Center, MC-274
thabing@uiuc.edu, (217) 244-4425
http://dli.grainger.uiuc.edu