[OAI-implementers] character vs entity references

Heinrich Stamerjohanns stamer@uni-oldenburg.de
Wed, 5 Nov 2003 19:09:11 +0100 (CET)


On Wed, 5 Nov 2003, Tim Brody wrote:

> AFAIK a character reference is a reference into the Unicode character set,
> so it's invalid whether it's in &#xx; form, UTF-8, UTF-16 or whatever.
>
I do not know exactly what you mean by that, but "&#241;" is certainly a
correct character reference. Only the byte representation of characters
above 127 differs (ISO-8859-1: 1 byte, UTF-8: 2 or more bytes);
the character reference &#241; represents the same character (ñ) in
XML (ISO-8859-1) and XML (UTF-8).
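
A quick way to check this is the minimal sketch below. It is written in
Python with the standard xml.etree.ElementTree parser, which is an assumption
of this example and not something used elsewhere in this thread. It parses the
same document declared once as ISO-8859-1 and once as UTF-8, and then shows
that only the encoded bytes of the character differ.

```python
import xml.etree.ElementTree as ET

# The same document in two encodings; only the XML declaration differs,
# because the character reference itself is plain ASCII.
latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><root>&#241;</root>'.encode("iso-8859-1")
utf8_doc = '<?xml version="1.0" encoding="UTF-8"?><root>&#241;</root>'.encode("utf-8")

latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

# Both parses yield the single character U+00F1 (n with tilde).
print(latin1_text == utf8_text == "\u00f1")

# The byte representation of that character is what depends on the encoding.
print("\u00f1".encode("iso-8859-1"))  # one byte: 0xF1
print("\u00f1".encode("utf-8"))       # two bytes: 0xC3 0xB1
```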

> You should either remove the characters or convert the character to its
> nearest equivalent in Unicode (for control characters there probably isn't
> one).

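I remove invalid characters with this (PHP code with a Perl-compatible regex):
// just remove control characters that are invalid in XML 1.0
// (everything below 0x20 except tab 0x09, LF 0x0a and CR 0x0d)
$pattern = '/[\x00-\x08\x0b\x0c\x0e-\x1f]/';
$string = preg_replace($pattern, '', $string);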
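
For comparison, here is a minimal sketch of the same filter in Python (the
language choice, the helper name strip_invalid_controls and the sample string
are assumptions made for illustration). It also shows the parser rejecting a
form feed before filtering and accepting the text afterwards.

```python
import re
import xml.etree.ElementTree as ET

# Control characters that XML 1.0 forbids: everything below 0x20
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
INVALID_XML_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def strip_invalid_controls(text: str) -> str:
    """Return text with XML-1.0-invalid control characters removed."""
    return INVALID_XML_CONTROL.sub("", text)

# Hypothetical sample containing a form feed (CTRL+L, 0x0C).
sample = "page one\x0cpage two\tcol\n"

try:
    ET.fromstring("<root>" + sample + "</root>")
    raw_parses = True
except ET.ParseError:
    raw_parses = False

cleaned = strip_invalid_controls(sample)
root = ET.fromstring("<root>" + cleaned + "</root>")

print(raw_parses)      # the unfiltered text is not well-formed XML
print(repr(cleaned))   # form feed removed, tab and newline kept
```

Removing the characters silently is the simplest policy; replacing them with a
space instead would only need a different second argument to sub().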

Heinrich

> e.g.
> use XML::Parser;
> while(<>) {
> s/([^ -~])/'&#' . ord($1) . ';'/eg;
> print $_, "\n";
> eval { XML::Parser->new->parse("<root>$_</root>"); };
> print $@ if $@;
> }
>
> Creates errors if you input formfeed (CTRL+L).
>
> All the best,
> Tim.
>
> > On Tue, 4 Nov 2003, Ed Summers wrote:
> >
> > > On Tue, Nov 04, 2003 at 09:58:55AM -0500, Todd White wrote:
> > > > $string =~ tr/\0-\x{ff}//UC;
> > >
> > > Search for tr/ in the following pages for some fun Perl archaeology.
> > >
> > >     http://www.perldoc.com/perl5.005_03/pod/perlop.html
> > >     http://www.perldoc.com/perl5.6.0/pod/perlop.html
> > >     http://www.perldoc.com/perl5.6.1/pod/perlop.html
> > >
> > > You can see the UC modifiers were introduced in 5.6.0 and quickly
> > > dropped in 5.6.1 (and in versions thereafter). 5.6.0 is a notoriously
> > > buggy release, I think in part because of its UTF8 handling. These
> > > problems have been fixed in versions >= 5.8.0, which is the first
> > > recommended release of Perl for safely working with UTF8.
> > >
> > > Funny, I always thought Perl held backwards compatibility sacrosanct...
> > > not including Perl6 of course :)
> > >
> > > You might be interested in this list for Perl library folks:
> > > http://perl4lib.perl.org for discussion of Perl esoterica and more.
> > >
> > > //Ed
> > > _______________________________________________
> > > OAI-implementers mailing list
> > > List information, archives, preferences and to unsubscribe:
> > > http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers

--
  Dr. Heinrich Stamerjohanns        Tel. +49-441-798-4276
  Institute for Science Networking  stamer@uni-oldenburg.de
  University of Oldenburg           http://isn.uni-oldenburg.de/~stamer