[OAI-implementers] character vs entity references

Tim Brody tdb01r@ecs.soton.ac.uk
Wed, 5 Nov 2003 17:04:39 -0000


----- Original Message ----- 
From: "Todd White" <tmwhite@merit.edu>


> since i've been "bugging" the list with my recent questions about
> character encoding, i thought i would share the current solution that i've
> implemented in our OAI repository.  it's one line and i added it to a
> function that i had already implemented for processing all data as it
> passes from database to XML...
>
>  $str =~ s/([^ -~])/'&#' . ord($1) . ';'/eg;
>
> this looks for any characters outside of the range from [space] to [tilde]
> and transforms each to its proper character reference.  for example, if an
> n-tilde is encountered, it is transformed into &#241;

AFAIK a character reference is a reference into the Unicode character set,
so a character that XML doesn't allow (most control characters, for one) is
invalid whether it's written in &#xx; form, UTF-8, UTF-16 or whatever.

You should either remove such characters or convert each one to its
nearest equivalent in Unicode (for control characters there probably isn't
one).

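e.g.
use strict;
use warnings;
use XML::Parser;

while (<>) {
    chomp;
    # Todd's substitution: numeric reference for anything outside space..tilde
    s/([^ -~])/'&#' . ord($1) . ';'/eg;
    print $_, "\n";
    # Parse the result as element content; $@ holds the parser error, if any
    eval { XML::Parser->new->parse("<root>$_</root>"); };
    print $@ if $@;
}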

This reports a parse error if the input contains a formfeed (CTRL+L): it
comes out as &#12;, which XML 1.0 does not allow.
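
A minimal sketch of the "remove the characters" option, in case it helps. The
name xml_safe is hypothetical, and I'm assuming the string holds decoded
characters (or plain Latin-1), so that ord() gives the Unicode code point; raw
UTF-8 bytes would need decoding first. It drops anything outside the XML 1.0
Char production and then applies Todd's substitution unchanged. It does not
escape & or <, which I assume happens elsewhere in the existing function.

```perl
use strict;
use warnings;

# Hypothetical helper: strip characters XML 1.0 forbids, then emit numeric
# character references for whatever remains outside space..tilde.
sub xml_safe {
    my ($str) = @_;

    # XML 1.0 Char production:
    #   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    # Anything not in that set (formfeed, NUL, surrogates, ...) is removed.
    $str =~ s/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;

    # Todd's original substitution, now safe because only legal characters
    # are left to be turned into references.
    $str =~ s/([^ -~])/'&#' . ord($1) . ';'/eg;

    return $str;
}

print xml_safe("ma\x{F1}ana"), "\n";   # n-tilde becomes a character reference
print xml_safe("a\x0Cb"), "\n";        # the formfeed is dropped
```

Mapping to a "nearest equivalent" instead of deleting would mean swapping the
first substitution for a lookup table, which is repository-specific.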

All the best,
Tim.

> On Tue, 4 Nov 2003, Ed Summers wrote:
>
> > On Tue, Nov 04, 2003 at 09:58:55AM -0500, Todd White wrote:
> > > $string =~ tr/\0-\x{ff}//UC;
> >
> > Search for tr/ in the following pages for some fun Perl archaeology.
> >
> >     http://www.perldoc.com/perl5.005_03/pod/perlop.html
> >     http://www.perldoc.com/perl5.6.0/pod/perlop.html
> >     http://www.perldoc.com/perl5.6.1/pod/perlop.html
> >
> > You can see the UC modifiers were introduced in 5.6.0 and quickly
> > dropped in 5.6.1 (and in versions thereafter). 5.6.0 is a notoriously
> > buggy release, I think in part because of its UTF8 handling. These
> > problems have been fixed in versions >= 5.8.0, which is the first
> > recommended release of Perl for safely working with UTF8.
> >
> > Funny, I always thought Perl held backwards compatibility sacrosanct...
> > not including Perl6 of course :)
> >
> > You might be interested in this list for Perl library folks:
> > http://perl4lib.perl.org for discussion of Perl esoterica and more.
> >
> > //Ed
> > _______________________________________________
> > OAI-implementers mailing list
> > List information, archives, preferences and to unsubscribe:
> > http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers
> >
> >