[OAI-implementers] character vs entity references

Todd White tmwhite@merit.edu
Wed, 5 Nov 2003 08:37:09 -0500 (EST)


since i've been "bugging" the list with my recent questions about
character encoding, i thought i would share the current solution that i've
implemented in our OAI repository.  it's one line and i added it to a
function that i had already implemented for processing all data as it
passes from database to XML...

 $str =~ s/([^ -~])/'&#' . ord($1) . ';'/eg;

this looks for any characters outside of the range from [space] to [tilde]
and transforms each to its decimal numeric character reference.  for
example, if an n-tilde is encountered, it is transformed into &#241;
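
a minimal sketch of the substitution above in use, for anyone who wants to
try it.  the helper name to_char_refs and the sample string are made up for
illustration; only the s/// line comes from this post.  it also assumes the
data arrives as UTF-8 bytes and is decoded first (Encode ships with
Perl >= 5.8), since ord() only returns a full code point when the string
holds characters rather than raw bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper name; the post itself only shows the one-line
# substitution inside an existing processing function.
sub to_char_refs {
    my ($str) = @_;
    # Replace every character outside the printable ASCII range
    # (space through tilde) with a decimal numeric character reference.
    $str =~ s/([^ -~])/'&#' . ord($1) . ';'/eg;
    return $str;
}

# Assumption: the database returns UTF-8 encoded bytes, so decode them
# into a character string before substituting.
my $bytes = "Espa\xC3\xB1a";
my $chars = decode('UTF-8', $bytes);

print to_char_refs($chars), "\n";   # prints "Espa&#241;a"
```

if the decode step is skipped, each byte of the UTF-8 sequence is converted
separately and the same input comes out as Espa&#195;&#177;a, which is not
the n-tilde.  note also that the character class leaves &, < and > alone
(they fall inside the space-to-tilde range), so markup escaping still has to
happen elsewhere, and that tabs and newlines fall outside the range and are
converted to references too.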


thanks for the help many of you provided!


On Tue, 4 Nov 2003, Ed Summers wrote:

> On Tue, Nov 04, 2003 at 09:58:55AM -0500, Todd White wrote:
> > $string =~ tr/\0-\x{ff}//UC;
> 
> Search for tr/ in the following pages for some fun Perl archaeology. 
> 
>     http://www.perldoc.com/perl5.005_03/pod/perlop.html
>     http://www.perldoc.com/perl5.6.0/pod/perlop.html
>     http://www.perldoc.com/perl5.6.1/pod/perlop.html
> 
> You can see the UC modifiers were introduced in 5.6.0 and quickly 
> dropped in 5.6.1 (and in versions thereafter). 5.6.0 is a notoriously
> buggy release, I think in part because of its UTF8 handling. These
> problems have been fixed in versions >= 5.8.0, which is the first
> recommended release of Perl for safely working with UTF8.
> 
> Funny, I always thought Perl held backwards compatibility sacrosanct...
> not including Perl6 of course :) 
> 
> You might be interested in this list for Perl library folks: 
> http://perl4lib.perl.org for discussion of Perl esoterica and more.
> 
> //Ed