[OAI-implementers] SPECIAL CHARACTERS...

Mark Doyle doyle@aps.org
Wed, 2 Oct 2002 11:47:02 -0400


Greetings,

Please be aware that there seem to be some subtle issues with Perl 5.8,
XML::Parser and XML::DOM that have cropped up on the perl-xml mailing
list. See for instance

http://aspn.activestate.com/ASPN/Mail/Message/perl-xml/1380541

Cheers,
Mark

On Wednesday, October 2, 2002, at 10:27 AM, Simeon Warner wrote:

>
> Tim mentioned encoding support in perl5.8 in his earlier post and I tried
> it out some time ago. It seems pretty good and is probably a good
> solution if the data coming from your database is in a well-defined (and
> supported) encoding such as latin1.
>
> I played with the "from_to" function supplied by the "Encode" module and
> it seems very easy to use. These functions will write multi-byte
> characters instead of entities but that is fine.
>
> The entity encoding of gt, lt, amp and quot is a separate XML issue which
> should be handled by whatever XML writing code you are using.
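As a rough illustration of that separate escaping step, here is a minimal sketch. The helper name xml_escape is hypothetical and not from this thread, and a real application would more likely rely on its XML writing library for this.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): escape the four characters
# mentioned above so text can sit in element content or in a
# double-quoted attribute value.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must run first, or the entities added below get re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

print xml_escape('Smith & Jones <"draft">'), "\n";
# prints: Smith &amp; Jones &lt;&quot;draft&quot;&gt;
```

This step is independent of the character encoding: it only touches ASCII characters, so it can be applied before or after a Latin-1 to UTF-8 conversion.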
>
> Cheers,
> Simeon.
>
>
>
>
>
> Code I played with is below, test with:
>
> simeon@ice ~>echo "çãö" | convert-encoding.pl -f ISO-8859-1 -t utf8
> çãö
> simeon@ice ~>
>
> where the gibberish çãö is actually the correct utf8 bytes displayed
> incorrectly on my terminal. Perhaps octal makes it more obvious:
>
> simeon@ice ~>echo "çãö" | convert-encoding.pl -f ISO-8859-1 -t utf8 | hexdump -c
> 0000000 303 247 303 243 303 266  \n
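The same check can be sketched inside Perl without a terminal in the way. This assumes the test input was the three Latin-1 bytes 0xE7 0xE3 0xF6 (c-cedilla, a-tilde, o-umlaut), which is inferred from the octal dump above rather than stated in the post.

```perl
use strict;
use warnings;
use Encode 'from_to';

# Assumed input: Latin-1 bytes for c-cedilla, a-tilde and o-umlaut.
my $data = "\xE7\xE3\xF6";

# from_to converts the byte string in place.
from_to($data, 'ISO-8859-1', 'utf8');

# Show each resulting byte in octal, the way hexdump -c shows
# non-printing bytes.
my $octal = join ' ', map { sprintf '%03o', $_ } unpack 'C*', $data;
print "$octal\n";
# prints: 303 247 303 243 303 266
```

Each Latin-1 byte becomes a two-byte UTF-8 sequence, which matches the six bytes in the dump.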
>
>
>
> #!/usr/bin/perl5.8.0
> #
> # Convert a byte stream on STDIN from one encoding to another.
> use strict;
> use warnings;
> use Getopt::Std;
> use Encode 'from_to';
> use vars qw($opt_f $opt_t $opt_h);
> my $FROM='utf8';
> my $TO='utf8';
> unless (getopts('f:t:h') && !$opt_h) {
>   die "usage: $0 [-f from] [-t to] [-h]\n
> Convert bytestream from one encoding to another.
>   -f from   set incoming encoding [default $FROM]
>   -t to     set outgoing encoding [default $TO]
>   -h        this help.\n";
> }
> my $from = $opt_f || $FROM;
> my $to = $opt_t || $TO;
>
> binmode(STDIN);  # treat input and output as raw bytes
> binmode(STDOUT);
> undef $/; # make the read slurp the whole input into one string
> my $data=<STDIN>;
> # from_to converts $data in place and returns undef on failure
> defined(from_to($data, $from, $to))
>   or die "$0: cannot convert from $from to $to\n";
> print $data;
>
>
>
> On Wed, 2 Oct 2002, Tim Brody wrote:
>> (Only tested using the Perl expat parser  ...)
>>
>> I don't *think* your solution will cover all situations (e.g., it didn't
>> encode the last of the three example latin characters). Exhaustively
>> testing all 8-bit character codes produces the following regexps, which
>> are required to go from raw 8-bit text to something a UTF-8 parser will
>> accept (i.e. a shot-gun approach):
>>
>> s/&/&amp;/sg;
>> s/</&lt;/sg;
>> s/>/&gt;/sg;
>> s/[\x00-\x08\x0b-\x0c\x0e-\x1f]//sg;
>> s/([\x80-\xff])/sprintf("&#x%04x;",ord($1))/seg;
>>
>> This will delete any control characters that XML does not allow, and
>> entity-encode characters above 127 (note, there are control characters
>> above 127 in the Unicode database but these seem to be accepted by the
>> parser ...).
>>
>> It would still be better to use a proper encoding transform than rely on
>> regexps :-)
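One way the "proper encoding transform" route might look is sketched below. It assumes the source data is Latin-1; the helper name latin1_to_xml_utf8 is hypothetical. It uses the Encode module's decode and encode functions instead of turning high bytes into entities.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (name and Latin-1 input are assumptions): turn
# Latin-1 bytes into UTF-8 bytes that are safe in XML element content.
sub latin1_to_xml_utf8 {
    my ($bytes) = @_;
    my $text = decode('ISO-8859-1', $bytes);       # bytes -> characters
    $text =~ s/[\x00-\x08\x0b\x0c\x0e-\x1f]//g;    # drop control characters XML 1.0 forbids
    $text =~ s/&/&amp;/g;                          # escape markup characters
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return encode('UTF-8', $text);                 # characters -> UTF-8 bytes
}

binmode(STDOUT);    # the result is already a byte string
print latin1_to_xml_utf8("caf\xE9 & <b>\x01"), "\n";
```

The characters above 127 are written as multi-byte UTF-8 sequences rather than numeric references, so only the markup characters need entity encoding.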
>>
>> Regards,
>> Tim.
>>
>> ----- Original Message -----
>> From: "Marina Muilwijk" <m.muilwijk@library.uu.nl>
>> To: "OAI Implementers" <oai-implementers@oaisrv.nsdl.cornell.edu>
>> Sent: Friday, September 27, 2002 2:46 PM
>> Subject: Re: [OAI-implementers] SPECIAL CHARACTERS...
>>
>>
>> On 27 Sep 2002 at 10:06, Ramon Martins Sodoma da Fonseca wrote:
>>
>>> We are having problems with the character encoding.
>>> We need to display special characters, like ", , ", and others, and
>>> our question is:
>>
>> We use Perl's sprintf function. For instance:
>> $creators =~ s/([^<>:a-zA-Z, .\/-])/sprintf "&#x%04X;", ord($1)/eg;
>>
>> which converts everything but the characters within brackets to their
>> hexadecimal value and adds the "&#x" prefix required for an XML numeric
>> character reference.
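A sketch of the same substitution applied to a whole string follows. The helper name to_char_refs and the exact set of "safe" characters are assumptions for illustration. The /g flag makes the substitution replace every match rather than only the first, and the result is only correct if each character's code point is what ord returns, for example for Latin-1 input.

```perl
use strict;
use warnings;

# Hypothetical helper (name and safe set are assumptions): replace every
# character outside a small safe set with a hexadecimal numeric
# character reference.
sub to_char_refs {
    my ($text) = @_;
    $text =~ s/([^a-zA-Z0-9, .\/-])/sprintf('&#x%04X;', ord($1))/ge;
    return $text;
}

print to_char_refs("Fran\xE7ois & Jo\xE3o"), "\n";
# prints: Fran&#x00E7;ois &#x0026; Jo&#x00E3;o
```

Because the ampersand is itself outside the safe set, it is also turned into a reference, so the output contains only ASCII and no bare markup characters.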
>>
>>
>>
>> _______________________________________________
>> OAI-implementers mailing list
>> OAI-implementers@oaisrv.nsdl.cornell.edu
>> http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers