[OAI-implementers] SPECIAL CHARACTERS...

Simeon Warner simeon@cs.cornell.edu
Wed, 2 Oct 2002 10:27:42 -0400 (EDT)


Tim mentioned encoding support in perl5.8 in his earlier post and I tried
it out some time ago. It seems pretty good and is probably a good 
solution if the data coming from your database is in a well-defined (and 
supported) encoding such as latin1.

I played with the "from_to" function supplied by the "Encode" module and
it seems very easy to use. It writes multi-byte UTF-8 characters instead
of numeric character entities, but that is fine in a UTF-8 XML document.
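
As a minimal sketch of what from_to does (the sample bytes and variable
names here are only illustrative, they are not from any real database): it
converts the string in place and returns the new length in bytes, or undef
if the conversion failed.

```perl
use strict;
use warnings;
use Encode 'from_to';

# Illustrative input: c-cedilla, a-tilde, o-umlaut as ISO-8859-1 bytes
# (one byte per character).
my $data = "\xe7\xe3\xf6";

# from_to converts $data in place; it returns the new length in bytes,
# or undef if the conversion failed.
my $len = from_to($data, 'iso-8859-1', 'utf8');
die "conversion failed\n" unless defined $len;

# Show the resulting bytes in hex; each character is now two bytes.
my $hex = join ' ', map { sprintf '%02x', ord } split //, $data;
print "$len bytes: $hex\n";
```

With this input the result should be six bytes, c3 a7 c3 a3 c3 b6, i.e. two
bytes per character.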

The entity encoding of gt, lt, amp and quot is a separate XML issue which
should be handled by whatever XML writing code you are using.
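
If you do end up writing the XML by hand, the escaping step might look
something like the sketch below. The name escape_xml is hypothetical (it is
not from any module), and a real XML writer would normally do this for you.

```perl
use strict;
use warnings;

# Hypothetical helper, not from any module: escape the four characters
# mentioned above so a string is safe in XML text or a quoted attribute.
sub escape_xml {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must run first, or the entities below get re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

print escape_xml('AT&T <"quoted">'), "\n";
```

The order matters: escaping "&" after the others would turn "&lt;" into
"&amp;lt;".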

Cheers,
Simeon.





Code I played with is below, test with:

simeon@ice ~>echo "çãö" | convert-encoding.pl -f ISO-8859-1 -t utf8
Ã§Ã£Ã¶
simeon@ice ~>

where the gibberish Ã§Ã£Ã¶ is actually the correct utf8 bytes displayed
incorrectly on my terminal. Perhaps octal makes it more obvious:

simeon@ice ~>echo "çãö" | convert-encoding.pl -f ISO-8859-1 -t utf8 | hexdump -c
0000000 303 247 303 243 303 266  \n                                    



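#!/usr/bin/perl5.8.0
#
# Convert a byte stream on STDIN from one encoding to another.
# Needs perl 5.8 or later for the Encode module.
use strict;
use Getopt::Std;
use Encode 'from_to';
use vars qw($opt_f $opt_t $opt_h);   # set by getopts()

my $FROM = 'utf8';   # default incoming encoding
my $TO   = 'utf8';   # default outgoing encoding
unless (getopts('f:t:h') && !$opt_h) {
  die "usage: $0 [-f from] [-t to] [-h]\n
Convert bytestream from one encoding to another.
  -f from   set incoming encoding [default $FROM]
  -t to     set outgoing encoding [default $TO]
  -h        this help.\n";
}
my $from = $opt_f || $FROM;
my $to   = $opt_t || $TO;

undef $/;            # slurp mode: read all of STDIN into one string
my $data = <STDIN>;
# from_to converts $data in place and returns undef on failure
defined(from_to($data, $from, $to))
  or die "$0: conversion from $from to $to failed\n";
print $data;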



On Wed, 2 Oct 2002, Tim Brody wrote:
> (Only tested using the Perl expat parser  ...)
> 
> I don't *think* your solution will cover all situations (e.g., it didn't
> encode the last of the three example latin characters). Exhaustively parsing
> all 8-bit character codes produces the following required regexps to go from
> raw any-ascii text to UTF-8 parsable (i.e. a shot-gun approach):
> 
> s/&/&amp;/sg;
> s/</&lt;/sg;
> s/>/&gt;/sg;
> s/[\x00-\x08\x0b-\x0c\x0e-\x1f]//sg;
> s/([\x80-\xff])/sprintf("&#x%04x;",ord($1))/seg;
> 
> This will delete any control characters that aren't valid Unicode, and
> entity-encode characters above 127 (note, there are control characters above
> 127 in the Unicode database but these seem to be accepted by the parser
> ...).
> 
> It would still be better to use a proper encoding transform than rely on
> regexps :-)
> 
> Regards,
> Tim.
> 
> ----- Original Message -----
> From: "Marina Muilwijk" <m.muilwijk@library.uu.nl>
> To: "OAI Implementers" <oai-implementers@oaisrv.nsdl.cornell.edu>
> Sent: Friday, September 27, 2002 2:46 PM
> Subject: Re: [OAI-implementers] SPECIAL CHARACTERS...
> 
> 
> On 27 Sep 2002 at 10:06, Ramon Martins Sodoma da Fonseca wrote:
> 
> > We are having problems with the character encoding.
> > We need to display special charaters, like ", , ", and others, and
> > our question is:
> 
> We use Perl's sprintf function. For instance:
> $creators =~ s/([^<>:a-zA-Z, .\/-])/sprintf "&#x%04X;", ord($1)/ei;
> 
> which converts everything but the characters within brackets to their
> hexadecimal value and adds the "&#X" required for Unicode encoding.
> 
> 
> 
> _______________________________________________
> OAI-implementers mailing list
> OAI-implementers@oaisrv.nsdl.cornell.edu
> http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers
> 