[OAI-implementers] character encoding

Tim Brody tdb01r@ecs.soton.ac.uk
Fri, 31 Oct 2003 12:18:30 -0000


----- Original Message ----- 
From: "Todd White" <tmwhite@merit.edu>


> i sent a message to the list some time ago and, while working on other
> non-XML and non-OAI projects, i've been closely watching the list in hopes
> of finding the solution to my encoding problem.  i'm embarrassed to admit
> that this encoding problem remains.
>
> perhaps i should provide some details...
>
> DATA STORAGE:  Oracle
> DATA DELIVERY:  DBI.pm
> OAI CONSTRUCTOR:  Perl script (using Embperl)
> WEB SERVER:  Apache
>
> in other words, i have a single Perl script, in the form of an Embperl file,
> that draws the data from Oracle, via DBI, then i simply loop through the
> data and wrap each element with the appropriate XML tag before returning
> the whole mess through STDOUT.
>
> i'm guessing that i should encode each character to UTF-8 as it passes
> through the script, but as yet, i'm not sure how to best do this.
>
> any helpful tips, advice, rants, etc. will be most welcome.  i thank you
> in advance.
>
>  -Todd

I strongly urge you to use a 5.8.x version of Perl, as it has built-in
support for UTF-8.

As you are outputting via STDOUT you should use:
binmode(STDOUT,":utf8");
Which is pretty self-explanatory :-)

You need to find out what character encoding your data is in, and convert it
into UTF-8, e.g. if your data is in ISO-8859-1 ("Latin-1, West Europe") you
would do something like:

use Encode; # Functions for converting strings between encodings
use utf8; # Only needed if the program source itself contains UTF-8 literals
use DBI;

my $sth = DBI->connect(...)->prepare("SELECT FROM DB");
$sth->execute;
my ($str_latin) = $sth->fetchrow_array();

# decode() turns the Latin-1 bytes into a Perl character string; the :utf8
# layer on STDOUT then writes those characters out as UTF-8
my $str_utf8 = decode("iso-8859-1",$str_latin);

print $str_utf8; # n.b. you will still need to escape <>"& in string data

__END__
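
The escaping mentioned in the n.b. above could be done with a small helper
along these lines. This is only a sketch: xml_escape is a made-up name (it is
not part of Encode or DBI), and the dc:title element is just an assumed
example of an element you might be wrapping.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in XML
# text and attribute values. "&" is handled first so that the entities
# produced for the other characters are not escaped a second time.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Assumed example element; any element content is escaped the same way.
print "<dc:title>", xml_escape('Fish & "Chips" <1>'), "</dc:title>\n";
```

Escaping should happen after decoding and before the value is wrapped in its
tag, so that the tags themselves are left alone.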

XML 1.0 also restricts control characters, so you may need to do something
like:
$str_utf8 =~ s/[\x00-\x08\x0b-\x1f]//g; # Remove all control characters except tab (\t) and newline (\n)
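
For reference, XML 1.0 permits exactly three characters below \x20: tab
(\x09), line feed (\x0a) and carriage return (\x0d). A slightly more lenient
variant of the substitution that also keeps carriage returns might look like
the sketch below. strip_xml_controls is a hypothetical name, and the sample
string is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: remove the C0 control characters that XML 1.0
# forbids, keeping tab (\x09), line feed (\x0a) and carriage return (\x0d).
sub strip_xml_controls {
    my ($s) = @_;
    $s =~ s/[\x00-\x08\x0b\x0c\x0e-\x1f]//g;
    return $s;
}

# Made-up sample containing NUL, BEL and ESC alongside a tab and a newline.
my $dirty = "one\x00two\x07\tthree\x1b\n";
my $clean = strip_xml_controls($dirty);
print $clean;
```

Whether to keep or drop carriage returns is a matter of taste; an XML parser
will normalise line endings anyway.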

There are quite a few utility functions in Encode for handling encodings, so
it is well worth taking a look at the help page.
(A gotcha I have noticed is that Perl modules written in C may not flag
a string as UTF-8, even though the data is. There are methods in Encode for
changing this flag - but they should be used with caution!)
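
One cautious way to deal with the unflagged-string gotcha is to inspect the
string with is_utf8() and then decode() the bytes, rather than flipping the
flag by hand. The sketch below assumes the module really did hand back valid
UTF-8 bytes; the sample string is made up.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Made-up sample: the UTF-8 bytes for "cafe" with an acute accent on the
# final letter, as a C-backed module might return them (no UTF-8 flag set).
my $bytes = "caf\xc3\xa9";

print "flagged before: ", (is_utf8($bytes) ? "yes" : "no"), "\n";
print "length before:  ", length($bytes), "\n";   # counts bytes: 5

# decode() validates the bytes and returns a proper character string.
my $chars = decode("UTF-8", $bytes);

print "flagged after:  ", (is_utf8($chars) ? "yes" : "no"), "\n";
print "length after:   ", length($chars), "\n";   # counts characters: 4
```

Unlike setting the flag directly, decode() checks the input, so bytes that are
not actually UTF-8 do not end up as a corrupt string.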

All the best,
Tim.