[OAI-implementers] SPECIAL CHARACTERS...

Tim Brody tim@tim.brody.btinternet.co.uk
Wed, 2 Oct 2002 13:05:12 +0100


(Only tested using the Perl expat parser  ...)

I don't *think* your solution will cover all situations (e.g., it didn't
encode the last of the three example Latin characters). Exhaustively parsing
all 8-bit character codes produces the following regexps, which are what is
needed to turn raw 8-bit text into something a UTF-8 XML parser will accept
(i.e. a shot-gun approach):

s/&/&amp;/sg;
s/</&lt;/sg;
s/>/&gt;/sg;
s/[\x00-\x08\x0b-\x0c\x0e-\x1f]//sg;
s/([\x80-\xff])/sprintf("&#x%04x;",ord($1))/seg;

This will delete any control characters that aren't valid in XML, and
entity-encode characters above 127 (note, there are control characters above
127 in the Unicode database but these seem to be accepted by the parser ...).
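
As a rough sketch, the substitutions above could be wrapped in a small helper.
The name xml_escape_latin1 is made up for illustration, and the code assumes
the input is a plain byte string in ISO-8859-1, so that each byte value is
also its Unicode code point:

```perl
use strict;
use warnings;

# Hypothetical helper (not from any library). Assumes $text holds raw
# ISO-8859-1 bytes, so ord() of a byte equals its Unicode code point.
sub xml_escape_latin1 {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must run first, or later entities get re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    # Drop control characters other than tab, LF and CR.
    $text =~ s/[\x00-\x08\x0b\x0c\x0e-\x1f]//g;
    # Turn every byte above 127 into a numeric character reference.
    $text =~ s/([\x80-\xff])/sprintf("&#x%04x;", ord($1))/eg;
    return $text;
}

# "Francois & Sao <Jorg>" with c-cedilla, a-tilde, o-umlaut and a stray \x01
print xml_escape_latin1("Fran\xe7ois & S\xe3o <J\xf6rg>\x01"), "\n";
# prints: Fran&#x00e7;ois &amp; S&#x00e3;o &lt;J&#x00f6;rg&gt;
```

The order matters: escaping "&" has to happen before the last substitution,
otherwise the "&" of each generated "&#x...;" would itself be escaped.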

It would still be better to use a proper encoding transform than rely on
regexps :-)
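
For what it's worth, here is a minimal sketch of what such a transform might
look like with the Encode module (core since Perl 5.8). It assumes the source
data really is ISO-8859-1; the sub name latin1_to_utf8_xml is invented for the
example. Only the XML markup characters are escaped, and the accented
characters are written out as real UTF-8 instead of entities:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper. Assumption: input bytes are ISO-8859-1; swap in the
# real source encoding if it is something else.
sub latin1_to_utf8_xml {
    my ($bytes) = @_;
    my $text = decode('ISO-8859-1', $bytes);    # bytes -> Perl characters
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    # Still drop control characters other than tab, LF and CR.
    $text =~ s/[\x00-\x08\x0b\x0c\x0e-\x1f]//g;
    return encode('UTF-8', $text);              # characters -> UTF-8 bytes
}

binmode(STDOUT);    # we are printing already-encoded bytes
print latin1_to_utf8_xml("Fran\xe7ois & S\xe3o"), "\n";
```

The returned string is UTF-8 bytes, so it can go straight into a response
declared as encoding="UTF-8".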

Regards,
Tim.

----- Original Message -----
From: "Marina Muilwijk" <m.muilwijk@library.uu.nl>
To: "OAI Implementers" <oai-implementers@oaisrv.nsdl.cornell.edu>
Sent: Friday, September 27, 2002 2:46 PM
Subject: Re: [OAI-implementers] SPECIAL CHARACTERS...


On 27 Sep 2002 at 10:06, Ramon Martins Sodoma da Fonseca wrote:

> We are having problems with the character encoding.
> We need to display special characters, like "ç, ã, ö", and others, and
> our question is:

We use Perl's sprintf function. For instance:
$creators =~ s/([^<>:a-zA-Z, .\/-])/sprintf "&#x%04X;", ord($1)/ei;

which converts everything but the characters within brackets to their
hexadecimal value and adds the "&#x" prefix required for a numeric
character reference.


