[OAI-implementers] Error in Request:GetRecord

Henry Stern henry@stern.ca
Sat, 18 Aug 2001 09:34:02 -0300


There is a slightly more elegant solution that you can use to encode the
multi-byte Unicode characters.  I'll paste an excerpt from Bert
Degenhart Drenth's paper "Report on the CIMI XML Dublin Core DTD."

---

UTF-8 encoding (RFC2044)
In UTF-8 encoding wide characters, such as 16 bit Unicode are coded as
multiple single bytes. The most significant bit of the characters is
playing an important role in the encoding in the following way:
 
1. ASCII characters in the range of 0 – 127 remain unchanged.
2. Characters with values higher than 127 start with a character that
   has a number of bits on the left side set to one, followed by a
   single 0 bit. The number of bits that are set to 1 determine the
   total number of bytes in the character e.g. 110 means two bytes,
   1110 means three bytes etc.
3. All continuation bytes have their left two bits set to ‘10’
4. The remaining unused bits contain the bits of the character, written
   from left to right.
 
An example:
 
Byte 1       Byte 2
110 00011    10111100    decodes as    000 1111 1100  (Hexadecimal FC or ü)

(byte one starts with 110, meaning a total of two bytes, byte 2 has the
left two bits set to 10 meaning that it is a continuation byte, the rest
contains the ü)

---
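
The example in the excerpt can be checked with a few lines of Java.
This is only a sketch: the class and method names (Utf8ByHand, encode)
are made up for illustration, and it only covers code points up to
U+FFFF without validating surrogates.  It builds the bytes by hand
using the rules above and compares them with what the JDK produces.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not taken from the quoted paper.
class Utf8ByHand {
    // Encodes one code point in the range 0..0xFFFF using the rules above.
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            // Rule 1: ASCII stays unchanged.
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {
            // Lead byte 110xxxxx, then one continuation byte 10xxxxxx.
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        } else {
            // Lead byte 1110xxxx, then two continuation bytes.
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
    }

    public static void main(String[] args) {
        byte[] byHand = encode(0xFC); // u with diaeresis, as in the example
        byte[] byLibrary = "\u00FC".getBytes(StandardCharsets.UTF_8);
        // Prints the two bytes in binary: 11000011 10111100
        System.out.println(Integer.toBinaryString(byHand[0] & 0xFF) + " "
                + Integer.toBinaryString(byHand[1] & 0xFF));
        // Prints true: the hand-built bytes match the JDK's encoder.
        System.out.println(Arrays.equals(byHand, byLibrary));
    }
}
```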

Since you use Java (I notice the .jsp extension on your repository), you
don't even have to worry about this.  When the OutputStreamWriter for
your response is created, you can tell it which encoding to use.  For
example:

PrintWriter out = new PrintWriter ( new OutputStreamWriter ( 
	response.getOutputStream (), "UTF8" ) );
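
If you want to see what that writer actually emits without deploying
anything, something like the following should do.  It is a sketch
under assumptions: a ByteArrayOutputStream stands in for
response.getOutputStream (), the names Utf8WriterDemo and writeUtf8 are
made up, and it passes StandardCharsets.UTF_8 (available in newer JDKs)
rather than the "UTF8" string so that no checked exception is involved.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the sink stands in for the servlet response stream.
class Utf8WriterDemo {
    // Wraps the sink in a UTF-8 writer, prints the text, and flushes.
    static void writeUtf8(OutputStream sink, String text) {
        PrintWriter out = new PrintWriter(
                new OutputStreamWriter(sink, StandardCharsets.UTF_8));
        out.print(text);
        out.flush();
    }

    public static void main(String[] args) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        writeUtf8(sink, "V\u00E1zquez");
        // Dump the bytes in hex; the accented letter comes out as two bytes.
        for (byte b : sink.toByteArray()) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```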

Caveat Emptor:  This approach may eat up your CPU time, because the
writer re-encodes every character on every response.  To get around
that in the repository that I wrote, I stored the DC records as Unicode
byte-strings in a BLOB in my database.  To write them back out
correctly, I used a PrintStream with OutputStream.write (byte[], int,
int) instead of a PrintWriter.
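
Roughly, the byte-oriented route looks like the sketch below.  The
assumptions here are mine: the stored bytes are taken to be UTF-8, a
byte array built in main stands in for the BLOB read from the database,
a ByteArrayOutputStream stands in for the response stream, and the
names StoredBytesDemo and writeStored are made up.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; no database or servlet container involved.
class StoredBytesDemo {
    // Copies already-encoded bytes to the stream without any re-encoding.
    static void writeStored(PrintStream out, byte[] stored) {
        out.write(stored, 0, stored.length);
        out.flush();
    }

    public static void main(String[] args) {
        // Stand-in for a record that was encoded once and stored as a BLOB.
        byte[] stored = "<dc:creator>Guti\u00E9rrez</dc:creator>"
                .getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        writeStored(new PrintStream(sink), stored);
        // Prints true: the bytes go out exactly as they were stored.
        System.out.println(Arrays.equals(stored, sink.toByteArray()));
    }
}
```

The point of the design is that the encoding work happens once, when
the record is stored, and not on every request.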

I wish you the best of luck with your thesis!

Kind regards,
Henry Stern

---
Flon's Law:
	There is not now, and never will be, a language in
	which it is the least bit difficult to write bad programs.
 

> -----Original Message-----
> From: oai-implementers-admin@oaisrv.nsdl.cornell.edu 
> [mailto:oai-implementers-admin@oaisrv.nsdl.cornell.edu] On 
> Behalf Of Hussein Suleman
> Sent: August 17, 2001 1:28 PM
> To: NAVA M SANDRA EDITH
> Cc: oai-implementers@oaisrv.nsdl.cornell.edu
> Subject: Re: [OAI-implementers] Error in Request:GetRecord
> 
> 
> hi
> 
> NAVA M SANDRA EDITH wrote:
> > now i have a problem with the GetRecord request, i try to use as
> > metadata format xml, and i have defined my xml.xsd, but when i
> > checked in Repository Explorer i have an error:
> 
> its a very common pitfall ... your XML is in UTF-8 but you 
> have a Latin-1 entity in your author field ...
> 
>   <author>Issa Paola V&aacute;zquez Guti&eacute;rrez</author>
> 
> for maximum portability, it is recommended that you convert 
> the Latin-1 entities to Unicode (if you use Perl, as part of 
> my Perl OAI-DP implementation available from the OAI website 
> there is a Utility.pm module that addresses lots of XML 
> issues, including this conversion)
> 
> ultimately you want to get something like:
>   <author>Issa Paola V&#x00E1;zquez Guti&#x00E9;rrez</author>
> 
> a cheap alternative is to escape all ampersands to pass the 
> Latin-1 entities unconverted ... but thats cheating :)
> 
> ttfn
> ----hussein
> 
> -- 
> ========================================================================
> hussein suleman -- hussein@vt.edu -- vtcs -- http://purl.org/net/hussein
> ========================================================================