[OAI-implementers] implementation of non-English characters w/UTF-8?

Chris Wilper cwilper at cs.cornell.edu
Tue Sep 13 16:40:23 EDT 2005


Hi Jewel,

UTF-8 can handle any Unicode character ("e" with an accent, and thousands
more from many languages).  As long as the bytes encoding your characters
constitute valid UTF-8, you should be set.  The problem often arises when you
think you have UTF-8 to begin with, but your source data is actually using
some other encoding.  Often the problem isn't apparent until you get to the
non-ASCII characters, because many encodings (ISO-8859-1 and Windows-1252,
for example) represent "low-ASCII" (byte values 0-127) in the same way that
UTF-8 does.  It sounds like that might be what's happening in your case.
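
One quick way to test that theory is to decode the raw bytes strictly as
UTF-8 and see where it first fails.  Below is a minimal sketch in Python;
the function name is made up for illustration, and the sample bytes are an
assumption about what ISO-8859-1 source data would look like, not taken from
your records.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    Hypothetical helper for illustration: it relies only on Python's
    built-in strict UTF-8 decoder.
    """
    try:
        data.decode("utf-8")  # strict decoding raises on the first bad byte
    except UnicodeDecodeError as err:
        return err.start
    return None


if __name__ == "__main__":
    # "cafe" with an accented e, encoded as ISO-8859-1 (a single 0xE9 byte).
    latin1_bytes = "caf\u00e9 au lait".encode("iso-8859-1")
    # The same text encoded as real UTF-8 (0xC3 0xA9 for the accented e).
    utf8_bytes = "caf\u00e9 au lait".encode("utf-8")

    print(first_invalid_utf8_offset(latin1_bytes))
    print(first_invalid_utf8_offset(utf8_bytes))
```

Pure ASCII input always passes this check, which is why the trouble tends to
show up only at the first accented character.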

If so, the best thing to do (and this is sometimes really hard) is to find
out what encoding the original provider of the file used.  If you know that,
then you can convert it to UTF-8 using a tool designed for that job[1].  If
you're unable to determine what the original encoding was, you can at least
make the file validate by replacing the odd characters with valid (though
probably incorrect) UTF-8 ones[2].
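
Both options can be sketched in a few lines of Python.  This is only an
illustration of the two approaches, not a description of how either tool in
the footnotes works; the function names are hypothetical, and ISO-8859-1 is
just an assumed example of a known source encoding.

```python
def convert_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding into UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


def force_valid_utf8(data: bytes) -> bytes:
    """Make bytes valid UTF-8 when the real encoding is unknown.

    Invalid sequences become U+FFFD (the Unicode replacement character),
    so the result validates but the original characters are lost.
    """
    return data.decode("utf-8", errors="replace").encode("utf-8")


if __name__ == "__main__":
    raw = b"caf\xe9"  # accented e as a single ISO-8859-1 byte

    # Option 1: the source encoding is known, so the character survives.
    print(convert_to_utf8(raw, "iso-8859-1"))

    # Option 2: the source encoding is unknown, so the bad byte is replaced.
    print(force_valid_utf8(raw))
```

The first function preserves the text only if the encoding you name is the
right one; naming the wrong one produces valid UTF-8 containing the wrong
characters, with no error raised.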

- Chris

[1] Like this one that Google told me about:
http://www.chilkatsoft.com/CharsetStudio.asp
[2] Simeon here at Cornell wrote a nice utility for this:
http://www.cs.cornell.edu/people/simeon/software/utf8conditioner/

-----Original Message-----
From: oai-implementers-bounces at openarchives.org on behalf of Jewel Ward
Sent: Tue 9/13/2005 3:29 PM
To: OAI-implementers
Subject: [OAI-implementers] implementation of non-English characters w/UTF-8?
 

How have other people implemented "non-UTF-8" characters in their DP 
records?

Meaning, we have non-English characters that are "choking" when we test 
our Data Provider.  [Think "e" with the accent over it 
http://lib-app1.usc.edu:8085/oaidp?verb=GetRecord&identifier=oai:usc:digitalarchive:bhe/bhe-m27&metadataPrefix=oai_dc 
(surname after first name of "Elmo").]  Eventually, we will have several 
Asian language character sets, as well as the current non-English 
characters.

I have looked over the protocol, looked at various tutorials, the 
oai-implementers archives, and the OAI Best Practices site, and have not 
seen any guidelines other than this thread:

http://www.openarchives.org/pipermail/oai-implementers/2001-April/000093.html

I'm also looking at OLAC and some of the DP implementations in Japan, 
but have not [yet] found the solution.  [Like this: 
http://mitizane.ll.chiba-u.jp/cgi-bin/oai/oai2.0?verb=ListRecords&metadataPrefix=oai_dc 
.]

Will we just have to locate the individual characters that are choking 
and encode those a specific way?

Thanks in advance,

Jewel

-- 
Jewel H. Ward
Program Manager, USC Digital Archive
Leavey Library, Information Services Division
University of Southern California
Tel: (213) 821-2298   Cell: (213) 219-2784

_______________________________________________
OAI-implementers mailing list
List information, archives, preferences and to unsubscribe:
http://www.openarchives.org/mailman/listinfo/oai-implementers

