[OAI-implementers] implementation of non-English characters w/UTF-8?

Simeon Warner simeon at cs.cornell.edu
Tue Sep 13 16:47:09 EDT 2005


Hi Jewel,

Perhaps I'm not understanding your question properly, but I think the
bottom line is that you have to convert whatever it is you have into
Unicode. When you know the Unicode code point you can then either write an
XML numeric entity (e.g. &#252; for code point 252 decimal) or encode it
directly in UTF-8 (a two-byte sequence, 0xC3 0xBC, for this character).
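
As a rough illustration of those two options, here is a minimal Python
sketch. The helper names numeric_entity and utf8_bytes are made up for this
example and are not part of any OAI toolkit.

```python
def numeric_entity(ch: str) -> str:
    """Return the decimal XML numeric character reference for one character."""
    return f"&#{ord(ch)};"


def utf8_bytes(ch: str) -> bytes:
    """Return the raw UTF-8 byte sequence for one character."""
    return ch.encode("utf-8")


u_umlaut = chr(252)                  # Unicode code point 252 (0x00FC)
entity = numeric_entity(u_umlaut)    # option 1: numeric entity, pure ASCII
raw = utf8_bytes(u_umlaut)           # option 2: direct UTF-8 bytes

# Python's built-in "xmlcharrefreplace" error handler does the same
# conversion for every non-ASCII character in a whole string.
ascii_only = "M\u00fcller".encode("ascii", "xmlcharrefreplace")
```

Either form should parse to the same character in an XML document that is
declared as UTF-8; the numeric entity is simply longer on the wire.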

In arXiv, for example, we have legacy data that uses TeX escape sequences
to represent non-ASCII characters in author names and such. Our OAI
interface has code to convert these to numeric entities. For example, the
TeX escape sequence \"u (u-umlaut) is Unicode code point 0x00FC, or 252 in
decimal, which is represented as "&#252;" in the XML. You can see this in
the response from:

http://arxiv.org/oai2?verb=GetRecord&identifier=oai:arXiv.org:hep-th/9901140&metadataPrefix=oai_dc
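
The kind of conversion described above might look something like the
following Python sketch. This is not the arXiv code: the mapping table
TEX_ACCENTS, the pattern TEX_ACCENT_RE and the function tex_to_entities are
hypothetical names, and the table covers only three escapes for
illustration.

```python
import re

# Hypothetical, deliberately tiny table:
# (accent command, base letter) -> Unicode code point.
TEX_ACCENTS = {
    ('"', "u"): 0x00FC,  # u-umlaut
    ('"', "o"): 0x00F6,  # o-umlaut
    ("'", "e"): 0x00E9,  # e-acute
}

# Matches \"u and \'e as well as the braced forms \"{u} and \'{e}.
TEX_ACCENT_RE = re.compile(r"""\\(["'])(?:\{([A-Za-z])\}|([A-Za-z]))""")


def tex_to_entities(text: str) -> str:
    """Replace known TeX accent escapes with decimal numeric entities."""

    def replace(match):
        accent = match.group(1)
        letter = match.group(2) or match.group(3)
        code_point = TEX_ACCENTS.get((accent, letter))
        if code_point is None:
            return match.group(0)  # unknown escape: leave it untouched
        return f"&#{code_point};"

    return TEX_ACCENT_RE.sub(replace, text)
```

For instance, tex_to_entities('M\\"uller') gives 'M&#252;ller'. A real
converter would need a much larger table and would also have to XML-escape
characters such as & and < in the rest of the text.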

My choice to use the somewhat less efficient numeric entities rather than
direct UTF-8 was motivated by debugging ease on systems where tools for
UTF-8 were not really mature.

The site you refer to, http://mitizane.ll.chiba-u.jp/, correctly uses
direct UTF-8 encodings as far as I can see. We need never know what
internal format they use (ah! the beauty of standard interfaces...).

If you have problems locating UTF-8 encoding errors in XML you might find
my little utf8conditioner program helpful:
http://www.cs.cornell.edu/people/simeon/software/utf8conditioner/
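
If you just want a quick check without installing anything, the Python
sketch below reports the byte offsets that are not valid UTF-8. It is an
independent illustration, not a description of how utf8conditioner works,
and the function name find_utf8_errors is made up for this example.

```python
from typing import List, Tuple


def find_utf8_errors(data: bytes) -> List[Tuple[int, bytes]]:
    """Return (offset, offending bytes) for each invalid UTF-8 sequence."""
    errors = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice we decoded.
            start = pos + exc.start
            end = pos + exc.end
            errors.append((start, data[start:end]))
            pos = end  # skip the bad bytes and keep scanning
    return errors


# A lone 0xE9 byte (e-acute in ISO-8859-1) is not valid UTF-8 on its own.
sample = b"caf\xe9 au lait"
problems = find_utf8_errors(sample)
```

A stray single byte in the range 0x80-0xFF, as in the sample, is often a
sign that ISO-8859-1 or Windows-1252 data was written into a response
without being converted to UTF-8 first.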

Cheers,
Simeon


On Tue, 13 Sep 2005, Jewel Ward wrote:
> How have other people implemented "non-UTF-8" characters in their DP
> records?
>
> Meaning, we have non-English characters that are "choking" when we test
> our Data Provider.  [Think "e" with the accent over it
> http://lib-app1.usc.edu:8085/oaidp?verb=GetRecord&identifier=oai:usc:digitalarchive:bhe/bhe-m27&metadataPrefix=oai_dc
> (surname after first name of "Elmo").]  Eventually, we will have several
> Asian language character sets, as well as the current non-English
> characters.
>
> I have looked over the protocol, looked at various tutorials, the
> oai-implementers archives, and the OAI Best Practices site, and have not
> seen any guidelines other than this thread:
>
> http://www.openarchives.org/pipermail/oai-implementers/2001-April/000093.html
>
> I'm also looking at OLAC and some of the DP implementations in Japan,
> but have not [yet] found the solution.  [Like this:
> http://mitizane.ll.chiba-u.jp/cgi-bin/oai/oai2.0?verb=ListRecords&metadataPrefix=oai_dc
> .]
>
> Will we just have to locate the individual characters that are choking
> and encode those a specific way?
>
> Thanks in advance,
>
> Jewel
>
> --
> Jewel H. Ward
> Program Manager, USC Digital Archive
> Leavey Library, Information Services Division
> University of Southern California
> Tel: (213) 821-2298   Cell: (213) 219-2784
>
> _______________________________________________
> OAI-implementers mailing list
> List information, archives, preferences and to unsubscribe:
> http://www.openarchives.org/mailman/listinfo/oai-implementers
>


