[OAI-implementers] Special characters, UNICODE, and OAI tools

Mon, 12 Feb 2001 19:17:21 -0500 (EST)

I am exploring what happens to special characters when our records are
viewed through a variety of browsers, Hussein's Repository Explorer, and
ARC.  This is partly to have a reliable way to check the results of our
mapping and partly to be able to explain to others why things look odd
when they try them.

Hussein, 

Can you confirm that you are doing nothing to the Unicode numeric
character references (the &#...; entities) in your Raw XML view?  That is
what it looks like when I look at the page source.

My Netscape 4.7 for Windows (under Windows 95) appears to replace the &#
by ? before displaying the entity.  My Mac version of Netscape appears to
handle some of the entities (i and a with acute accents, for example) but
not others.  Internet Explorer 5.5 for Windows is handling all the
characters I have looked at so far.  [Viewing the Raw XML in IE 5.5 looks
like the best approach for checking the mapping of known characters.]
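
A browser-independent cross-check is to let an XML parser resolve the
numeric character references and then print the resulting code points.
The following is only a minimal sketch in Python; the sample fragment and
the function name decode_refs are made up for illustration and are not
taken from our records.

```python
import xml.etree.ElementTree as ET


def decode_refs(fragment):
    """Parse an XML fragment and return (character, code point) pairs
    for its text, after the parser has resolved &#...; references."""
    text = ET.fromstring(fragment).text
    return [(ch, "U+%04X" % ord(ch)) for ch in text]


# Hypothetical sample: i-acute, a-acute and c-caron as decimal references.
sample = "<title>&#237;&#225;&#269;</title>"
decoded = decode_refs(sample)
for ch, code_point in decoded:
    print(code_point, ch)
```

If the code points printed match the characters intended by the mapping,
any oddity seen on screen is a display problem rather than a data problem.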

What do you do in the parsed view?  I'm getting strange character
combinations in both Netscape 4.7 and Internet Explorer 5.5.
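
One common cause of such combinations, which I have not confirmed is what
is happening here, is UTF-8 output being displayed as if it were
ISO-8859-1, so that each accented letter turns into two odd characters.
A small Python sketch of that effect (the function name is hypothetical):

```python
def as_latin1(text):
    """Encode text as UTF-8, then misread the bytes as ISO-8859-1,
    which is what a browser does if it assumes the wrong charset."""
    return text.encode("utf-8").decode("latin-1")


# i-acute (U+00ED) is two bytes in UTF-8: 0xC3 0xAD.
garbled = as_latin1("\u00ed")
print([hex(ord(ch)) for ch in garbled])
```

If the parsed view shows an A with a tilde or diaeresis followed by a
second stray character where one accented letter was expected, this kind
of charset mismatch is worth ruling out.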

Liu,    

Several (although not all) special characters are coming through when I
use ARC with Netscape 4.7 on Windows.  Internet Explorer 5.5 doesn't do
any better than Netscape 4.7.  Also not coming through are a few "XML
sanity" entities, which we have been expressing as "old-fashioned" named
character entities.  I don't claim to be an XML character encoding
expert; for OAI we accepted the recommendation of our standards office to
keep using this handful of character entities (e.g. &apos;) in that form.
What do others think the practice should be on these?  They presumably
validate against the schema, because they get through Hussein's Explorer.
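
For what it is worth, &apos; is one of the five entities predefined by
XML itself (&amp; &lt; &gt; &apos; &quot;), so an XML parser accepts it
without any DTD, whereas other named entities are rejected unless they
are declared.  HTML 4 does not define &apos;, which may be why a tool that
passes it through to a browser shows it unresolved.  A minimal Python
sketch, using made-up fragments and a hypothetical helper parse_text:

```python
import xml.etree.ElementTree as ET


def parse_text(fragment):
    """Return the text content of a one-element XML fragment."""
    return ET.fromstring(fragment).text


# &apos; is predefined in XML, so this parses with no DTD.
apostrophe = parse_text("<title>d&apos;une</title>")
print(apostrophe)

# &eacute; is an HTML entity, not an XML one; without a declaration
# the parser reports an undefined entity.
try:
    parse_text("<title>&eacute;</title>")
    undefined_ok = True
except ET.ParseError:
    undefined_ok = False
print(undefined_ok)
```

This only shows what a conforming XML parser does; it says nothing about
what ARC or the Repository Explorer do with the text afterwards.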

Sample GetRecord URLs that show the issues are:

http://memory.loc.gov/cgi-bin/oai1_0?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lcoa1:loc.music/musdi.213

  Title includes apostrophe in     d'une

and 

http://memory.loc.gov/cgi-bin/oai1_0?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lcoa1:loc.music/musdi.215

  Four special Czech characters (regular letters with diacritics)

   Any thoughts and experiences welcome.  

   Thanks.                       Caroline Arms              caar@loc.gov
                                 National Digital Library Program
                                 Library of Congress