[OAI-implementers] Message 2 on mapping from MARC/ALA to Unicode

Caroline Arms caar@loc.gov
Sun, 8 Jul 2001 19:00:01 -0400 (EDT)

This is a follow-up message to the earlier internal message I forwarded 
about character conversion from MARC records.  The statistics mentioned
are based on records in the Library of Congress catalog.

       Caroline Arms                               caar@loc.gov
       National Digital Library Program
       Library of Congress


WGL4 Fonts
        For the last several years Microsoft has been promoting a
character set of some 652 characters that it regards as a Pan-European
character set.  It is far more extensive than what we used to know as
the Latin-1, Windows ANSI, or ANSEL character sets.  It is called the
Windows Glyph List 4 (WGL4) character set and includes many special
characters, diacritics, and precomposed characters with diacritics.
It also includes Greek and Cyrillic.  WGL4 is a proper subset of
UNICODE, and each of the characters is known by its UNICODE codepoint.
        Fonts which are restricted to the WGL4 character set are much
smaller than 'full UNICODE' fonts; they are therefore much less
cumbersome and tend to load and display faster.  WGL4 is the de facto
standard for all recent or new Windows computers, and, going forward
at least, most or all computers can be expected to have all of the
WGL4 characters automatically or off-the-shelf without having to do
any downloads or special font installations.
        Each of the current versions of Microsoft's Core Fonts for the
Web uses the WGL4 character set.  Core Fonts include Andale Mono,
Trebuchet MS, Georgia, Verdana, Arial, Arial Black, Impact, Times New
Roman, and Courier New.  These are professionally designed and
carefully drawn fonts.  You can go to
to check to see if your version numbers of these fonts are the current    
ones, and to download and install an upgrade if necessary.  Try Times
New Roman and Courier New for starters; the current version number on
each of these is v. 2.82.
        Try one or more of these WGL4 fonts in the ALA to UNICODE
mapping tests, either the browser one from yesterday or the Access one
from earlier today.  Get a sense for how many of the codepoints are
covered by WGL4.
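        One way to get that sense without eyeballing a font is to
check the mapping's target codepoints against a list of the codepoints
in the repertoire.  The following is only a rough sketch: the set
SAMPLE_REPERTOIRE is an illustrative stand-in (three Unicode ranges),
not the real WGL4 list, and the targets are made-up examples rather
than entries from the actual ALA mapping table.

```python
import unicodedata

# Illustrative stand-in only, NOT the real WGL4 list: Basic Latin,
# Latin-1 Supplement, and Latin Extended-A.
SAMPLE_REPERTOIRE = set(range(0x0020, 0x007F)) | set(range(0x00A0, 0x0180))


def split_by_coverage(codepoints, repertoire):
    """Return (inside, outside): sorted codepoints in / not in the repertoire."""
    unique = set(codepoints)
    inside = sorted(cp for cp in unique if cp in repertoire)
    outside = sorted(cp for cp in unique if cp not in repertoire)
    return inside, outside


def describe(cp):
    """Format a codepoint as 'U+XXXX NAME' for a readable report."""
    return f"U+{cp:04X} {unicodedata.name(chr(cp), '<unnamed>')}"


# Hypothetical mapping targets, chosen only to show both outcomes.
targets = [0x00C6, 0x0141, 0x01A0, 0x02BB]
inside, outside = split_by_coverage(targets, SAMPLE_REPERTOIRE)

for cp in outside:
    print("not covered:", describe(cp))
```

Substituting the full list of WGL4 codepoints for the stand-in set
would give the real answer for a given mapping table.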
        Jim Agenbroad has recently taken a look at a character set
analysis produced by my old JLGCHR program a couple of years ago. This
analysis covered only romanized data, not vernacular.  He went through
the character counts (where a character is either unaccented or is a
character-with-diacritics) to determine which ones are included in the
WGL4 character set.  That is, he was looking at the bibliographic
character set as it actually occurs in the data.
        Initial conclusion:  99.12% of the bib data is restricted to
the WGL4 character set.  Put another way, less than 1% (0.88%) of the
bib data needs characters that are outside of the WGL4 character set,
and therefore less than 1% of the data requires anything beyond the
basic fonts that are standard on current computers.
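
        A count of this kind could be sketched as follows.  This is
not the JLGCHR program, whose internals are not described here; it
assumes the data is already in UNICODE with combining diacritics
following their base letter, that the percentage is taken over
character occurrences, and that a character-with-diacritics counts as
covered only if it composes to a single codepoint in the repertoire.
SAMPLE_SET is again an illustrative stand-in, not the real WGL4 list.

```python
import unicodedata
from collections import Counter

# Illustrative stand-in only, NOT the real WGL4 list.
SAMPLE_SET = set(range(0x0020, 0x007F)) | set(range(0x00A0, 0x0180))


def units(text):
    """Yield each base character with its combining marks, NFC-composed."""
    cluster = ""
    for ch in text:
        # A non-combining character starts a new unit.
        if cluster and unicodedata.combining(ch) == 0:
            yield unicodedata.normalize("NFC", cluster)
            cluster = ""
        cluster += ch
    if cluster:
        yield unicodedata.normalize("NFC", cluster)


def coverage_percent(counts, repertoire):
    """Percentage of character occurrences representable in the repertoire."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for unit, n in counts.items()
                  if len(unit) == 1 and ord(unit) in repertoire)
    return 100.0 * covered / total


# Made-up sample: e + acute composes to U+00E9 (covered by the stand-in),
# a + dot below composes to U+1EA1 (not covered by the stand-in).
counts = Counter(units("cafe\u0301 Ha\u0323"))
pct = coverage_percent(counts, SAMPLE_SET)
print(f"{pct:.2f}% of character occurrences are covered")
```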

- - Jim Godwin