[OAI-implementers] Message 2 on mapping from MARC/ALA to Unicode
Sun, 8 Jul 2001 19:00:01 -0400 (EDT)
This is a follow-up message to the earlier internal message I forwarded
about character conversion from MARC records. The statistics mentioned
are based on records in the Library of Congress catalog.
Caroline Arms email@example.com
National Digital Library Program
Library of Congress
For the last several years Microsoft has been promoting a
character set regarded as a PanEuropean character set consisting of
some 652 characters. That is, it is far more extensive than the
character sets we used to know as Latin-1, Windows ANSI, or ANSEL.
It is called the Windows Glyph List 4 (WGL4) character set and
includes many special characters, diacritics, and precomposed
characters with diacritics. It also includes Greek and Cyrillic.
WGL4 is a proper subset of UNICODE, and each of the characters is
known by its UNICODE codepoint.
Fonts which are restricted to the WGL4 character set are much
smaller than 'full UNICODE' fonts and are therefore much less
cumbersome and tend to make loading and display faster. The WGL4 is
the de facto standard for all recent or new Windows computers, and,
going forward at least, most or all computers can be expected to have
all of the WGL4 characters automatically or off-the-shelf without
having to do any downloads or special font installations.
Each of the current versions of Microsoft's Core Fonts for the
Web uses the WGL4 character set. Core Fonts include Andale Mono,
Trebuchet MS, Georgia, Verdana, Arial, Arial Black, Impact, Times New
Roman, and Courier New. These are professionally designed and
carefully drawn fonts. You can go to
to check to see if your version numbers of these fonts are the current
ones, and to download and install an upgrade if necessary. Try Times
New Roman and Courier New for starters; the current version number on
each of these is v. 2.82.
Try one or more of these WGL4 fonts in the ALA to UNICODE
mapping tests, either the browser one from yesterday or the Access one
from earlier today. Get a sense for how many of the codepoints are
covered by WGL4.
Jim Agenbroad has recently taken a look at a character set
analysis produced by my old JLGCHR program a couple of years ago. This
analysis covered only romanized data, not vernacular. He went through
the character counts (where a character is either unaccented or is a
character-with-diacritics) to determine which ones are included in the
WGL4 character set. That is, he was looking at the bibliographic
character set as it actually occurs in the data.
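To make the counting rule concrete, here is a minimal sketch in Python
of one way such a count could be done. It is not the JLGCHR program,
whose internals are not described here; the function name count_units
is hypothetical, and it assumes the data has already been converted to
UNICODE. A base letter plus any combining diacritics that follow it is
counted as one character, and is stored in precomposed form where
UNICODE has one.

```python
import unicodedata
from collections import Counter


def count_units(text):
    """Count characters, treating a base letter plus its diacritics as one unit.

    Hypothetical helper for illustration only. Input is assumed to be
    Unicode text already; MARC-8/ANSEL decoding is out of scope here.
    """
    counts = Counter()
    unit = ""
    # Decompose first so precomposed and decomposed input count the same way.
    for ch in unicodedata.normalize("NFD", text):
        if unit and unicodedata.combining(ch):
            # Combining mark: attach it to the current base character.
            unit += ch
        else:
            if unit:
                counts[unicodedata.normalize("NFC", unit)] += 1
            unit = ch
    if unit:
        counts[unicodedata.normalize("NFC", unit)] += 1
    return counts


if __name__ == "__main__":
    for unit, n in sorted(count_units("Dvor\u030ca\u0301k").items()):
        print(" ".join(f"U+{ord(c):04X}" for c in unit), n)
```

Recomposing each unit matters for the WGL4 question, because a font
lookup is by codepoint: a letter-plus-diacritic pair that has a
precomposed codepoint can be tested directly against the list.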
Initial conclusion: 99.12% of the bib data is restricted to
the WGL4 character set. Put another way, less than 1% of the bib data
needs characters that are outside of the WGL4 character set, and
therefore less than 1% of the data requires anything beyond the basic
fonts that are standard on current computers.
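The percentage is simple weighted arithmetic over the character
counts. The sketch below, again in Python, shows the calculation under
stated assumptions: coverage_percent is a hypothetical name, the
counts are made-up sample numbers rather than the real catalog
statistics, and the covered set is a tiny stand-in for the full WGL4
codepoint list, which would have to be loaded from the published
table.

```python
def coverage_percent(counts, covered):
    """Return the percentage of character occurrences inside a codepoint set.

    counts  -- mapping of character (possibly base + combining marks) to
               the number of times it occurs in the data
    covered -- set of integer codepoints, e.g. the WGL4 list (assumed to
               be supplied by the caller; not bundled here)
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # A character counts as covered only if every codepoint in it is covered.
    inside = sum(
        n for unit, n in counts.items()
        if all(ord(c) in covered for c in unit)
    )
    return 100.0 * inside / total


if __name__ == "__main__":
    # Illustrative numbers only, not the Library of Congress figures.
    sample_counts = {"a": 990, "\u1ecd": 10}   # U+1ECD: o with dot below
    sample_covered = {0x0061}                  # stand-in for the WGL4 list
    print(coverage_percent(sample_counts, sample_covered))
```

Because the result is weighted by how often each character occurs, a
handful of rare characters outside the list moves the figure only
slightly, which is the sense in which the 99.12% above should be read.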
- - Jim Godwin