case mappings, character sets & character properties
Alex Shinn 10 Feb 2004 03:21 UTC
First, a few notes on terminology.
"Letter," by all standard definitions and consistent with Unicode
usage, refers specifically to an element of an alphabet. It therefore
would not apply to syllabic or ideographic characters.
"Ideograph" applied to all Han characters is technically incorrect.
Linguists prefer the term "sinogram," which refers to Chinese-derived
characters. "Sinogram" covers every sense in which "ideograph" is being
used in these discussions (at least until Unicode adds hieroglyphs).
Since the use of "ideograph" is fairly ubiquitous, however, it may not
be worth fighting it.
The character property suggested by char-letter? as the union of
alphabetic, syllabic and ideographic characters seems roughly equivalent
to the natural language (non-computer-encoded) notion of "character."
It should probably be named something like char-linguistic-character?.
The notion is vague and will almost certainly be handled by lookup
tables of Unicode data - I don't think we need this for basic Scheme
text processing.
The concept of case is orthogonal to being alphabetic. There are
alphabetic characters with no case, and (Unicode-classified) symbols
which are given case mappings such as Circled-A (U+24B6).
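Both halves of that claim can be checked against the Unicode data. The
sketch below uses Python's standard unicodedata module purely for
illustration; it is not a proposed Scheme API, and the helper name
has_case is made up for this example.

```python
import unicodedata

def has_case(ch):
    """True if ch changes under upper- or lower-casing (illustrative helper)."""
    return ch.upper() != ch or ch.lower() != ch

circled_a = "\u24b6"   # CIRCLED LATIN CAPITAL LETTER A
alef = "\u05d0"        # HEBREW LETTER ALEF

# A symbol (general category So) that nevertheless has a case mapping.
symbol_with_case = (unicodedata.category(circled_a), has_case(circled_a))

# A letter (general category Lo) that has no case at all.
letter_without_case = (unicodedata.category(alef), has_case(alef))
```

Here the general category stands in for "Unicode-classified"; other
Unicode properties may classify the same characters differently.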
Defining anything in terms of character-level case procedures seems like
a bad idea, since any individual character can map to 0-3 characters
(German eszett, although the most famous, is not by any means the only
exception here). However, because Scheme itself and many formats and
protocols make use of basic ASCII case operations, it is worthwhile to
include these in the Scheme core. A possible way to break these up is:
  char-*   => core Scheme character case-mapping (ASCII-only)
  string-* => SRFI-13 string case-mapping (ASCII-only)
  text-*   => SRFI-XX full linguistic string case-mapping w/ locale
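To make the split concrete: a full mapping can change the length of a
string, while an ASCII-only character mapping is always one-to-one. The
following is a minimal Python sketch, not Scheme. ascii_char_upcase is a
hypothetical stand-in for an ASCII-only char-upcase, and Python's
str.upper stands in for a full mapping (without the locale handling a
text-* layer would need).

```python
def ascii_char_upcase(ch):
    """ASCII-only: one character in, one character out (hypothetical)."""
    if "a" <= ch <= "z":
        return chr(ord(ch) - 32)
    return ch

# Full Unicode mappings may expand a single character.
eszett_upper = "\u00df".upper()      # LATIN SMALL LETTER SHARP S
ligature_upper = "\ufb03".upper()    # LATIN SMALL LIGATURE FFI

# The ASCII-only mapping leaves non-ASCII characters untouched.
ascii_results = [ascii_char_upcase(c) for c in "a\u00dfz"]
```

The eszett expands to two characters and the ligature to three, which
is why a character-to-character procedure cannot express them.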
Schemes that wish to provide a full linguistic folding of
identifiers definitely want some sort of locale-neutral folding. I
posted the general possible combinations on c.l.s. earlier. Unicode
does define locale-neutral case-foldings which are a subset of those
combinations - they break it down into whether or not you unify Turkish
i (and ignore other accent marks), and whether or not to allow folding
to more than one character (as an optimization). The "one-character"
folding seems fairly arbitrary and undesirable if you're going the whole
hog anyway. Regardless of the folding, I like the string->symbol-name
idea.
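As an illustration of locale-neutral folding, here is a small Python
sketch. Python's str.casefold applies a locale-neutral folding that may
produce more than one character and does not apply the Turkish rules.
fold_identifier is a hypothetical analogue of the string->symbol-name
idea, not its actual definition.

```python
def fold_identifier(name):
    """Locale-neutral full folding of an identifier (hypothetical helper)."""
    return name.casefold()

# Full folding may expand one character into several.
same_symbol = fold_identifier("Stra\u00dfe") == fold_identifier("STRASSE")

# Without the Turkish rules, dotted capital I folds to "i" followed by
# COMBINING DOT ABOVE, so it stays distinct from plain "I" (which folds to "i").
dotted_i = fold_identifier("\u0130")
plain_i = fold_identifier("I")
```

A one-character folding could not treat the two spellings of "Strasse"
as the same identifier, since the eszett would have to stay a single
character.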
Core Unicode character properties can be provided as SRFI-14 char-sets.
Additional properties may be better provided as introspection on the UCD
(Unicode Character Database). The question remains how to handle R5RS
character predicates related to these values:
* char-alphabetic? char
* char-numeric? char
* char-whitespace? char ; rename to char-white-space? please!
* char-upper-case? char ; rename to char-uppercase? please!
* char-lower-case? char ; rename to char-lowercase? please!
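For comparison, a property exposed as a char-set can be modelled as a
plain set of characters derived from the UCD. This is a rough Python
sketch, assuming that general category Nd is what a Unicode-aware
char-numeric? would test, and limited to the Basic Multilingual Plane to
keep it small. char_set_for_category is a made-up name, not part of
SRFI-14.

```python
import unicodedata

def char_set_for_category(category, limit=0x10000):
    """Collect all characters below `limit` whose general category matches."""
    return frozenset(
        chr(cp) for cp in range(limit)
        if unicodedata.category(chr(cp)) == category
    )

decimal_digits = char_set_for_category("Nd")

ascii_digit_member = "5" in decimal_digits
arabic_digit_member = "\u0663" in decimal_digits   # ARABIC-INDIC DIGIT THREE
letter_member = "a" in decimal_digits
```

The resulting set holds far more than the ten ASCII digits, which gives
a feel for the table size behind even a simple predicate.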
As mentioned, it can be useful to have these functioning on pure ASCII
for use in parsers and tools for common protocols. Moreover, the
Unicode equivalents are often very expensive (if not in time then in
space). Should a Scheme that wants to provide the full Unicode
equivalents of these extend the core procedures or should we define
disjoint procedures such as
* char-unicode-alphabetic?
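One way to picture the "disjoint procedures" option is to keep the core
predicates ASCII-only and put the Unicode-aware tests under separate
names. The Python sketch below is only an illustration: all four
function names are hypothetical renderings of the Scheme names, and the
Unicode versions use general categories as an assumption about what
they would test.

```python
import unicodedata

def char_alphabetic(ch):
    """ASCII-only: cheap range checks, no tables."""
    return "a" <= ch <= "z" or "A" <= ch <= "Z"

def char_numeric(ch):
    """ASCII-only decimal digits."""
    return "0" <= ch <= "9"

def char_unicode_alphabetic(ch):
    """Assumption: any general category starting with L counts as alphabetic."""
    return unicodedata.category(ch).startswith("L")

def char_unicode_numeric(ch):
    """Assumption: general category Nd (decimal digit)."""
    return unicodedata.category(ch) == "Nd"

e_acute = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE
arabic_three = "\u0663"   # ARABIC-INDIC DIGIT THREE
```

With this split, a protocol parser can rely on char_alphabetic rejecting
e_acute, while text-processing code opts in to the table-backed
versions explicitly.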
--
Alex