Re: introduction Bradd W. Szonye 11 Feb 2004 00:26 UTC

> Tom Lord wrote:
>> [*] What exactly is a "Unicode character?"  The answer can vary
>>     depending on context.  In some contexts it might mean a Unicode
>>     abstract character -- the kind of value to which a codepoint
>>     (integer in the range 0..10ffff) is assigned.  In other contexts,
>>     it may mean certain kinds of sequences of abstract characters.
>>
>>     One goal for SRFI-52 is to remain agnostic about the answer
>>     to that question.

Robby Findler wrote:
> I'm still relatively new to unicode, so I apologize if this is a
> foolish question (rtfm ptrs welcome!), but I wonder why you would want
> to remain agnostic on this point. Can you explain why unicode-code
> points would be a bad choice, and what other choices might exist?

Short version: In general, a single character on your screen may
actually be made of several Unicode code points. For example, the
grapheme[*] é (small E with acute accent) can be encoded as a base
character (small E) plus a combining mark (acute accent).
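
A small sketch of the two encodings, for illustration only. Python and
its standard unicodedata module are assumptions made for this example;
nothing in SRFI-52 depends on them.

```python
import unicodedata

precomposed = "\u00e9"   # é as one code point (LATIN SMALL LETTER E WITH ACUTE)
decomposed = "e\u0301"   # small E followed by COMBINING ACUTE ACCENT

# Same grapheme on screen, different code point sequences in memory.
assert precomposed != decomposed
assert len(precomposed) == 1
assert len(decomposed) == 2

# Unicode normalization converts between the two forms.
assert unicodedata.normalize("NFD", precomposed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
```

Python strings are sequences of code points, so len() counts code
points here, not graphemes.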

Most Unicode string implementations use code points as the basic
"character" unit. In those systems, a decomposed é is one symbol on
screen but two "character" units in memory. Other systems combine the
code points much
earlier, such that é is only one "character" unit both on-screen and
in-memory. (For example, Bear's scheme stores characters as bignums with
each code point stored as a "big digit.")
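
The difference between the two views can be sketched roughly as
follows. The graphemes() helper is hypothetical and deliberately
simplified: it only attaches combining marks to the preceding base code
point, whereas real grapheme segmentation (Unicode Standard Annex #29)
has more rules. It is not a description of how Bear's scheme or any
other implementation works.

```python
import unicodedata

def graphemes(text):
    """Group each base code point with the combining marks that follow it.

    Simplified illustration only; not full UAX #29 segmentation.
    """
    units = []
    for ch in text:
        # A nonzero combining class marks a combining character.
        if units and unicodedata.combining(ch) != 0:
            units[-1] += ch
        else:
            units.append(ch)
    return units

text = "re\u0301sume\u0301"   # "résumé" with both accents decomposed

code_points = list(text)      # the "unit is code point" view
units = graphemes(text)       # the "unit is grapheme" view

assert len(code_points) == 8
assert len(units) == 6

# Index 1 is a bare "e" in one view and a whole "é" in the other.
assert code_points[1] == "e"
assert units[1] == "e\u0301"
```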

There are advantages and disadvantages to both approaches. The "unit is
code point" method makes it harder to index and mutate strings by
grapheme, and it makes procedures like char-upcase nonsensical (because
a single unit is, in general, only part of a grapheme). The "unit is
grapheme" approach avoids most of that -- although letters like ß are
still a problem for case-folding -- but generally requires more space
to store the same data.
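
A rough illustration of the case-conversion problems, again using
Python as a stand-in (an assumption for this example; Scheme's
char-upcase is only being imitated by str.upper() on a one-unit
string):

```python
# Upcasing one code point at a time: a lone combining mark has no
# uppercase form, so the "character" unit passes through unchanged.
combining_acute = "\u0301"
assert combining_acute.upper() == combining_acute

# Slicing by code point can cut a grapheme in half.
decomposed = "e\u0301"           # é as base letter + combining mark
assert decomposed[:1] == "e"     # the accent is silently lost

# Even with whole graphemes as units, ß (U+00DF) is awkward: its
# uppercase form under full case mapping is two letters, so a
# character-to-character upcase cannot represent the result.
sharp_s = "\u00df"
upper_sharp_s = sharp_s.upper()
assert upper_sharp_s == "SS"
assert len(upper_sharp_s) == 2
```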

[*] "Grapheme" is the name for "what humans think of when you talk about
    characters," more or less.
--
Bradd W. Szonye
http://www.szonye.com/bradd