Re: Why are byte ports "ports" as such? John Cowan 24 May 2006 12:35 UTC

Jorgen Schaefer scripsit:

> The argument for the latter is that, in Unicode, a "character" (a
> vague term, as John Cowan repeatedly pointed out) might very well
> be a number of code points, so you need to store something like a
> string anyways. This is the idea that a "character" is a grapheme
> cluster. It's of course trivial to provide an API for information
> about the first (nth) grapheme cluster in a string, which an
> editor can use to provide Emacs' C-x = feature.

The trouble is, it's far from clear whether grapheme clusters have
much to do with what users see as characters (assuming that users
do have a uniform vision on this point, which is far from certain).

Consider the Devanagari sequence usually transliterated "kshe".  This
is four codepoints (KA, VIRAMA, SHA, E), two grapheme clusters
(KA-VIRAMA, SHA-E), conceptually either two letters (KSHA, E) or
three (KA, SHA, E), and is rendered as a single glyph.

How many "characters" does it consist of?
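
To make the counts above concrete, here is a minimal Python sketch. It
assumes the third code point is U+0937 (DEVANAGARI LETTER SSA), which is
the letter informally called SHA above. The helper simple_clusters is
hypothetical: it attaches each combining mark to the preceding base, which
is only a rough stand-in for the default grapheme cluster rules and not a
conforming segmenter. Later revisions of those rules may segment this
sequence differently.

```python
import unicodedata

# "kshe" as four code points. The third is assumed to be U+0937
# (DEVANAGARI LETTER SSA), written informally as SHA in the text.
KSHE = "\u0915\u094D\u0937\u0947"


def simple_clusters(text):
    """Hypothetical helper: start a new cluster at each non-mark code point.

    A rough approximation only; it does not implement the full
    Unicode text segmentation rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # attach the combining mark to the preceding base
        else:
            clusters.append(ch)
    return clusters


code_point_names = [unicodedata.name(ch) for ch in KSHE]
clusters = simple_clusters(KSHE)

print(len(KSHE))         # number of code points
print(code_point_names)  # Unicode names of those code points
print(len(clusters))     # number of clusters under the simplified rule
```

Neither number says anything about the two-or-three letters a reader
perceives, or about the single glyph a renderer draws.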

> The argument for the former is that Unicode does specify a
> smallest component, a code point, and so far, the smallest
> component of a "character set" has been called "character". That
> is, a "character" is a "code point". This can also be seen as
> being a bit "cleaner", implementation-wise: A string consists of
> characters. We have data types for both. Contrast this to "a
> string consists of a number of substrings of length 1".

Languages (ranging from Basic to Q) that take the no-characters
perspective don't think a string *consists* of anything, any more
than 15 *consists* of 3 x 5, though that is its unique prime
factorization.  In Q, indeed, Character is a subtype of String,
which can be informally characterized as "strings with only one
codepoint".
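
Python happens to take the same no-characters view, so it can illustrate
the idea (this is not Q code, and is_character is a hypothetical name
introduced only for this sketch): indexing a string yields another string,
and "character" is just a predicate on strings.

```python
def is_character(value):
    """Hypothetical predicate: a 'character' is a string of exactly one code point."""
    return isinstance(value, str) and len(value) == 1


# Indexing a string does not produce a separate character type;
# the result is itself a string.
first = "string"[0]

print(type(first) is str)
print(is_character(first))
print(is_character("string"))
```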

> | Despite this complexity, most things that a literate human would
> | call a ``character'' can be represented by a single code point
> | in Unicode (though there may exist code-point sequences that
> | represent that same character). For example, Roman letters,
> | Cyrillic letters, Hebrew consonants, and most Chinese characters
> | fall into this category. Thus, the ``code point'' approximation
> | of ``character'' works well for many purposes. It is thus
> | appropriate to define Scheme characters as Unicode scalar values

Well, those examples are carefully chosen from among the fairly simple cases.
They gloss over, for example, the fact that Hindi is conceptualized
by people who read and write it as an alphabet, whereas the structurally
parallel Tamil is conceptualized as a syllabary (so KA + U is two
letters in Hindi, one in Tamil -- and two codepoints and one default
grapheme cluster and one glyph in both cases).  Also, while it's clear
that Hebrew consonants are within the Hebrew reader's notion of characters,
it's not so clear about Hebrew vowel signs, which traditionally --
except when writing the Bible -- are treated as optional assistants.
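
The KA + U comparison can be sketched the same way in Python. The function
count_simple_clusters is a hypothetical helper that counts every code point
that is not a combining mark; it only approximates default grapheme
clusters and assumes the text begins with a base letter.

```python
import unicodedata

HINDI_KU = "\u0915\u0941"  # DEVANAGARI LETTER KA + DEVANAGARI VOWEL SIGN U
TAMIL_KU = "\u0B95\u0BC1"  # TAMIL LETTER KA + TAMIL VOWEL SIGN U


def count_simple_clusters(text):
    """Hypothetical helper: count code points that are not combining marks."""
    return sum(
        1 for ch in text
        if not unicodedata.category(ch).startswith("M")
    )


for label, text in (("Hindi", HINDI_KU), ("Tamil", TAMIL_KU)):
    # Prints the label, the code point count, and the simplified cluster count.
    print(label, len(text), count_simple_clusters(text))
```

The two scripts come out identical by both measures, so neither count can
capture the difference in how readers of each script conceptualize the
sequence.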

It's not Unicode's *encoding* that makes it complicated; it's the
*repertoire*, which is complicated because the world of writing is
complicated.

--
Some people open all the Windows;       John Cowan
wise wives welcome the spring           xxxxxx@ccil.org
by moving the Unix.                     http://www.ccil.org/~cowan
  --ad for Unix Book Units (U.K.)
        (see http://cm.bell-labs.com/cm/cs/who/dmr/unix3image.gif)