Re: Why are byte ports "ports" as such?
John Cowan 23 May 2006 20:11 UTC
Jonathan S. Shapiro scripsit:
> The underlying issue within UNICODE is the existence of the so-called
> "combining characters". There exist characters that have no single
> defining codepoint. These exist primarily in Asian languages, for
> example in the form of multiple code points that together form a single
> "glyph".
In fact they are all over the place: you cannot write such a very
European language as Lithuanian, which uses the Latin script, without
employing them. (Well, you can write memos or to-do lists, but not
poetry or dictionaries.)
However, whether a "default grapheme cluster" (the Unicode name for a base
character together with its combining characters) is a "character" in the
non-technical sense depends on the culture. Is an "o" with a dot-above
accent and a macron accent a single "character"? Sure. How about a Hindi
consonant letter with associated vowel mark? Not at all: one sense of
"character" in Hindi covers consonants and vowels separately just as in
Latin, another sense is "run of consonants up to and including the next
vowel." What about Korean? Is a Hangul syllable one character or 2-3?
Depends on the context: sometimes one, sometimes the other.
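(To make the distinction concrete in code: the sketch below, in Java only
because its standard library happens to expose the relevant operations,
counts code points and default grapheme clusters for the two cases above.
It is purely illustrative, not tied to any Scheme proposal, and the class
name is arbitrary.)

    import java.text.BreakIterator;
    import java.text.Normalizer;

    public class ClusterCounts {
        // Count default grapheme clusters ("user-perceived characters")
        // with the character-instance BreakIterator.
        static int clusters(String s) {
            BreakIterator bi = BreakIterator.getCharacterInstance();
            bi.setText(s);
            int n = 0;
            while (bi.next() != BreakIterator.DONE) n++;
            return n;
        }

        public static void main(String[] args) {
            // "o" + combining dot above (U+0307) + combining macron (U+0304):
            // three code points, but one default grapheme cluster.
            String o = "o\u0307\u0304";
            System.out.println(o.codePointCount(0, o.length()) + " " + clusters(o));          // 3 1

            // Hangul syllable GA: precomposed it is one code point (U+AC00);
            // decomposed (NFD) it is two conjoining jamo (U+1100 U+1161),
            // yet still one default grapheme cluster.
            String jamo = Normalizer.normalize("\uAC00", Normalizer.Form.NFD);
            System.out.println(jamo.codePointCount(0, jamo.length()) + " " + clusters(jamo)); // 2 1
        }
    }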
"Character" is not a technical term in Unicode because it can't be; it
would have to match too many contradictory expectations. The Unicode
Glossary, which is not normative, says:
    Character. (1) The smallest component of written language that
    has semantic value; refers to the abstract meaning and/or shape,
    rather than a specific shape (see also glyph), though in code
    tables some form of visual representation is essential for the
    reader's understanding. (2) Synonym for abstract character
    [defined as "A unit of information used for the organization,
    control, or representation of textual data."]. (3) The basic
    unit of encoding for the Unicode character encoding. (4) The
    English name for the ideographic written elements of Chinese
    origin. (See ideograph(2).)
There *are* technical terms in Unicode, like code unit, code point,
default grapheme cluster, and so on. Which of these should be mapped
to a given programming culture's pre-existing concept of "characters"
is a question which Unicode by itself cannot answer. So far, C has gone
for the 8-bit code unit interpretation, Java for the 16-bit code unit
interpretation, and XML for the code point interpretation.
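(A purely illustrative sketch of how those three interpretations count the
same text differently, again in Java since it makes all three counts easy
to obtain; the class name is arbitrary:)

    import java.nio.charset.StandardCharsets;

    public class UnitCounts {
        public static void main(String[] args) {
            // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so the
            // three counts all differ for this one-"character" string.
            String s = "\uD834\uDD1E";   // the UTF-16 surrogate pair for U+1D11E
            System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 4: 8-bit code units (the C view)
            System.out.println(s.length());                                // 2: 16-bit code units (the Java view)
            System.out.println(s.codePointCount(0, s.length()));           // 1: code points (the XML view)
        }
    }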
(The Glossary is at http://www.unicode.org/glossary/ .)
--
Andrew Watt on Microsoft:                  John Cowan
Never in the field of human computing      xxxxxx@ccil.org
has so much been paid by so many           http://www.ccil.org/~cowan
to so few!  (pace Winston Churchill)