At Tue, 10 Feb 2004 13:06:28 -0800 (PST), Tom Lord wrote:
>
> There is an easy example of why such a category is desirable in
> computing.  Let's suppose that I'm going to specify the lexical syntax
> of identifiers in a programming language.  As part of that
> specification, I'll need to identify this category.  (For an example,
> see "Unicode Technical Report #31: Identifier and Pattern Syntax",
> http://www.unicode.org/reports/tr31/tr31-2.html)

We may want to take that report with a grain of salt for Scheme.  A
simpler approach would be to define Scheme identifiers as everything
_excluding_ the reserved punctuation characters, optionally allowing
Unicode variations on those characters and extending the definition of
whitespace.

Most Schemes already work in this manner, even though R5RS uses an
inclusive list.  In a quick check, the *only* Scheme I found that
doesn't let me enter and use arbitrary high-bit UTF-8 identifier names
is Kawa, regardless of the Scheme's internal encoding.

[checked Bigloo, Chez, Chicken, Gambit, Gauche, Guile, MIT Scheme,
MzScheme, SCM and SISC]

> In their wisdom (or absence of wisdom) the Unicode consortium chose a
> name for this category: they call these characters "letters".  That
> _is_ an overloading of the term "letter" -- but it is an overloading
> that pervades the Unicode specifications and data tables.  For
> example, every assigned Unicode codepoint has a property called "the
> major class of its General Category".  The class of alphabetic,
> syllabic, and ideographic characters has the major class "L" (short
> for "letter").

I apologize, I was mistaken.  I was mostly going off of the official
names of the characters, which consistently use "letter" only for
alphabets.  It seems strange to me to call an ideograph a letter, but
if Unicode officially uses that definition I'm not going to fight it.
Unicode also uses "alphabetic" to describe syllabic characters, and
does not provide any "syllabic" property.
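To make the "major class" point concrete, here is a minimal sketch in Python using the standard unicodedata module.  The helper names major_class and is_letter are made up for illustration, and Python's bundled Unicode data is newer than the 4.0 tables discussed here, so treat this as an approximation rather than a definitive check.

```python
import unicodedata


def major_class(ch: str) -> str:
    """Return the first letter of the General Category ('L' for 'Lo', 'Ll', ...)."""
    return unicodedata.category(ch)[0]


def is_letter(ch: str) -> bool:
    """True if the character's major class is 'L' (hypothetical helper)."""
    return major_class(ch) == "L"


# A Latin letter, a Hiragana syllable, a Han character,
# IDEOGRAPHIC NUMBER ZERO, and an opening parenthesis.
samples = ["a", "\u3042", "\u6728", "\u3007", "("]

for ch in samples:
    # Print the code point, the full General Category, and the letter test.
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} letter={is_letter(ch)}")
```

The Han character U+6728 and the syllabic U+3042 both land in major class "L", while U+3007 is a letter-like number (Nl) and the parenthesis is punctuation, so a test on the major class alone would not treat every "ideographic" character as a letter.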

> Alex also writes:
>
> > "Ideograph" applied to all Han characters is technically
> > incorrect.  Linguists prefer the term "sinogram" which refers to
> > Chinese-derived characters.  "Sinogram" fits all uses being
> > applied to the term "ideograph" in these discussions (at least
> > until Unicode adds hieroglyphs).  Since the usage of ideograph
> > is fairly ubiquitous, however, it may not be worth fighting it.
>
> I have an intellectual curiosity about why you say that "ideograph"
> is inaccurate.

There are four general classifications of Chinese characters (from
Kenneth Henshall's _A_Guide_To_Remembering_Japanese_Characters_):

  # cut&paste into utf-8 terminal for reference
  gosh -E'map(lambda(x y)(format #t"~A (~04X): ~A\n"(ucs->char x)x y))
    `(#x6728 #x5C71 #x99AC #x4E0A #x56DE #x5CE0 #x6CE8)
    `(tree mountain horse up around mountain-pass pour)' -Eexit

  1) Pictograph.  U+6728 and U+5C71 are simple stylized pictures of a
     tree and a mountain respectively.  Though these are simple, some
     pictographs are stylized beyond easy recognition, such as U+99AC
     (horse).

  2) Sign or Symbol.  U+4E0A is a symbol showing the direction up.
     U+56DE is a stylized form of two concentric circles meaning
     "around".

  3) Ideograph.  U+5CE0 shows a mountain on the left (the "radical")
     with the symbols up and down stacked on the right, leading to the
     idea of "mountain pass".

  4) Phonetic-Ideograph (or Semasio-Phonetic).  Something like 85% of
     all modern Chinese characters fall into this group.  U+6CE8 (pour)
     is made from the radical for water on the left, plus a character
     with the same sound as a character meaning continuous, thus
     continuous flow of water, a reference to pouring.

It's not always clear which category a character falls into, and this
is mostly of interest to historians anyway.  Unicode itself consistently
refers to all Chinese characters as ideographs, even though most of them
are much more complex, so I'm not even objecting to this term; I was
just nit-picking.
Also, the reference to this in section 11.1 of the Unicode standard is
the only place I've seen the term "sinogram" (most references use just
"Chinese character" or "Kanji").

> I do note that Han characters are not the only ideographic letters
> encoded in Unicode -- although I'm not sure there is a huge future in
> writing Scheme programs whose identifiers are spelled using the Linear
> B script :-)

Now this gets weird.  The Unicode standard consistently refers to the
Linear B characters as "ideograms," which has the same meaning as
"ideograph", but for no apparent reason it uses a different word.  And
they don't have the ideographic property:

  gosh> (any (cut char-set-contains? char-set:ideographic <>)
             (map integer->char (map (cut + #x10080 <>) (iota #x100))))
  #f

Indeed, the only characters with the ideographic property are the Han
characters (from PropList-4.0.0.txt):

------------------------------------------------------------------------
3006          ; Ideographic # Lo       IDEOGRAPHIC CLOSING MARK
3007          ; Ideographic # Nl       IDEOGRAPHIC NUMBER ZERO
3021..3029    ; Ideographic # Nl   [9] HANGZHOU NUMERAL ONE..HANGZHOU NUMERAL NINE
3038..303A    ; Ideographic # Nl   [3] HANGZHOU NUMERAL TEN..HANGZHOU NUMERAL THIRTY
3400..4DB5    ; Ideographic # Lo [6582] CJK UNIFIED IDEOGRAPH-3400..CJK UNIFIED IDEOGRAPH-4DB5
4E00..9FA5    ; Ideographic # Lo [20902] CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FA5
F900..FA2D    ; Ideographic # Lo [302] CJK COMPATIBILITY IDEOGRAPH-F900..CJK COMPATIBILITY IDEOGRAPH-FA2D
20000..2A6D6  ; Ideographic # Lo [42711] CJK UNIFIED IDEOGRAPH-20000..CJK UNIFIED IDEOGRAPH-2A6D6
2F800..2FA1D  ; Ideographic # Lo [542] CJK COMPATIBILITY IDEOGRAPH-2F800..CJK COMPATIBILITY IDEOGRAPH-2FA1D

# Total code points: 71053
------------------------------------------------------------------------

Perhaps we should consider this a bug in the Unicode specification?

--
Alex