Thread index:

  Unicode and Scheme -- Tom Lord (07 Feb 2004 22:33 UTC)
  Permitting and Supporting Extended Character Sets: response. -- bear (09 Feb 2004 05:03 UTC)
  Re: Permitting and Supporting Extended Character Sets: response. -- Tom Lord (09 Feb 2004 17:00 UTC)
  Re: Permitting and Supporting Extended Character Sets: response. -- bear (09 Feb 2004 20:42 UTC)
  Re: Permitting and Supporting Extended Character Sets: response. -- Tom Lord (09 Feb 2004 21:55 UTC)
  Re: Permitting and Supporting Extended Character Sets: response. -- bear (10 Feb 2004 00:23 UTC)
  Re: Permitting and Supporting Extended Character Sets: response. -- Tom Lord (10 Feb 2004 00:33 UTC)
  Re: Unicode and Scheme -- bear (09 Feb 2004 05:26 UTC)
  Re: Unicode and Scheme -- Tom Lord (09 Feb 2004 17:15 UTC)
  Re: Unicode and Scheme -- bear (09 Feb 2004 20:47 UTC)
> From: bear <xxxxxx@sonic.net>

> The result of case-mapping via char-ci=? only on cased characters is
> that distinct identifiers written using these characters remain
> distinct no matter what the preferred case of the implementation
> is.  That's the desirable, crucial property that I was trying to
> capture with the distinction between cased and uncased characters.

I don't see why that property is crucial.

Here's where I am on these things: please have a look at the
"References" section of the "Unicode Identifiers" draft.  The
consortium has made recommendations for case-insensitive programming
languages.  I think we should follow those, and I don't think they are
consistent with what you are advocating.

[re: symbol->string as applied to symbols returned by READ or by a
symbol literal in source text]

> Actually, the wording you proposed permits *any* character in the
> string to be in the non-preferred case, as long as the string contains
> one or more characters which are in the preferred case.

Not quite.  Cumulatively, the set of changes I've proposed:

1) require that for symbols with portable names (in which, therefore,
   all non-punctuation characters are latin letters), either all the
   letters are uppercase or they are all lowercase -- just as in R5RS.

2) are deliberately vague about the meaning of "preferred case" for
   extended characters.

In this sense, yes -- an implementation can include (extended)
characters in the opposite of the preferred case in the symbol name of
a symbol returned by READ.  I believe that this permissiveness is
essential for Unicode support unless we are to require implementations
to restrict the lexical syntax of identifiers far beyond the
restrictions recommended by Unicode.

> I chose the properties of characters I called "cased" and "uncased"
> carefully; the distinctions they make are necessary and sufficient to
> allow implementations to detect which characters can safely be
> regarded as cased characters in the normal sense,

I assume that you mean (Scheme) programs, not implementations.
Programs can already detect which characters are naively cased in the
sense of your terms.  That you are able to define your CHAR-CASED in a
few lines of R5RS illustrates that.  Your two procedures would be
useful if they were, in fact, needed for expository purposes elsewhere
in the spec -- but the two uses you put them to were both needless.

>> For example, a Unicode STRING->SYMBOL must not canonicalize its
>> argument (and STRING=? is a codepoint-wise comparison).

> No, string=? is, and should be, a character-wise comparison.  The only
> ways to make it into a codepoint-wise comparison are to make
> non-canonical strings inexpressible, or to expressly forbid conforming
> with the unicode consortium's requirement which says that if the
> combining codepoints _within_each_particular_combining_class_
> following a base character are the same codepoints in the same
> sequence, then the result "ought to be regarded as the same character"
> regardless of the sequence of individual codepoints.

Since you (elsewhere) agree that:

    (string=? s (symbol->string (string->symbol s)))   =>  #t

this disagreement is immaterial for the "Permitting and Supporting
Extended Character Sets" draft.  Please consider me to have written:

    For example, a Unicode STRING->SYMBOL _may_ wish to not
    canonicalize .....

and my point stands.  You'll want to take up this issue separately, in
response to "Scheme Characters as (Extended) Unicode Codepoints", I
think.
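To make the round-trip property above concrete, here is a minimal
sketch in plain R5RS (the name ROUND-TRIPS? is invented here purely
for illustration and is not part of any draft):

    ;; Sketch only: checks that a string survives STRING->SYMBOL /
    ;; SYMBOL->STRING unchanged, i.e. that STRING->SYMBOL did not
    ;; canonicalize or case-fold its argument.
    (define (round-trips? s)
      (string=? s (symbol->string (string->symbol s))))

Given that this holds, it makes no observable difference to the
"Permitting and Supporting Extended Character Sets" draft whether the
STRING=? used here compares codepoint-wise or "character-wise".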
> IOW, because Macron and Cedilla are in different combining
> classes, the sequences A, Macron, Cedilla and A, Cedilla, Macron
> ought to be regarded as equal in a string comparison.

Not by STRING=? in a Scheme in which strings are regarded as codepoint
sequences, since STRING=? is the equivalence relation induced by
CHAR=?.  And, incidentally, although that STRING=? is not the
linguistically sensitive string-equality relation that Unicode
defines, it _is_ a useful procedure to have around for _implementing_
Unicode text processes.

> The view of characters as multi-codepoint sequences is the only way I
> could find to comply with both this requirement in Unicode and the
> R5RS requirement of string=? as a character-by-character comparison.

A simpler way is to say that STRING=? and the Unicode equivalence
relationship you have in mind are separate procedures.  That some
Unicode text processes are defined in terms of a codepoint-wise
STRING=?-style comparison is one reason why I like the design in
"Scheme Characters as (Extended) Unicode Codepoints".

>> It's true that that is not how linguists use the term "alphabetic".
>> It's also true that not all "letters", in the sense of Unicode, are
>> alphabetic characters.  For example, ideographic characters are
>> categorized in Unicode as "letters"; syllabaries are classified as
>> "letters".

> True.  If we are to keep char-alphabetic?, then perhaps we ought
> also to propose the addition of char-ideographic? and char-syllabic?
> then char-letter? could be defined as

>     (define char-letter?
>       (lambda (c)
>         (or (char-alphabetic? c)
>             (char-ideographic? c)
>             (char-syllabic? c))))

> Under that system,

> 1 - all four terms would be consistent with the use linguists make of
>     them.
> 2 - the more general class of 'letter' would be available, and properly
>     defined, for unicode text processing.
> 3 - the usefulness of primitives for parsing source text and other
>     simple syntaxes would be preserved and expanded.

_IF_ it were possible to define CHAR-ALPHABETIC? in a way which was
both linguistically correct _and_ upwards compatible with R5RS, then
perhaps that would be almost a good idea.  I say "almost" because
CHAR-IDEOGRAPHIC? and CHAR-SYLLABIC? add bloat, and those plus
CHAR-ALPHABETIC? fail to be a complete enumeration of letter types....

But CHAR-ALPHABETIC? is just a botch.  It cannot be rescued.  All of
these character classes belong elsewhere, with different names -- in a
"Linguistic Text Processing" SRFI.

>> Worse, a linguistically proper definition of CHAR-ALPHABETIC?
>> would be upwards incompatible with R5RS which requires that
>> alphabetic characters have upper and lowercase forms (which are
>> themselves characters).

> I think that it's better and simpler to simply decouple that
> requirement from "alphabetic-ness" in R*RS.  The few things that
> depend on it can be explicitly defined only on "cased" characters,
> whatever you want to call them, with no damage to anything written
> using the portable character set.  But in that case we need a
> predicate to detect cased characters, and char-alphabetic? ain't
> it.

A predicate to detect "cased" characters can be trivially synthesized
from CHAR-UPCASE, CHAR-DOWNCASE, and CHAR=?.  I see no need for it to
be required by R6RS.

Breaking CHAR-ALPHABETIC? in the way that you propose will not break
correct portable programs whose _input_data_ consists only of portable
characters, but it can break correct portable programs whose input
data includes extended characters.  There is no particular reason to
introduce that breakage.
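For what it's worth, such a predicate might look like this in a few
lines of R5RS (a sketch only; the name CHAR-NAIVELY-CASED? is invented
here and is not being proposed for R6RS):

    ;; Sketch only: a character is "naively cased" if CHAR-UPCASE or
    ;; CHAR-DOWNCASE maps it to some other character.
    (define (char-naively-cased? c)
      (not (and (char=? c (char-upcase c))
                (char=? c (char-downcase c)))))

Portable programs that want the distinction can define something like
this for themselves, which is part of why I see no need to require it.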
>> Now, having said all of that, the definition of CHAR-ALPHABETIC?
>> could be improved: The possibility of non-alphabetic letters with
>> both upper and lowercase forms seems plausible to me (are there any
>> in Unicode already?)

> Some of the native american syllabaries ("CANADIAN SYLLABICS
> ..." in the Unicode standard) are frequently written using
> double-sized glyphs to start sentences; Unicode does not
> currently recognize these as separate characters, calling them a
> presentation form instead.

Technical disagreements aside, it's learning things like that (I'm
taking you at your word for now :-) that makes Unicode work so much
fun, isn't it?

Along those lines: one of the Pika hackers recently stumbled across
U+0F33 ("TIBETAN DIGIT HALF ZERO"), which is given the numeric value
"-1/2".  There's U+0F32 ("TIBETAN DIGIT HALF NINE"), given the numeric
value "17/2", and some other mysteries.  I'm looking forward to
learning the notational system these are from :-)

>> So, instead of that definition of CHAR-ALPHABETIC?
>> I would agree to:

>>     CHAR-ALPHABETIC? must be defined in such a way that
>>     this is true of all characters:

>>        (or (not (char-alphabetic? c))
>>            (and (char-letter? c)
>>                 (char-upper-case? (char-upcase c))
>>                 (char-lower-case? (char-downcase c))))
>>        => #t

> This is nonsense.  Hebrew characters are clearly alphabetic, but just
> as clearly have no concept of upper case and lower case.  The property
> you are looking for here is whether a character is cased, and using
> the word "alphabetic" to refer to that property will lead people astray.

No.  You are thinking that I am trying to make CHAR-ALPHABETIC?
linguistically useful.  What I'm actually trying to do is to minimize
the degree to which CHAR-ALPHABETIC? is linguistically useless.  The
invariant above is in that spirit.

The requirements in R5RS for CHAR-ALPHABETIC? already make it
linguistic nonsense.  There's no hope for it.  Deprecating it is the
best thing.

> Further, your definition does not capture the full range of what you
> need to express when checking for this property; characters such as
> dotless-i will be char-alphabetic? according to the definition above
> while still capable of causing bugs with char-ci=? and case-blind
> identifiers because they are not the preferred lowercase mappings of
> their own preferred uppercase mappings.

I'm following the letter of the (deprecated, stupid) law.  R5RS does
_not_ require, _even_for_ CHAR-ALPHABETIC? _characters_, that:

    (char=? (char-downcase c)
            (char-downcase (char-upcase c)))
    => #t

Amazing but true.

> All the latin alphabetic characters are included in the set of
> cased characters, just as they are included in the worldwide set
> of alphabetic characters.  What we are doing here is moving to a
> superset of the currently defined set, so there is no more
> upward compatibility issue in going to one superset than in
> going to another.  If the case requirements in R5RS are read as
> applying to _cased_ characters, then all code presently extant
> is conforming.  If the case requirements in R5RS are read as
> applying to _alphabetic_ characters, then all code presently
> extant is conforming.

There is no need to introduce the (linguistically random) notion of
"cased character".  With the invariant I gave for CHAR-ALPHABETIC?,
correct, portable R5RS programs remain so.

> I think the invariant you're trying to attach to char-alphabetic? does
> not belong there.  Past standards writers have been looking at
> [...]
Water under the bridge.  CHAR-ALPHABETIC? is broken.

R6RS should not attempt to provide comprehensive facilities for
Unicode text processing.  It should attempt to provide a minimum of
upward compatible character and string facilities which are a useful
_subset_ of Unicode text processing, close in informal meaning to what
the R5RS versions say.  My proposal does that.

> Past standards writers have been looking at a restricted set of
> characters in which all alphabetic characters were also cased,
> and they made a requirement which is appropriate only to cased
> characters, mistakenly calling the class of characters it should
> be applied to "alphabetic" because there were no counterexamples
> in the set of characters under consideration.  The requirement
> is valuable, and we should keep it, but we need to apply it to
> the set of characters to which it properly belongs, and simply
> accept the fact that not all alphabetic characters are cased.

The CHAR-ALPHABETIC? invariant that I gave is consistent with an
implementation that defines it for truly alphabetic characters that
are "cased" in the sense you have been using.  It's consistent with
R5RS.  It's a hopeless cause to try to require more from
CHAR-ALPHABETIC? than that, and deprecating CHAR-ALPHABETIC? is
necessary.

> I believe that

>     (char-ci=? #\I #\dotless-i)  =>  #f

> Because

>     (char=? (char-downcase #\I) #\dotless-i)  =>  #f.

It's interesting that you're advocating a behavior which is contrary
to Unicode recommendations _and_ not required by R5RS.

> If you have a system in which #\dotless-i and #\i are both treated as
> cased characters whose uppercase is #\I, then two identifiers, one
> written using a dotted lowercase i and one written using a dotless i,
> can be confused with one another in an implementation whose preferred
> case is uppercase.

This is a topic for discussion of the draft called "Unicode
Identifiers".  As I say in that draft, I need to do a bit more
investigation at the library, but I did look into this specific issue
when I wrote the draft.  _As_I_recall_ (so take it with a grain of
salt), the Unicode recommendations for case-insensitive
programming-language identifiers say that:

    I  and  i

are the same identifier, but that:

    <dotless i>

is a different identifier.  Go figure.

> #\dotless-i and #\I therefore ought not be regarded
> as char-ci=? in any system which also regards #\i and #\I as
> char-ci=?.

In my proposed revisions for R6RS, CHAR-CI=? has no relationship to
identifier equivalence _except_ for identifiers spelled using only
portable characters.  I don't think there is any other choice there
that is consistent with Unicode best practices.

> It is true that

>     (char=? (char-upcase #\dotless-i) (char-upcase #\i) #\I)  =>  #t,

> But given that you want to require

>     (char=? (char-downcase #\I) #\dotless-i)  =>  #f   and
>     (char=? (char-downcase #\I) #\i)          =>  #t,

> it is clearly unsupportable to choose #\dotless-i over #\i as the
> lower case character which is char-ci=? to #\I.

Yup.  And, indeed, that seems to be the recommendation from the
Unicode Consortium.

"Things should be as simple as possible ...."

-t

----
Like my work on GNU arch, Pika Scheme, and other technical
contributions to the public sphere?  Show your support!

https://www.paypal.com/xclick/business=lord%40emf.net&item_name=support+for+arch+and+other+free+software+efforts+by+tom+lord&no_note=1&tax=0&currency_code=USD

and xxxxxx@emf.net for www.moneybookers.com payments.