Unicode and Scheme
Tom Lord
(07 Feb 2004 22:33 UTC)
|
Permitting and Supporting Extended Character Sets: response.
bear
(09 Feb 2004 05:03 UTC)
|
Re: Permitting and Supporting Extended Character Sets: response.
Tom Lord
(09 Feb 2004 17:00 UTC)
|
Re: Permitting and Supporting Extended Character Sets: response.
bear
(09 Feb 2004 20:42 UTC)
|
Re: Permitting and Supporting Extended Character Sets: response.
Tom Lord
(09 Feb 2004 21:55 UTC)
|
Re: Permitting and Supporting Extended Character Sets: response.
bear
(10 Feb 2004 00:23 UTC)
|
Re: Permitting and Supporting Extended Character Sets: response.
Tom Lord
(10 Feb 2004 00:33 UTC)
|
Re: Unicode and Scheme
bear
(09 Feb 2004 05:26 UTC)
|
Re: Unicode and Scheme
Tom Lord
(09 Feb 2004 17:15 UTC)
|
Re: Unicode and Scheme
bear
(09 Feb 2004 20:47 UTC)
|
On Mon, 9 Feb 2004, Tom Lord wrote:

>A while back, tb argued that the case-mapping procedures of R5RS could
>simply be dropped.  There's something to that.

It's true, there's something.  They are, at the very least, library
rather than core procedures.  Still, the idea of dropping them is a
non-starter.

>Why do that?  I'm not convinced we should but the arguments for doing
>so would include:
>
>~ it would remove from R5RS all traces of the naive approach
>  to character case

This is not desirable, unless provided in a standard library.

>~ it would remove from R5RS the culturally biased character
>  class "alphabetic"

Likewise.

>~ it would evade the tricky problem of defining "numeric" usefully
>  yet without cultural bias

FWIW, I do not see how this would be achieved by the proposal, nor why
it could not otherwise be achieved.

>~ the basic structure of the CHAR? type, a well-ordered set isomorphic
>  to a subset of the integers, would be retained

Think carefully about your "subset."  It is not a subrange, because
there are many codepoints which are mapped to no character in Unicode
(ie, there are "gaps" in the set).  Nor is it (necessarily) finite.
If the subset is neither finite nor a subrange, there is very little
value in such an isomorphic mapping.

>Why not do it?
>
>~ pedagogical reasons -- for the portable character set, the
>  metacircularity procedures can be defined using the dropped
>  procedures
>
>~ practical reason -- it wouldn't leave enough standard machinery in
>  Scheme to parse simple formats like "whitespace separated fields"

Both of the above points boil down to this: the definitions embodied
by these procedures are used in other parts of the standard.  If they
were removed, the same information would need to be added to other
parts of the standard.

>~ practical reason -- implementors will want to provide all of the
>  procedures in the DROP column for years to come, at least.  Useful
>  libraries will continue to rely on them.  It is worthwhile to
>  (continue to) say what they should mean.

And of course, this is a trump as far as I'm concerned.

>But, on to the proposed revisions to the proposed revisions to the
>revised^5 specification:
>
> >> It should say:
>
> >> Returns the name of symbol as a string.  [...] will be in the
> >> implementation's preferred standard case [...]
> >> will prefer upper case, others lower case.  If the symbol was
> >> returned by string->symbol, [....] string=? to
> >> the string that was passed to string->symbol.  [....]
>
> > I would propose instead:
>
> > Returns the name of symbol as a string.  [...] all cased
> > characters in the identifier (see the definition of char-cased?
> > for a precise definition of cased and uncased characters) will
> > be in the implementation's preferred standard case [....].  If
> > the symbol was returned by string->symbol, the case of the
> > characters in the string returned will be the same as the case
> > in the string that was passed to string->symbol.  [....]
>
> > Rationale: I think it's simply clearer.  The above wording
> > specifically permits uncased characters (ie, characters which do not
> > conform to "normal" expectations of cased characters) to be present
> > in lowercase in identifiers even if the preferred case is uppercase,
> > and presumably vice versa.
>
>Huh.  I thought that my wording permitted that already.  I mostly
>dislike your wording.
Actually, the wording you proposed permits *any* character in the
string to be in the non-preferred case, as long as the string contains
one or more characters which are in the preferred case.

>This part:
>
> > all cased characters in the identifier [...] will be in the
> > implementation's preferred standard case
>
>seems too strong to me.  I'd be willing to accept it if (a) we nail a
>good STRING->SYMBOL-NAME definition for the "Unicode Identifiers"
>draft; (b) prove that the property you named is true for that
>STRING->SYMBOL-NAME and for all future versions of Unicode.

I chose the properties of characters I called "cased" and "uncased"
carefully; the distinctions they make are necessary and sufficient to
allow implementations to detect which characters can safely be
regarded as cased characters in the normal sense, and also to admit
character sets which include linguistically uppercase or lowercase
characters which do not have proper case mappings.  If proper case
mappings for such characters are added to the character set, they
become "cased," and identifiers containing them can change form to
comply with the implementation's preferred case; but this does not
happen at the risk of losing distinctions between distinct
identifiers.

>This part:
>
> > If the symbol was returned by string->symbol, the case of the
> > characters in the string returned will be the same as the case
> > in the string that was passed to string->symbol.
>
>is too weak.  The two strings must be STRING=?.

I think I'll agree with this point.  The strings must, in fact, be
string=?.

> For example, a
>Unicode STRING->SYMBOL must not canonicalize its argument (and
>STRING=? is a codepoint-wise comparison).

No, string=? is, and should be, a character-wise comparison.  The only
ways to make it into a codepoint-wise comparison are to make
non-canonical strings inexpressible, or to expressly forbid conforming
with the Unicode Consortium's requirement which says that if the
combining codepoints _within_each_particular_combining_class_
following a base character are the same codepoints in the same
sequence, then the result "ought to be regarded as the same character"
regardless of the sequence of individual codepoints.  IOW, because
Macron and Cedilla are in different combining classes, the sequences
A, Macron, Cedilla and A, Cedilla, Macron ought to be regarded as
equal in a string comparison.

The view of characters as multi-codepoint sequences is the only way I
could find to comply with both this requirement in Unicode and the
R5RS requirement that string=? be a character-by-character comparison.
Further, both requirements are useful.

> > This is not how linguists use the term "alphabetic."  Please do
> > not propose "alphabetic" as a procedure to use to mean this, as
> > it will frustrate and confuse people.
>
>It's true that that is not how linguists use the term "alphabetic".
>
>It's also true that not all "letters", in the sense of Unicode, are
>alphabetic characters.  For example, ideographic characters are
>categorized in Unicode as "letters"; syllabaries are classified as
>"letters".

True.  If we are to keep char-alphabetic?, then perhaps we ought also
to propose the addition of char-ideographic? and char-syllabic?;
char-letter? could then be defined as

  (define char-letter?
    (lambda (c)
      (or (char-alphabetic? c)
          (char-ideographic? c)
          (char-syllabic? c))))

Under that system:

1 - all four terms would be consistent with the use linguists make of
    them.
2 - the more general class of 'letter' would be available, and
    properly defined, for Unicode text processing.

3 - the usefulness of primitives for parsing source text and other
    simple syntaxes would be preserved and expanded.

>Worse, a linguistically proper definition of CHAR-ALPHABETIC? would be
>upwards incompatible with R5RS which requires that alphabetic
>characters have upper and lowercase forms (which are themselves
>characters).

I think that it's better and simpler to just decouple that requirement
from "alphabetic-ness" in R*RS.  The few things that depend on it can
be explicitly defined only on "cased" characters, whatever you want to
call them, with no damage to anything written using the portable
character set.  But in that case we need a predicate to detect cased
characters, and char-alphabetic? ain't it.

>Now, having said all of that, the definition of CHAR-ALPHABETIC? could
>be improved:  The possibility of non-alphabetic letters with both
>upper and lowercase forms seems plausible to me (are there any in
>Unicode already?)

Some of the Native American syllabaries ("CANADIAN SYLLABICS ..." in
the Unicode standard) are frequently written using double-sized glyphs
to start sentences; Unicode does not currently recognize these as
separate characters, calling them a presentation form instead.

> So, instead of that definition of CHAR-ALPHABETIC?
>I would agree to:
>
>    CHAR-ALPHABETIC? must be defined in such a way that
>    this is true of all characters:
>
>      (or (not (char-alphabetic? c))
>          (and (char-letter? c)
>               (char-upper-case? (char-upcase c))
>               (char-lower-case? (char-downcase c))))
>      => #t

This is nonsense.  Hebrew characters are clearly alphabetic, but just
as clearly have no concept of upper case and lower case.  The property
you are looking for here is whether a character is cased, and using
the word "alphabetic" to refer to that property will lead people
astray.

Further, your definition does not capture the full range of what you
need to express when checking for this property; characters such as
dotless-i will be char-alphabetic? according to the definition above
while still capable of causing bugs with char-ci=? and case-blind
identifiers, because they are not the preferred lowercase mappings of
their own preferred uppercase mappings.

All the Latin alphabetic characters are included in the set of cased
characters, just as they are included in the worldwide set of
alphabetic characters.  What we are doing here is moving to a superset
of the currently defined set, so there is no more upward-compatibility
issue in going to one superset than in going to another.  If the case
requirements in R5RS are read as applying to _cased_ characters, then
all code presently extant is conforming.  If the case requirements in
R5RS are read as applying to _alphabetic_ characters, then all code
presently extant is conforming.

>That's consistent with my proposed revisions.  I think CHAR-LETTER?
>ought to be added and CHAR-ALPHABETIC? either dropped entirely or
>mentioned as deprecated.  If it is mentioned as deprecated, the
>invariant shown above should be stated here.  The corresponding
>sentence in the definition of CHAR-UPCASE and CHAR-DOWNCASE should be
>dropped.

I think the invariant you're trying to attach to char-alphabetic? does
not belong there.
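To make that property concrete, here is a rough sketch of the kind of
test I have in mind -- illustrative only, and assuming that char-upcase
and char-downcase give the preferred simple Unicode case mappings and
that char-upper-case? and char-lower-case? follow the Unicode
categories:

  ;; Sketch only: a character is "cased" when it has a proper case
  ;; mapping and is stable under a round trip through the mappings.
  (define (char-cased? c)
    (or (and (char-lower-case? c)
             (char-upper-case? (char-upcase c))
             (char=? c (char-downcase (char-upcase c))))
        (and (char-upper-case? c)
             (char-lower-case? (char-downcase c))
             (char=? c (char-upcase (char-downcase c))))))

  (define (char-uncased? c)
    (not (char-cased? c)))

Under a test like this, #\a and #\A come out cased; Hebrew letters and
digits come out uncased; and #\dotless-i comes out uncased, because
(char-downcase (char-upcase #\dotless-i)) is #\i rather than
#\dotless-i.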
Past standards writers have been looking at a restricted set of
characters in which all alphabetic characters were also cased, and
they made a requirement which is appropriate only to cased characters,
mistakenly calling the class of characters it should apply to
"alphabetic" because there were no counterexamples in the set of
characters under consideration.  The requirement is valuable, and we
should keep it, but we need to apply it to the set of characters to
which it properly belongs, and simply accept the fact that not all
alphabetic characters are cased.

> > char-cased?
> > char-uncased?
>
> > Char-cased? returns #t if its argument is a character which conforms
> > to "normal" case expectations (see below), and #f otherwise.  [....]
>
> > Rationale: This allows char-lower-case?, char-upper-case?, and
> > char-alphabetic? to go on meaning the same thing with respect to the
> > 96-character portable character set and meaning the same thing
> > linguists mean when they use these terms.  This will reduce confusion
> > in the long run.  This particular notion of cased and uncased
> > characters is also useful in other parts of the standard for saying
> > exactly which characters case requirements should apply to.  It leaves
> > implementors free to not sweat about what to do with identifiers
> > containing eszett, regardless of what they do with calls to
> > (char-upcase #\eszett).
>
>Among the rationales: I think this one is false (see above):
>
> > This particular notion of cased and uncased characters is also
> > useful in other parts of the standard for saying exactly which
> > characters case requirements should apply to.
>
>The other rationales are good reasons to say _something_ but I
>don't think two new procedures are needed.  Instead, the possibility
>of oddly-cased characters can be explicitly mentioned in the
>definitions of CHAR-LOWER-CASE?, CHAR-UPPER-CASE?, and CHAR-LETTER?.

I think it is, in fact, vital.  These two predicates precisely capture
the set of characters to which the case relationships in R5RS can be
meaningfully applied.  They are necessary and sufficient for the
normal meanings of char-ci=?, string-ci=?, etc, to apply, and they
correctly capture the properties needed for case insensitivity in
identifiers.

>(Additionally, CASED and UNCASED seem like poor names for the classes
>of characters they describe.)

I'm not terribly attached to the names.  Feel free to suggest
alternatives.

> >> [....] char-upcase must map a..z to A..Z and
> >> char-downcase must map A..Z to a..z.
>
> > I would propose instead:
>
> > [...] if char is alphabetic and cased, then the result of
> > char-upcase is upper case and the result of char-downcase is
> > lower case.
>
>I'm not sure I see any value to the stronger requirement, especially
>since CHAR-ALPHABETIC? should be deprecated and there is otherwise no
>need to introduce the concept of a "cased" character.

I think I'm with you on this; alphabetic-ness isn't the important
property here.  It should probably read:

  [...] if char is cased, then the result of char-upcase is upper
  case and the result of char-downcase is lower case.

> >> The introduction to strings [....] should say:
>
> >> Some of the procedures that operate on strings ignore the difference
> >> between strings in which upper and lower case variants of the same
> >> character occur in corresponding positions.  The versions that ignore
> >> case have ``-ci'' (for ``case insensitive'') embedded in their
> >> names.
> > I would propose instead:
>
> > Some of the procedures that operate on strings ignore the difference
> > between upper and lower case cased characters.  The versions that
> > ignore case in cased characters have ``-ci'' (for ``case
> > insensitive'') embedded in their names.
>
>I believe that this should be true:
>
>    (char=? #\dotless-i #\U+0131) => #t
>    (char-ci=? #\I #\dotless-i) => #t
>
>and that STRING-CI=? is just the string equivalence induced by
>CHAR-CI=?.

I believe that

  (char-ci=? #\I #\dotless-i) => #f

because

  (char=? (char-downcase #\I) #\dotless-i) => #f.

>However, #\dotless-i is not "cased" as you have defined it.  Are you
>saying that #\dotless-i and #\I are not CHAR-CI=? or that STRING-CI=?
>is not the equivalence induced by CHAR-CI=??  Either way: why in the
>world do that?

Dotless-i is not cased because it is not stable under case mappings.
It is not the preferred lowercase form of its own preferred uppercase
form:

  (char=? #\dotless-i (char-downcase (char-upcase #\dotless-i))) => #f

If you have a system in which #\dotless-i and #\i are both treated as
cased characters whose uppercase is #\I, then two identifiers, one
written using a dotted lowercase i and one written using a dotless i,
can be confused with one another in an implementation whose preferred
case is uppercase.  #\dotless-i and #\I therefore ought not be
regarded as char-ci=? in any system which also regards #\i and #\I as
char-ci=?.

It is true that

  (char=? (char-upcase #\dotless-i) (char-upcase #\i) #\I) => #t,

but given that you want to require

  (char=? (char-downcase #\I) #\dotless-i) => #f

and

  (char=? (char-downcase #\I) #\i) => #t,

it is clearly unsupportable to choose #\dotless-i over #\i as the
lower case character which is char-ci=? to #\I.

Note: I think that all Scheme code should be read and written using
some standard locale like the "C locale" for portability, and I think
that it should be a locale in which (char-ci=? #\i #\I) => #t.  It is
possible, however, that in some locales #\dotless-i would be a cased
character, because it would be the preferred lowercase form of its own
preferred uppercase form.  In those locales, however, #\i would be a
cased character if and only if it had the same reciprocal relationship
with a _different_ upper case character, most likely U+0130 LATIN
CAPITAL LETTER I WITH DOT ABOVE.

The result of case-mapping via char-ci=? only on cased characters is
that distinct identifiers written using these characters remain
distinct no matter what the preferred case of the implementation is.
That's the desirable, crucial property that I was trying to capture
with the distinction between cased and uncased characters.

Bear
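P.S.  For concreteness, here is a rough sketch of the kind of
char-ci=? I have in mind, folding case only for cased characters.  It
is illustrative only: it assumes the char-cased? predicate sketched
above, and the helper name char-fold-case is just for exposition.

  ;; Sketch: fold case only for cased characters, so that uncased
  ;; characters (dotless-i, Hebrew letters, digits, ...) compare
  ;; equal only under char=?.
  (define (char-fold-case c)
    (if (char-cased? c) (char-downcase c) c))

  (define (char-ci=? a b)
    (char=? (char-fold-case a) (char-fold-case b)))

With such a definition (char-ci=? #\i #\I) => #t while
(char-ci=? #\I #\dotless-i) => #f, which is the distinctness property
described above.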