Unicode and Scheme
Tom Lord
(07 Feb 2004 22:33 UTC)
|
Permitting and Supporting Extended Character Sets: response.
bear
(09 Feb 2004 05:03 UTC)
|
Re: Permitting and Supporting Extended Character Sets: response. Tom Lord (09 Feb 2004 17:00 UTC)
|
Re: Permitting and Supporting Extended Character Sets: response.
bear
(09 Feb 2004 20:42 UTC)
|
Re: Permitting and Supporting Extended Character Sets: response.
Tom Lord
(09 Feb 2004 21:55 UTC)
|
Re: Permitting and Supporting Extended Character Sets: response.
bear
(10 Feb 2004 00:23 UTC)
|
Re: Permitting and Supporting Extended Character Sets: response.
Tom Lord
(10 Feb 2004 00:33 UTC)
|
Re: Unicode and Scheme
bear
(09 Feb 2004 05:26 UTC)
|
Re: Unicode and Scheme
Tom Lord
(09 Feb 2004 17:15 UTC)
|
Re: Unicode and Scheme
bear
(09 Feb 2004 20:47 UTC)
|
> From: bear <xxxxxx@sonic.net>

I'll mostly answer your points in order but the last one is the most
interesting:

> I think that if we have the new procedures char-cased? and
> char-uncased? we do not need the proposed char-letter?
> predicate.

(I argue below that your definition of "cased" characters is
problematic but that's not the main point here.)

A while back, tb argued that the case-mapping procedures of R5RS could
simply be dropped. There's something to that. In fact, R6RS could go
further -- it could:

  DROP:                   RETAIN:
  (case and classes)      (type, order, integer isomorphism)

                          char?
  char-ci=?               char=?
  char-ci<?               char<?
  char-ci>?               char>?
  char-ci<=?              char<=?
  char-ci>=?              char>=?
  char-alphabetic?        char->integer
  char-numeric?           integer->char
  char-whitespace?
  char-upper-case?
  char-lower-case?
  char-upcase
  char-downcase
  string-ci=?
  string-ci<?
  string-ci>?
  string-ci<=?
  string-ci>=?

  ADD:
  (metacircularity procedures)

  char-delimiter?
  string->character
  string->string
  string->symbol-name
  form-identifier

Why do that? I'm not convinced we should but the arguments for doing
so would include:

~ it would remove from R5RS all traces of the naive approach to
  character case

~ it would remove from R5RS the culturally biased character class
  "alphabetic"

~ it would evade the tricky problem of defining "numeric" usefully
  yet without cultural bias

~ those changes would leave only the class CHAR-WHITESPACE? which
  seems particularly odd in isolation

~ the ability to write metacircular programs would still be present
  -- and improved

~ the basic structure of the CHAR? type, a well-ordered set
  isomorphic to a subset of the integers, would be retained

Why not do it?

~ pedagogical reasons -- for the portable character set, the
  metacircularity procedures can be defined using the dropped
  procedures

~ practical reason -- it wouldn't leave enough standard machinery in
  Scheme to parse simple formats like "whitespace separated fields"

~ practical reason -- implementors will want to provide all of the
  procedures in the DROP column for years to come, at least. Useful
  libraries will continue to rely on them. It is worthwhile to
  (continue to) say what they should mean.

But, on to the proposed revisions to the proposed revisions to the
revised^5 specification:

>> It should say:

>> Returns the name of symbol as a string. [...] will be in the
>> implementation's preferred standard case [...]
>> will prefer upper case, others lower case. If the symbol was
>> returned by string->symbol, [....] string=? to
>> the string that was passed to string->symbol. [....]

> I would propose instead:

> Returns the name of symbol as a string. [...] all cased
> characters in the identifier (see the definition of char-cased?
> for a precise definition of cased and uncased characters) will
> be in the implementation's preferred standard case [....]. If
> the symbol was returned by string->symbol, the case of the
> characters in the string returned will be the same as the case
> in the string that was passed to string->symbol. [....]

> Rationale; I think it's simply clearer. The above wording
> specifically permits uncased characters (ie, characters which do not
> conform to "normal" expectations of cased characters) to be present
> in lowercase in identifiers even if the preferred case is uppercase,
> and presumably vice versa.

Huh. I thought that my wording permitted that already.

I mostly dislike your wording. This part:

> all cased characters in the identifier [...] will be in the
> implementation's preferred standard case

seems too strong to me.
I'd be willing to accept it if (a) we nail down a good
STRING->SYMBOL-NAME definition for the "Unicode Identifiers" draft;
(b) we prove that the property you named is true for that
STRING->SYMBOL-NAME and for all future versions of Unicode.

This part:

> If the symbol was returned by string->symbol, the case of the
> characters in the string returned will be the same as the case
> in the string that was passed to string->symbol.

is too weak. The two strings must be STRING=?. For example, a Unicode
STRING->SYMBOL must not canonicalize its argument (and STRING=? is a
codepoint-wise comparison).

>> With regard to character class predicates such as char-alphabetic?
>> [...]
>> The procedure char-alphabetic? is deprecated. New programs should
>> usually use char-letter? (see below) instead. char-alphabetic? has a
>> precise definition in terms of char-letter?:

>> (define (char-alphabetic? c)
>>   (and (char-letter? c)
>>        (char-upper-case? (char-upcase c))
>>        (char-lower-case? (char-downcase c))))

> This is not how linguists use the term "alphabetic." Please do
> not propose "alphabetic" as a procedure to use to mean this, as
> it will frustrate and confuse people.

It's true that that is not how linguists use the term "alphabetic".

It's also true that not all "letters", in the sense of Unicode, are
alphabetic characters. For example, ideographic characters are
categorized in Unicode as "letters"; syllabaries are classified as
"letters".

In a Unicode implementation, a linguistic definition of
CHAR-ALPHABETIC? would be a subset of letters generally and would
include both characters which are not cased (U+13A0 ("CHEROKEE LETTER
A")) and characters with no single-character case-mappings (U+00DF
("LATIN SMALL LETTER SHARP S")). That would, in some sense, be an
interesting procedure to have around -- but really it belongs in a
general library for linguistic text processing (along with many other
procedures).

Worse, a linguistically proper definition of CHAR-ALPHABETIC?
would be upwards incompatible with R5RS which requires that alphabetic
characters have upper and lowercase forms (which are themselves
characters).

When thinking about how to handle this situation, I reasoned this way:

1) One use for the R5RS character classes is to write programs which
   process s-expressions (e.g. source text) over the portable
   character set. This use should be preserved.

2) Another use for the R5RS character classes is to write programs
   which parse other simple kinds of syntax. For example, parsing a
   line of text into white-space separated fields. This use should be
   preserved and expanded. For example, CHAR-LETTER? allows for a
   field of letters which are not alphabetic characters or which are
   alphabetic but not case-mapped in the naive way.

3) The R5RS character classes have never been well suited for
   linguistic processing over anything but the portable character
   set. Their use for such purposes for extended characters is
   unrealistic.

4) Upward compatibility with R5RS is desirable.

5) The specifications for the character classes defined in R6RS
   should be consistent with definitions that satisfy the usual
   expectations of a Unicode programmer. In other words, in a
   Unicode-based implementation, these procedures should function as
   a useful subset of a comprehensive library for Unicode text
   processing.

So, I proposed: adding CHAR-LETTER? which is (consistent with being)
the generalization of CHAR-ALPHABETIC? to all "letters" (in the
Unicode sense); deprecating CHAR-ALPHABETIC? (which is esoteric at
best, nonsense at worst); and defining the class of CHAR-ALPHABETIC?
characters to be the largest subset of CHAR-LETTER? which is
consistent with the R5RS definition.

Now, having said all of that, the definition of CHAR-ALPHABETIC? could
be improved:

The possibility of non-alphabetic letters with both upper and
lowercase forms seems plausible to me (are there any in Unicode
already?).

So, instead of that definition of CHAR-ALPHABETIC?
I would agree to:

  CHAR-ALPHABETIC? must be defined in such a way that this is true of
  all characters:

    (or (not (char-alphabetic? c))
        (and (char-letter? c)
             (char-upper-case? (char-upcase c))
             (char-lower-case? (char-downcase c))))
    => #t

  Note: this requirement is necessary for a combination of upward
  compatibility with earlier versions of the Revised Report and
  consistency with the new CHAR-LETTER?, yet it is also linguistically
  undesirable. This is the reason that CHAR-ALPHABETIC? is described
  as "deprecated" -- new programs should avoid using this procedure
  and should, in most cases, use CHAR-LETTER? instead.

  Programmers should be aware that the class CHAR-LETTER? may include
  letters such as syllables and ideographs which are not, in any
  sense, "alphabetic". It can also include alphabetic characters
  which are neither upper nor lowercase, lowercase letters with no
  uppercase form, uppercase letters with no lowercase form, lowercase
  characters which are not returned by CHAR-DOWNCASE of their
  CHAR-UPCASE mapping, and uppercase characters which are not returned
  by CHAR-UPCASE of their CHAR-DOWNCASE mapping.

  Programmers should also be aware that in some situations, a string
  may contain a letter followed by non-letters -- the sequence being
  "what a user would think of as a single letter" -- a fact which
  limits the utility of even CHAR-LETTER? unless additional facilities
  for text processing are provided by an implementation.

  Yet at the same time, for the portable character set and for many
  extended characters, none of these peculiar circumstances apply --
  programmers not trying to write "fully general" text processing
  algorithms can often ignore these complexities. Programmers wanting
  to write "fully general" text algorithms, on the other hand, can
  define additional procedures which complement the standard character
  classes.

> char-alphabetic?
> char-numeric?
> char-whitespace?
> char-upper-case?
> char-lower-case?
> These procedures return #t if their arguments are alphabetic,
> numeric, whitespace, uppercase, or lowercase characters, respectively.
> Otherwise they return #f. The characters a..z and A..Z are required to
> be alphabetic. The digits 0..9 must be numeric. The space, newline, and
> tab characters must be whitespace. The characters a..z are required to
> be lowercase. The characters A..Z are required to be uppercase. No
> character may be both uppercase and lowercase.

That's consistent with my proposed revisions. I think CHAR-LETTER?
ought to be added and CHAR-ALPHABETIC? either dropped entirely or
mentioned as deprecated. If it is mentioned as deprecated, the
invariant shown above should be stated here. The corresponding
sentence in the definition of CHAR-UPCASE and CHAR-DOWNCASE should be
dropped.

> char-cased?
> char-uncased?

> Char-cased? returns #t if its argument is a character which conforms to
> "normal" case expectations, (see below) and #f otherwise. [....]

> Rationale: This allows char-lower-case?, char-upper-case?, and
> char-alphabetic? to go on meaning the same thing with respect to the
> 96-character portable character set and meaning the same thing
> linguists mean when they use these terms. This will reduce confusion
> in the long run. This particular notion of cased and uncased
> characters is also useful in other parts of the standard for saying
> exactly which characters case requirements should apply to. It leaves
> implementors free to not sweat about what to do with identifiers
> containing eszett, regardless of what they do with calls to
> (char-upcase #\eszett).

Among the rationales: I think this one is false (see above):

> This particular notion of cased and uncased characters is also
> useful in other parts of the standard for saying exactly which
> characters case requirements should apply to.

The other rationales are good reasons to say _something_ but I don't
think two new procedures are needed.
Instead, the possibility of oddly-cased characters can be explicitly
mentioned in the definitions of CHAR-LOWER-CASE?, CHAR-UPPER-CASE?,
and CHAR-LETTER?. (Additionally, CASED and UNCASED seem like poor
names for the classes of characters they describe.)

>> With regard to [...] char-upcase and char-upcase
>> It should say

>> [....] char-upcase must map a..z to A..Z and
>> char-downcase must map A..Z to a..z.

> I would propose instead:

> [...] if char is alphabetic and cased, then the result of
> char-upcase is upper case and the result of char-downcase is
> lower case.

I'm not sure I see any value to the stronger requirement, especially
since CHAR-ALPHABETIC? should be deprecated and there is otherwise no
need to introduce the concept of a "cased" character.

Your alternative is implied by the definition of CHAR-ALPHABETIC? I
gave in the draft -- but you've earlier convinced me to weaken that
definition.

>> The introduction to strings [....] should say:

>> Some of the procedures that operate on strings ignore the difference
>> between strings in which upper and lower case variants of the same
>> character occur in corresponding positions. The versions that ignore
>> case have ``-ci'' (for ``case insensitive'') embedded in their
>> names.

> I would propose instead:

> Some of the procedures that operate on strings ignore the difference
> between upper and lower case cased characters. The versions that
> ignore case in cased characters have ``-ci'' (for ``case
> insensitive'') embedded in their names.

I believe that this should be true:

    (char=? #\dotless-i #\U+0131) => #t
    (char-ci=? #\I #\dotless-i) => #t

and that STRING-CI=? is just the string equivalence induced by
CHAR-CI=?.

However, #\dotless-i is not "cased" as you have defined it.

Are you saying that #\dotless-i and #\I are not CHAR-CI=?, or that
STRING-CI=? is not the equivalence induced by CHAR-CI=??

Either way: why in the world do that?
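
Two of the properties discussed above can be checked mechanically
over the portable character set. The following is a minimal sketch in
plain R5RS Scheme, not a definitive implementation. CHAR-LETTER? is
only proposed in this thread, so the stand-in below (which simply
reuses CHAR-ALPHABETIC?) is an assumption that is reasonable at most
for the portable character set. The names ALPHABETIC-INVARIANT-HOLDS?,
STRING-EVERY?, PORTABLE-SAMPLE and INDUCED-STRING-CI=? are
hypothetical helpers introduced only for illustration.

```scheme
;; Hypothetical stand-in: CHAR-LETTER? is only proposed in this thread
;; and is not part of R5RS. For the portable character set it is
;; assumed here to agree with CHAR-ALPHABETIC?.
(define (char-letter? c)
  (char-alphabetic? c))

;; The invariant required of CHAR-ALPHABETIC? in the message above.
(define (alphabetic-invariant-holds? c)
  (or (not (char-alphabetic? c))
      (and (char-letter? c)
           (char-upper-case? (char-upcase c))
           (char-lower-case? (char-downcase c)))))

;; #t if PRED holds for every character of the string S.
(define (string-every? pred s)
  (let loop ((i 0))
    (or (= i (string-length s))
        (and (pred (string-ref s i))
             (loop (+ i 1))))))

;; A sample of the portable character set: letters, digits, space and
;; some punctuation.
(define portable-sample
  "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 !$%&*/:<=>?^_~+-.@")

;; The string equivalence induced by CHAR-CI=?: equal lengths, and
;; CHAR-CI=? at every corresponding position.
(define (induced-string-ci=? s1 s2)
  (let ((n (string-length s1)))
    (and (= n (string-length s2))
         (let loop ((i 0))
           (or (= i n)
               (and (char-ci=? (string-ref s1 i) (string-ref s2 i))
                    (loop (+ i 1))))))))

(string-every? alphabetic-invariant-holds? portable-sample) ; => #t
(induced-string-ci=? "Scheme" "SCHEME")                     ; => #t
(induced-string-ci=? "Scheme" "Schema")                     ; => #f
(induced-string-ci=? "abc" "abcd")                          ; => #f
```

The sketch deliberately stays inside the portable character set, where
the naive case mappings are unambiguous. It says nothing about
extended characters such as #\dotless-i or eszett, which are exactly
the cases where an implementation's CHAR-CI=? and a real CHAR-LETTER?
would have to make the choices debated in this thread.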

-t

----

Like my work on GNU arch, Pika Scheme, and other technical
contributions to the public sphere? Show your support!

https://www.paypal.com/xclick/business=lord%40emf.net&item_name=support+for+arch+and+other+free+software+efforts+by+tom+lord&no_note=1&tax=0&currency_code=USD

and

xxxxxx@emf.net for www.moneybookers.com payments.