Unicode and Scheme
Tom Lord
(07 Feb 2004 22:33 UTC)
Re: Unicode and Scheme
bear
(09 Feb 2004 05:26 UTC)
On Sat, 7 Feb 2004, Tom Lord wrote:

> (in "Permitting and Supporting Extended Character Sets":)
>
> The specification of symbol->string says:
>
>   Returns the name of symbol as a string. If the symbol was part of an
>   object returned as the value of a literal expression (section 4.1.2)
>   or by a call to the read procedure, and its name contains alphabetic
>   characters, then the string returned will contain characters in the
>   implementation's preferred standard case -- some implementations
>   will prefer upper case, others lower case. If the symbol was
>   returned by string->symbol, the case of characters in the string
>   returned will be the same as the case in the string that was passed
>   to string->symbol. It is an error to apply mutation procedures like
>   string-set! to strings returned by this procedure.
>
> It should say:
>
>   Returns the name of symbol as a string. If the symbol was part of an
>   object returned as the value of a literal expression (section 4.1.2)
>   or by a call to the read procedure, its name will be in the
>   implementation's preferred standard case -- some implementations
>   will prefer upper case, others lower case. If the symbol was
>   returned by string->symbol, the string returned will be string=? to
>   the string that was passed to string->symbol. It is an error to
>   apply mutation procedures like string-set! to strings returned by
>   this procedure.

I would propose instead:

  Returns the name of symbol as a string. If the symbol was part of an
  object returned as the value of a literal expression (section 4.1.2)
  or by a call to the read procedure, then all cased characters in the
  identifier (see the definition of char-cased? for a precise
  definition of cased and uncased characters) will be in the
  implementation's preferred standard case -- some implementations
  will prefer upper case, others lower case.
  If the symbol was returned by string->symbol, the case of the
  characters in the string returned will be the same as the case in
  the string that was passed to string->symbol. It is an error to
  apply mutation procedures like string-set! to strings returned by
  this procedure.

Rationale: I think it's simply clearer. The above wording specifically
permits uncased characters (i.e., characters which do not conform to
"normal" expectations of cased characters) to be present in lowercase
in identifiers even if the preferred case is uppercase, and presumably
vice versa.

> With regard to character class predicates such as char-alphabetic?
> the Revised Report says:
>
>   These procedures return #t if their arguments are alphabetic,
>   numeric, whitespace, upper case, or lower case characters,
>   respectively, otherwise they return #f. The following remarks, which
>   are specific to the ASCII character set, are intended only as a
>   guide: The alphabetic characters are the 52 upper and lower case
>   letters. The numeric characters are the ten decimal digits. The
>   whitespace characters are space, tab, line feed, form feed, and
>   carriage return.
>
> It should instead say:
>
>   These procedures return #t if their arguments are alphabetic,
>   numeric, whitespace, upper case, or lower case characters,
>   respectively, otherwise they return #f. The characters a..z and A..Z
>   must be alphabetic. The digits 0..9 must be numeric. Space and
>   newline must be whitespace.
>
>   The procedure char-alphabetic? is deprecated. New programs should
>   usually use char-letter? (see below) instead. char-alphabetic? has a
>   precise definition in terms of char-letter?:
>
>     (define (char-alphabetic? c)
>       (and (char-letter? c)
>            (char-upper-case? (char-upcase c))
>            (char-lower-case? (char-downcase c))))
>
>   In other words, a character is "alphabetic" if it is a letter and
>   the letter has both upper and lowercase forms.
>   The characters #\a..#\z
>   and #\A..#\Z are both "alphabetic" and "letters" -- however,
>   implementations are free to add letters which are not alphabetic.

This is not how linguists use the term "alphabetic." Please do not
propose "alphabetic" as a procedure to use to mean this, as it will
frustrate and confuse people.

I propose instead:

  char-alphabetic? char-numeric? char-whitespace?
  char-upper-case? char-lower-case?

  These procedures return #t if their arguments are alphabetic,
  numeric, whitespace, uppercase, or lowercase characters,
  respectively. Otherwise they return #f. The characters a..z and A..Z
  are required to be alphabetic. The digits 0..9 must be numeric. The
  space, newline, and tab characters must be whitespace. The
  characters a..z are required to be lowercase. The characters A..Z
  are required to be uppercase. No character may be both uppercase
  and lowercase.

  char-cased? char-uncased?

  Char-cased? returns #t if its argument is a character which conforms
  to "normal" case expectations (see below), and #f otherwise.
  Char-uncased? returns #t if its argument is a character which does
  not have both lowercase and uppercase forms which are single
  characters, or if it is neither the preferred uppercase mapping of
  its own preferred lowercase mapping nor the preferred lowercase
  mapping of its own preferred uppercase mapping. Otherwise, it
  returns #f.

  These two functions could be defined by:

    (define char-cased?
      (lambda (chararg)
        (and (char? (char-upcase chararg))
             (char-upper-case? (char-upcase chararg))
             (char? (char-downcase chararg))
             (char-lower-case? (char-downcase chararg))
             (or (char=? chararg (char-downcase (char-upcase chararg)))
                 (char=? chararg (char-upcase (char-downcase chararg)))))))

    (define char-uncased?
      (lambda (chararg)
        (not (char-cased? chararg))))

  In the ASCII character set, whitespace, numeric, and punctuation
  characters are uncased and alphabetic characters are cased.
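As a rough illustration only (a sketch, not part of the proposed wording), the two predicates can be exercised on ASCII characters. The block restates them so that it stands alone, using char? as the character test; it assumes char-upcase and char-downcase accept any character, and the helper name cased-chars is made up for this example.

```scheme
;; Restated so the block is self-contained.
(define (char-cased? c)
  (let ((up (char-upcase c))
        (down (char-downcase c)))
    (and (char? up)
         (char-upper-case? up)
         (char? down)
         (char-lower-case? down)
         ;; c must round-trip through at least one of the two mappings
         (or (char=? c (char-downcase up))
             (char=? c (char-upcase down))))))

(define (char-uncased? c)
  (not (char-cased? c)))

;; Hypothetical helper: keep only the cased characters of a string,
;; in their original order.
(define (cased-chars s)
  (let loop ((cs (string->list s)) (acc '()))
    (cond ((null? cs) (reverse acc))
          ((char-cased? (car cs)) (loop (cdr cs) (cons (car cs) acc)))
          (else (loop (cdr cs) acc)))))

(cased-chars "aZ 9!")   ; => (#\a #\Z)
(char-uncased? #\space) ; => #t
```

Only the two letters survive, because the space, the digit, and the punctuation mark fail the char-upper-case? test on their own upcase mapping. In an implementation with a larger character set, a letter with no case distinction would presumably be dropped the same way, provided char-upper-case? and char-lower-case? both return #f for it.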
  However, an implementation may provide uncased characters which are
  alphabetic, such as Hebrew characters or Chinese ideographs which
  have no notion of case, or German eszett, which is char-lower-case?
  but for which there is no corresponding upper case character.

Rationale: This allows char-lower-case?, char-upper-case?, and
char-alphabetic? to go on meaning the same thing with respect to the
96-character portable character set and meaning the same thing
linguists mean when they use these terms. This will reduce confusion
in the long run. This particular notion of cased and uncased
characters is also useful in other parts of the standard for saying
exactly which characters case requirements should apply to. It leaves
implementors free to not sweat about what to do with identifiers
containing eszett, regardless of what they do with calls to
(char-upcase #\eszett). Finally, the proposed redefinition of
char-alphabetic? was not sufficient to capture all the potential
problems that uncased (in the above sense) characters can cause.

> With regard to case-mapping, the specification of char-upcase and
> char-downcase says:
>
>   These procedures return a character char2 such that (char-ci=? char
>   char2). In addition, if char is alphabetic, then the result of
>   char-upcase is upper case and the result of char-downcase is lower
>   case.
>
> It should say
>
>   These procedures return a character char2 such that (char-ci=? char
>   char2). In addition, char-upcase must map a..z to A..Z and
>   char-downcase must map A..Z to a..z.

I would propose instead:

  These procedures return a character char2 such that (char-ci=? char
  char2). In addition, if char is alphabetic and cased, then the
  result of char-upcase is upper case and the result of char-downcase
  is lower case.

> The introduction to strings says:
>
>   Some of the procedures that operate on strings ignore the difference
>   between upper and lower case.
>   The versions that ignore case have
>   ``-ci'' (for ``case insensitive'') embedded in their names.
>
> It should say:
>
>   Some of the procedures that operate on strings ignore the difference
>   between strings in which upper and lower case variants of the same
>   character occur in corresponding positions. The versions that ignore
>   case have ``-ci'' (for ``case insensitive'') embedded in their
>   names.

I would propose instead:

  Some of the procedures that operate on strings ignore the difference
  between upper and lower case cased characters. The versions that
  ignore case in cased characters have ``-ci'' (for ``case
  insensitive'') embedded in their names.

I think that if we have the new procedures char-cased? and
char-uncased? we do not need the proposed char-letter? predicate.

Bear