REPOST:Permitting and Supporting Extended Character Sets: response. (fwd)

bear 10 Feb 2004 02:21 UTC
<previously posted on SRFI-50>

On Sat, 7 Feb 2004, Tom Lord wrote:

> (in "Permitting and Supporting Extended Character Sets":)

> The specification of symbol->string says:
> Returns the name of symbol as a string. If the symbol was part of an
> object returned as the value of a literal expression (section 4.1.2)
> or by a call to the read procedure, and its name contains alphabetic
> characters, then the string returned will contain characters in the
> implementation's preferred standard case -- some implementations
> will prefer upper case, others lower case. If the symbol was
> returned by string->symbol, the case of characters in the string
> returned will be the same as the case in the string that was passed
> to string->symbol. It is an error to apply mutation procedures like
> string-set! to strings returned by this procedure.

> It should say:
> Returns the name of symbol as a string. If the symbol was part of an
> object returned as the value of a literal expression (section 4.1.2)
> or by a call to the read procedure, its name will be in the
> implementation's preferred standard case -- some implementations
> will prefer upper case, others lower case. If the symbol was
> returned by string->symbol, the string returned will be string=? to
> the string that was passed to string->symbol. It is an error to
> apply mutation procedures like string-set! to strings returned by
> this procedure.

I would propose instead:

 Returns the name of symbol as a string. If the symbol was part of an
 object returned as the value of a literal expression (section 4.1.2)
 or by a call to the read procedure, then all cased characters in the
 identifier (see the definition of char-cased? for a precise definition
 of cased and uncased characters) will be in the implementation's preferred
 standard case -- some implementations will prefer upper case, others
 lower case.  If the symbol was returned by string->symbol, the case
 of the characters in the string returned will be the same as the case
 in the string that was passed to string->symbol. It is an error to
 apply mutation procedures like string-set! to strings returned by this
 procedure.

Rationale: I think it's simply clearer.  The above wording
specifically permits uncased characters (i.e., characters which do not
conform to "normal" expectations of cased characters) to be present
in lowercase in identifiers even if the preferred case is uppercase,
and presumably vice versa.
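
To make the two cases concrete, here is a small sketch of my own (not
part of either proposed wording).  The names round-trip and from-literal
are made up for illustration.  Note that implementations with
case-sensitive symbols will simply leave the literal as written, which
is why only the case-insensitive comparison is portable there.

```scheme
;; A symbol made by string->symbol keeps the exact case it was given.
(define round-trip (symbol->string (string->symbol "MixedCase")))

;; A symbol from a literal is in the implementation's preferred case
;; under the R5RS wording: "mixedcase" or "MIXEDCASE".  A case-sensitive
;; implementation would give "MixedCase" instead.
(define from-literal (symbol->string 'MixedCase))

(string=? round-trip "MixedCase")       ; always #t
(string-ci=? from-literal "mixedcase")  ; #t whatever the preferred case
```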

> With regard to character class predicates such as char-alphabetic?
> the Revised Report says:
> These procedures return #t if their arguments are alphabetic,
> numeric, whitespace, upper case, or lower case characters,
> respectively, otherwise they return #f. The following remarks, which
> are specific to the ASCII character set, are intended only as a
> guide: The alphabetic characters are the 52 upper and lower case
> letters. The numeric characters are the ten decimal digits. The
> whitespace characters are space, tab, line feed, form feed, and
> carriage return.

> It should instead say:
> These procedures return #t if their arguments are alphabetic,
> numeric, whitespace, upper case, or lower case characters,
> respectively, otherwise they return #f. The characters a..z and A..Z
> must be alphabetic. The digits 0..9 must be numeric. Space and
> newline must be whitespace.

> The procedure char-alphabetic? is deprecated. New programs should
> usually use char-letter? (see below) instead. char-alphabetic? has a
> precise definition in terms of char-letter?:

>    (define (char-alphabetic? c)
>      (and (char-letter? c)
>           (char-upper-case? (char-upcase c))
>           (char-lower-case? (char-downcase c))))

> In other words, a character is "alphabetic" if it is a letter and
> the letter has both upper and lowercase forms. The characters #\a..#\z
> and #\A..#\Z are both "alphabetic" and "letters" -- however,
> implementations are free to add letters which are not alphabetic.

This is not how linguists use the term "alphabetic."  Please do not
propose "alphabetic" as the name of a procedure with this meaning, as
it will frustrate and confuse people.

I propose instead:

char-alphabetic?
char-numeric?
char-whitespace?
char-upper-case?
char-lower-case?

 These procedures return #t if their arguments are alphabetic,
 numeric, whitespace, uppercase, or lowercase characters, respectively.
 Otherwise they return #f. The characters a..z and A..Z are required to
 be alphabetic. The digits 0..9 must be numeric.  The space, newline, and
 tab characters must be whitespace.  The characters a..z are required to
 be lowercase.  The characters A..Z are required to be uppercase.  No
 character may be both uppercase and lowercase.

char-cased?
char-uncased?

 Char-cased? returns #t if its argument is a character which conforms to
 "normal" case expectations (see below), and #f otherwise.

 Char-uncased? returns #t if its argument is a character which does not
 have both lowercase and uppercase forms which are single characters, or
 if it is neither the preferred uppercase mapping of its own preferred
 lowercase mapping nor the preferred lowercase mapping of its own
 preferred uppercase mapping.  Otherwise, it returns #f.

 These two functions could be defined by:

   ;; char? is the standard character predicate; these checks guard
   ;; against a case mapping that does not yield a single character.
   (define char-cased? (lambda (chararg)
      (and (char? (char-upcase chararg))
           (char-upper-case? (char-upcase chararg))
           (char? (char-downcase chararg))
           (char-lower-case? (char-downcase chararg))
           (or (char=? chararg (char-downcase (char-upcase chararg)))
               (char=? chararg (char-upcase (char-downcase chararg)))))))

   (define char-uncased? (lambda (chararg)
       (not (char-cased? chararg))))

 In the ASCII character set, whitespace, numeric, and punctuation
 characters are uncased and alphabetic characters are cased.  However,
 an implementation may provide uncased characters which are alphabetic,
 such as Hebrew characters or Chinese ideographs which have no notion
 of case, or German eszett, which is char-lower-case? but for which
 there is no corresponding upper case character.
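
As a quick illustration of the ASCII claims, here is a self-contained
sketch.  It uses the same logic as the definition above, written with
the standard char? predicate, and assumes an R5RS-style top level where
the character procedures are already bound.  I only show ASCII results,
since what happens for eszett and friends depends on the
implementation's case mappings.

```scheme
;; Cased: both mappings yield single characters of the expected case,
;; and the character round-trips through at least one of them.
(define (char-cased? c)
  (let ((up (char-upcase c))
        (down (char-downcase c)))
    (and (char? up) (char-upper-case? up)
         (char? down) (char-lower-case? down)
         (or (char=? c (char-downcase up))
             (char=? c (char-upcase down))))))

(define (char-uncased? c) (not (char-cased? c)))

(char-cased? #\a)        ; #t
(char-cased? #\Q)        ; #t
(char-uncased? #\7)      ; #t, digits have no case
(char-uncased? #\space)  ; #t
(char-uncased? #\;)      ; #t
```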

Rationale: This allows char-lower-case?, char-upper-case?, and
char-alphabetic? to go on meaning the same thing with respect to the
96-character portable character set and meaning the same thing
linguists mean when they use these terms.  This will reduce confusion
in the long run.  This particular notion of cased and uncased
characters is also useful in other parts of the standard for saying
exactly which characters case requirements should apply to.  It leaves
implementors free to not sweat about what to do with identifiers
containing eszett, regardless of what they do with calls to
(char-upcase #\eszett).

Finally, the proposed redefinition of char-alphabetic? was not
sufficient to capture all the potential problems that uncased (in the
above sense) characters can cause.

> With regard to case-mapping, the specification of char-upcase and char-downcase says:

> These procedures return a character char2 such that (char-ci=? char
> char2). In addition, if char is alphabetic, then the result of
> char-upcase is upper case and the result of char-downcase is lower
> case.

> It should say

> These procedures return a character char2 such that (char-ci=? char
> char2). In addition, char-upcase must map a..z to A..Z and
> char-downcase must map A..Z to a..z.

I would propose instead:

 These procedures return a character char2 such that (char-ci=? char
 char2). In addition, if char is alphabetic and cased, then the result
 of char-upcase is upper case and the result of char-downcase is lower
 case.
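
For what it's worth, the requirement is easy to check mechanically.
The sketch below is restricted to the ASCII range, where every
alphabetic character is also cased, so char-alphabetic? can stand in
for "alphabetic and cased".  The names case-mapping-ok? and
ascii-case-mappings-ok? are hypothetical helpers of mine, not proposed
procedures.

```scheme
;; Both mappings must be char-ci=? to the original, and for ASCII
;; letters the results must really be upper and lower case.
(define (case-mapping-ok? c)
  (and (char-ci=? c (char-upcase c))
       (char-ci=? c (char-downcase c))
       (or (not (char-alphabetic? c))
           (and (char-upper-case? (char-upcase c))
                (char-lower-case? (char-downcase c))))))

;; Walk code points 0..127 and stop at the first failure.
(define (ascii-case-mappings-ok?)
  (let loop ((i 0))
    (cond ((= i 128) #t)
          ((case-mapping-ok? (integer->char i)) (loop (+ i 1)))
          (else #f))))

(ascii-case-mappings-ok?)  ; #t on a conforming implementation
```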

> The introduction to strings says:

> Some of the procedures that operate on strings ignore the difference
> between upper and lower case. The versions that ignore case have
> ``-ci'' (for ``case insensitive'') embedded in their names.

> It should say:

> Some of the procedures that operate on strings ignore the difference
> between strings in which upper and lower case variants of the same
> character occur in corresponding positions. The versions that ignore
> case have ``-ci'' (for ``case insensitive'') embedded in their
> names.

I would propose instead:

 Some of the procedures that operate on strings ignore the difference
 between the upper and lower case forms of cased characters. The
 versions that ignore case in cased characters have ``-ci'' (for ``case
 insensitive'') embedded in their names.
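
A few ASCII-only comparisons show the intent.  This is just a sketch
using the standard string=? and string-ci=? procedures: cased
characters are folded, while uncased ones (digits, punctuation) must
still match exactly.

```scheme
(string-ci=? "Scheme" "SCHEME")  ; #t, differs only in cased characters
(string-ci=? "r5rs" "R5RS")      ; #t, the digits compare as themselves
(string-ci=? "a-b" "A_B")        ; #f, uncased characters differ
(string=? "Scheme" "SCHEME")     ; #f, the plain version is case sensitive
```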

I think that if we have the new procedures char-cased? and char-uncased?
we do not need the proposed char-letter? predicate.

				Bear