character predicates Tom Lord 10 Feb 2004 21:06 UTC

Alex raises this question:

     > The question remains how to handle R5RS character predicates
     > related to these values:

     >   * char-alphabetic? char
     >   * char-numeric? char
     >   * char-whitespace? char   ; rename to char-white-space? please!
     >   * char-upper-case? char   ; rename to char-uppercase? please!
     >   * char-lower-case? char   ; rename to char-lowercase? please!

     > As mentioned, it can be useful to have these functioning on
     > pure ASCII for use in parsers and tools for common protocols.
     > Moreover, the Unicode equivalents are often very expensive (if
     > not in time then in space).  Should a Scheme that wants to
     > provide the full Unicode equivalents of these extend the core
     > procedures or should we define disjoint procedures such as

     >   * char-unicode-alphabetic?

First, while I like the suggested renames (e.g., CHAR-LOWER-CASE? ->
CHAR-LOWERCASE?), I think it is not the place of _this_ SRFI to propose
those changes.   They would do nothing to help implementations provide
Unicode support while also conforming to R6RS.

Second, I think it is essential that, regardless of any changes
proposed in this SRFI, those procedures must have the same behavior
they have in R5RS when applied to the "portable Scheme character
set".   The portable character set is not quite ASCII (the integer
mappings are not specified and not all ASCII characters are included,
even abstractly) -- but it can be regarded as a subset of the abstract
characters encoded in ASCII.

Third, should a Unicode Scheme extend those predicates or define new
ones?  In a weak sense, that's not a question for this SRFI.   I've
specified the answer I prefer in another draft ("Scheme Characters as
(Extended) Unicode Codepoints",
http://regexps.srparish.net/srfi-drafts/unicode-chars.srfi) but those
answers should not be presumed by this SRFI.

What _are_ questions for this SRFI are these: should an implementation
be _permitted_ to extend those predicates?  If so, should it be permitted
to extend them in the "most natural" way for Unicode characters?  (The
other draft I just mentioned explains what I think the "most natural"
way is.)

The strict letter of the law in R5RS says (by implication):

  ~ Yes, implementations _may_ extend those predicates.
    (They are, indeed, expected to do so.)

  ~ No, implementations _may_not_ use the most natural Unicode
    definitions.   In particular, R5RS requires that, applied to an
    alphabetic character, CHAR-UPCASE return an upper case character
    and CHAR-DOWNCASE return a lower case character.   So the
    specifications for all of these procedures:

        char-alphabetic?
        char-upper-case?
        char-lower-case?
        char-upcase
        char-downcase
        char-ci=?

    are "intertwingled" in an unfortunate way: not all Unicode
    characters that ought to be considered "alphabetic" satisfy
    the case-mapping requirements of R5RS.
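
As a rough sketch of that constraint, the R5RS requirement on
alphabetic characters can be written as a check.  The name
R5RS-CASE-RULE-HOLDS? is made up for illustration; it is not part of
R5RS or of this SRFI.

```scheme
;; Sketch only: R5RS-CASE-RULE-HOLDS? is a hypothetical helper.
;; It returns #t when C is not alphabetic, or when C is alphabetic
;; and upcasing yields an upper case character and downcasing
;; yields a lower case character, as R5RS requires.
(define (r5rs-case-rule-holds? c)
  (or (not (char-alphabetic? c))
      (and (char-upper-case? (char-upcase c))
           (char-lower-case? (char-downcase c)))))

(r5rs-case-rule-holds? #\a)   ; => #t
(r5rs-case-rule-holds? #\7)   ; => #t  (not alphabetic, so nothing to check)
```

On a hypothetical implementation whose predicates use the natural
Unicode definitions, and assuming INTEGER->CHAR accepts codepoints, a
caseless letter such as Hebrew alef, (integer->char #x5D0), would be
expected to fail this check.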

The situation is worsened by the relationship between those
procedures, STRING-CI=?, identifier equivalence, and the relationship
between a literal symbol name in a program text and the string
returned by SYMBOL->STRING for that symbol.  For example, R5RS says
(by implication) that the SYMBOL->STRING value can be formed from an
identifier name by applying one of (depending on the implementation's
preferred case) CHAR-UPCASE or CHAR-DOWNCASE to each character of the
identifier.  Unicode defines a (fairly complicated) algorithm for
"case-insensitive identifier equivalence" -- but it has little resemblance
to the naive algorithm implied by R5RS.
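
The naive algorithm implied by R5RS can be sketched in a few lines.
NAIVE-SYMBOL-NAME is a hypothetical name, and the sketch assumes an
implementation whose preferred case is lower case.

```scheme
;; Sketch only: NAIVE-SYMBOL-NAME is a hypothetical helper.
;; It canonicalizes an identifier spelling by downcasing each
;; character independently, which is all that R5RS implies.
(define (naive-symbol-name identifier)
  (list->string (map char-downcase (string->list identifier))))

(naive-symbol-name "Hello-World")   ; => "hello-world"
```

Because it maps one character at a time, this procedure can never
change the length of a name.  That is one reason it cannot express
Unicode case mappings such as the upper case of U+00DF (sharp s),
which is the two-character sequence SS.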

My opinion is that R5RS is wrong to forbid the "most natural" Unicode
extensions of these standard procedures.  Some of the revisions
proposed in this SRFI are aimed at removing that restriction.

In designing the proposed revisions, I reasoned this way:

~ Incompatible Changes Must Not Be Made.

  Specifically, the unfortunate "intertwingling" of the procedures
  listed above all hinges on CHAR-ALPHABETIC?.   By happy coincidence,
  with a global character set, CHAR-ALPHABETIC? is a poor choice of
  name for the concept that procedure is intended to capture --
  CHAR-LETTER? (that's "Letter" in the broad sense of the Unicode
  standard) is a better name.

  Rather than undo the intertwingling by changing _any_ of the
  procedures in an incompatible way, it is simpler to leave
  CHAR-ALPHABETIC? in its damaged state, deprecate it, and introduce a
  new procedure -- CHAR-LETTER? -- defined in a way that doesn't
  perpetuate the problems.

~ Implementations Must Provide the Identifier -> Symbol Mapping

  The naive process of applying CHAR-UPCASE (or CHAR-DOWNCASE) to
  every character in an identifier to yield its canonical symbol name
  is far removed from reality.

  The Unicode process for canonicalizing a symbol name is quite
  complicated and would require a great deal of more primitive
  machinery to implement in Scheme.

  Finally, this SRFI is _not_ intended to be Unicode-specific: only
  to be Unicode-permissive.   So it is not the place of this SRFI to
  specify a canonicalization algorithm.

  Therefore, I have proposed that the revised report just give general
  guidance (that case distinctions are ignored; that implementations
  have a preferred case) -- but that implementations must also provide
  their canonicalization algorithm as a required procedure.  Rather than
  trying to "casemap identifiers" themselves, programs should use the
  new STRING->SYMBOL-NAME procedure.

~ The Portable Character Set Must Retain Its Simple Structure

  For example, if an identifier name is spelled using only the
  portable character set, then the CHAR-UPCASE (or CHAR-DOWNCASE)
  technique for canonicalizing that identifier name should continue
  to work.

  Really, this requirement is a kind of "corollary" of the earlier
  one that "Incompatible Changes Must Not Be Made" but it is worth
  mentioning specially.

  In the draft SRFI, I have ensured that the portable character set
  retains its simple structure by including explicit language to that
  effect.   For example, CHAR-UPCASE and CHAR-DOWNCASE are described:

     These procedures return a character CHAR2 such that
     (CHAR-CI=? CHAR CHAR2). In addition, CHAR-UPCASE must
     map a..z to A..Z and CHAR-DOWNCASE must map A..Z to a..z.
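
  One way to check that requirement on a given implementation is
  sketched below.  PORTABLE-CASE-MAPPING-OK? and the two string
  constants are names invented here; they are not part of the draft.

```scheme
;; Sketch only: all names defined here are hypothetical.
(define lower-letters "abcdefghijklmnopqrstuvwxyz")
(define upper-letters "ABCDEFGHIJKLMNOPQRSTUVWXYZ")

;; Walk both strings in step and confirm that CHAR-UPCASE maps
;; a..z to A..Z and that CHAR-DOWNCASE maps A..Z to a..z.
(define (portable-case-mapping-ok?)
  (let loop ((i 0))
    (cond ((= i (string-length lower-letters)) #t)
          ((and (char=? (char-upcase (string-ref lower-letters i))
                        (string-ref upper-letters i))
                (char=? (char-downcase (string-ref upper-letters i))
                        (string-ref lower-letters i)))
           (loop (+ i 1)))
          (else #f))))

;; Should be #t on an implementation that meets the requirement.
(portable-case-mapping-ok?)
```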

-t