Re: Permitting and Supporting Extended Character Sets: response. Tom Lord 09 Feb 2004 17:16 UTC


    > From: bear <xxxxxx@sonic.net>

I'll mostly answer your points in order but the last one is the most
interesting:

    > I think that if we have the new procedures char-cased? and
    > char-uncased?  we do not need the proposed char-letter?
    > predicate.

(I argue below that your definition of "cased" characters is
problematic but that's not the main point here.)

A while back, tb argued that the case-mapping procedures of R5RS could
simply be dropped.   There's something to that.

In fact, R6RS could go further -- it could:

        DROP:                   RETAIN:
        (case and classes)      (type, order, integer isomorphism)

                                char?
        char-ci=?               char=?
        char-ci<?               char<?
        char-ci>?               char>?
        char-ci<=?              char<=?
        char-ci>=?              char>=?
        char-alphabetic?        char->integer
        char-numeric?           integer->char
        char-whitespace?
        char-upper-case?
        char-lower-case?
        char-upcase
        char-downcase
        string-ci=?
        string-ci<?
        string-ci>?
        string-ci<=?
        string-ci>=?

        ADD:
        (metacircularity procedures)

        char-delimiter?
        string->character
        string->string
        string->symbol-name
        form-identifier

Why do that?  I'm not convinced we should but the arguments for doing
so would include:

~ it would remove from R5RS all traces of the naive approach
  to character case

~ it would remove from R5RS the culturally biased character
  class "alphabetic"

~ it would evade the tricky problem of defining "numeric" usefully
  yet without cultural bias

~ those changes would leave only the class CHAR-WHITESPACE? which
  seems particularly odd in isolation

~ the ability to write metacircular programs would still be
  present -- and improved

~ the basic structure of the CHAR? type, a well-ordered set isomorphic
  to a subset of the integers, would be retained

Why not do it?

~ pedagogical reasons -- for the portable character set, the
  metacircularity procedures can be defined using the dropped
  procedures

~ practical reason -- it wouldn't leave enough standard machinery in
  Scheme to parse simple formats like "whitespace separated fields"

~ practical reason -- implementors will want to provide all of the
  procedures in the DROP column for years to come, at least.  Useful
  libraries will continue to rely on them.  It is worthwhile to
  (continue to) say what they should mean.
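
As a rough illustration of the pedagogical point, CHAR-DELIMITER? for
the portable character set can be written with CHAR-WHITESPACE? from
the DROP column.  This is only a sketch: this message does not specify
CHAR-DELIMITER?, so the definition below assumes it means "terminates
a token" in the sense of the R5RS lexical syntax (whitespace,
parentheses, double quote, semicolon).

```scheme
;; Sketch only.  Assumption: CHAR-DELIMITER? answers "does this
;; character terminate a token?" following the R5RS <delimiter>
;; production, restricted to the portable character set.
(define (char-delimiter? c)
  (or (char-whitespace? c)   ; one of the DROP-column procedures
      (char=? c #\()
      (char=? c #\))
      (char=? c #\")
      (char=? c #\;)))
```

If CHAR-WHITESPACE? were dropped, an implementation would have to
supply CHAR-DELIMITER? as a primitive rather than derive it this way.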

But, on to the proposed revisions to the proposed revisions to the
revised^5 specification:

    >> It should say:

    >> Returns the name of symbol as a string. [...] will be in the
    >> implementation's preferred standard case [...]
    >> will prefer upper case, others lower case. If the symbol was
    >> returned by string->symbol, [....] string=? to
    >> the string that was passed to string->symbol. [....]

    > I would propose instead:

    >  Returns the name of symbol as a string. [...] all cased
    >  characters in the identifier (see the definition of char-cased?
    >  for a precise definition of cased and uncased characters) will
    >  be in the implementation's preferred standard case [....].  If
    >  the symbol was returned by string->symbol, the case of the
    >  characters in the string returned will be the same as the case
    >  in the string that was passed to string->symbol. [....]

    > Rationale; I think it's simply clearer.  The above wording
    > specifically permits uncased characters (ie, characters which do not
    > conform to "normal" expectations of cased characters) to be present
    > in lowercase in identifiers even if the preferred case is uppercase,
    > and presumably vice versa.

Huh.  I thought that my wording permitted that already.  I mostly
dislike your wording.

This part:

    >  all cased characters in the identifier [...] will be in the
    >  implementation's preferred standard case

seems too strong to me.  I'd be willing to accept it if we (a) nail
down a good STRING->SYMBOL-NAME definition for the "Unicode
Identifiers" draft and (b) prove that the property you named is true
for that STRING->SYMBOL-NAME and for all future versions of Unicode.

This part:

    >  If the symbol was returned by string->symbol, the case of the
    >  characters in the string returned will be the same as the case
    >  in the string that was passed to string->symbol.

is too weak.  The two strings must be STRING=?.  For example, a
Unicode STRING->SYMBOL must not canonicalize its argument (and
STRING=? is a codepoint-wise comparison).
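
The STRING=? requirement is the round trip that R5RS already states
for STRING->SYMBOL and SYMBOL->STRING.  A minimal sketch of it as a
predicate follows; the name ROUND-TRIPS? is made up for illustration.

```scheme
;; ROUND-TRIPS? is a hypothetical helper, not a proposed procedure.
;; It returns #t when the string recovered from the symbol is
;; STRING=? to the string that was passed to STRING->SYMBOL.
(define (round-trips? s)
  (string=? s (symbol->string (string->symbol s))))

(round-trips? "MixedCase")
(round-trips? "K. Harper, M.D.")
```

Both calls must return #t in a conforming implementation.  In a
Unicode implementation the same predicate would fail if
STRING->SYMBOL normalized or case-folded its argument.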

    >> With regard to character class predicates such as char-alphabetic?
    >> [...]
    >> The procedure char-alphabetic? is deprecated. New programs should
    >> usually use char-letter? (see below) instead. char-alphabetic? has a
    >> precise definition in terms of char-letter?:

    >>    (define (char-alphabetic? c)
    >>      (and (char-letter? c)
    >>           (char-upper-case? (char-upcase c))
    >>           (char-lower-case? (char-downcase c))))

    > This is not how linguists use the term "alphabetic."  Please do
    > not propose "alphabetic" as a procedure to use to mean this, as
    > it will frustrate and confuse people.

It's true that that is not how linguists use the term "alphabetic".

It's also true that not all "letters", in the sense of Unicode, are
alphabetic characters.   For example, Unicode categorizes both
ideographic characters and the characters of syllabaries as
"letters".

In a Unicode implementation, a linguistic definition of
CHAR-ALPHABETIC? would be a subset of letters generally and would
include both characters which are not cased (U+13A0 ("CHEROKEE LETTER
A")) and characters with no single-character case-mappings (U+00DF
("LATIN SMALL LETTER SHARP S")).

That would, in some sense, be an interesting procedure to have
around -- but really it belongs in a general library for linguistic
text processing (along with many other procedures).

Worse, a linguistically proper definition of CHAR-ALPHABETIC? would be
upwards incompatible with R5RS which requires that alphabetic
characters have upper and lowercase forms (which are themselves
characters).

When thinking about how to handle this situation, I reasoned this way:

1) One use for the R5RS character classes is to write programs which
   process s-expressions (e.g. source text) over the portable
   character set.   This use should be preserved.

2) Another use for the R5RS character classes is to write programs
   which parse other simple kinds of syntax.   For example, parsing
   a line of text into white-space separated fields.   This use should
   be preserved and expanded.  For example, CHAR-LETTER? allows for a
   field of letters which are not alphabetic characters or which
   are alphabetic but not case-mapped in the naive way.

3) The R5RS character classes have never been well suited for
   linguistic processing over anything but the portable character
   set.   Their use for such purposes for extended characters is
   unrealistic.

4) Upward compatibility with R5RS is desirable.

5) The specifications for the character classes defined in R6RS
   should be consistent with definitions that satisfy the
   usual expectations of a Unicode programmer.   In other words,
   in a Unicode-based implementation, these procedures should
   function as a useful subset of a comprehensive library for
   Unicode text processing.

So, I proposed: adding CHAR-LETTER? which is (consistent with being)
the generalization of CHAR-ALPHABETIC? to all "letters" (in the
Unicode sense); deprecating CHAR-ALPHABETIC? (which is esoteric at
best, nonsense at worst); and defining the class of CHAR-ALPHABETIC?
characters to be the largest subset of CHAR-LETTER?  which is
consistent with the R5RS definition.
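
The "white-space separated fields" use from point 2 can be sketched
with nothing but CHAR-WHITESPACE? and core string and list
procedures.  The name WHITESPACE-FIELDS is hypothetical, and the
example sticks to the portable character set.

```scheme
;; Split STR into a list of fields separated by runs of whitespace.
;; WHITESPACE-FIELDS is an illustrative name, not a proposed procedure.
(define (whitespace-fields str)
  (let loop ((chars (string->list str))
             (field '())     ; characters of the current field, reversed
             (fields '()))   ; completed fields, reversed
    ;; Return FIELDS with the current field (if any) added to it.
    (define (flush)
      (if (null? field)
          fields
          (cons (list->string (reverse field)) fields)))
    (cond ((null? chars) (reverse (flush)))
          ((char-whitespace? (car chars))
           (loop (cdr chars) '() (flush)))
          (else
           (loop (cdr chars) (cons (car chars) field) fields)))))

(whitespace-fields "  alpha beta   gamma ")
```

The call returns ("alpha" "beta" "gamma").  A parser for "fields of
letters" would presumably test each character with the proposed
CHAR-LETTER? in the same way, but that procedure does not exist in
R5RS.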

Now, having said all of that, the definition of CHAR-ALPHABETIC? could
be improved: the possibility of non-alphabetic letters with both
upper and lowercase forms seems plausible to me (are there any in
Unicode already?).  So, instead of that definition of CHAR-ALPHABETIC?
I would agree to:

      CHAR-ALPHABETIC? must be defined in such a way that
      this is true of all characters:

          (or (not (char-alphabetic? c))
              (and (char-letter? c)
                   (char-upper-case? (char-upcase c))
                   (char-lower-case? (char-downcase c))))
          => #t

      Note: this requirement is necessary for a combination of upward
      compatibility with earlier versions of the Revised Report and
      consistency with the new CHAR-LETTER?, yet it is also
      linguistically undesirable.  This is the reason that
      CHAR-ALPHABETIC? is described as "deprecated" -- new programs
      should avoid using this procedure and should, in most cases, use
      CHAR-LETTER? instead.  Programmers should be aware that the
      class CHAR-LETTER? may include letters such as syllables and
      ideographs which are not, in any sense, "alphabetic".  It can
      also include alphabetic characters which are neither upper nor
      lowercase, lowercase letters with no uppercase form, uppercase
      letters with no lowercase form, lowercase characters which are
      not returned by CHAR-DOWNCASE of their CHAR-UPCASE mapping, and
      uppercase characters which are not returned by CHAR-UPCASE of
      their CHAR-DOWNCASE mapping.  Programmers should also be aware
      that in some situations, a string may contain a letter followed
      by non-letters -- the sequence being "what a user would think of
      as a single letter" -- a fact which limits the utility of even
      CHAR-LETTER? unless additional facilities for text processing
      are provided by an implementation.  Yet at the same time, for
      the portable character set and for many extended characters,
      none of these peculiar circumstances apply -- programmers not
      trying to write "fully general" text processing algorithms can
      often ignore these complexities.   Programmers wanting to
      write "fully general" text algorithms, on the other hand, can
      define additional procedures which complement the standard
      character classes.
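
The invariant above can be checked mechanically.  The sketch below
does so over the characters of a string.  CHAR-LETTER? is only
proposed, so the code defines a stand-in that assumes it coincides
with CHAR-ALPHABETIC? on the portable character set; the helper names
are invented for illustration.

```scheme
;; Assumption: on the portable character set CHAR-LETTER? agrees
;; with CHAR-ALPHABETIC?.  A Unicode implementation would supply
;; its own, broader definition.
(define (char-letter? c)
  (char-alphabetic? c))

;; The proposed invariant, for a single character.
(define (alphabetic-invariant? c)
  (or (not (char-alphabetic? c))
      (and (char-letter? c)
           (char-upper-case? (char-upcase c))
           (char-lower-case? (char-downcase c)))))

;; #t when the invariant holds for every character of STR.
(define (invariant-holds-for-all? str)
  (let loop ((i 0))
    (or (= i (string-length str))
        (and (alphabetic-invariant? (string-ref str i))
             (loop (+ i 1))))))

(invariant-holds-for-all? "abcXYZ 019 ();")
```

For the portable characters in that string the check succeeds.  What
it reports for extended characters depends on how an implementation
defines the three predicates involved.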

    > char-alphabetic?
    > char-numeric?
    > char-whitespace?
    > char-upper-case?
    > char-lower-case?

    >  These procedures return #t if their arguments are alphabetic,
    >  numeric, whitespace, uppercase, or lowercase characters, respectively.
    >  Otherwise they return #f. The characters a..z and A..Z are required to
    >  be alphabetic. The digits 0..9 must be numeric.  The space, newline, and
    >  tab characters must be whitespace.  The characters a..z are required to
    >  be lowercase.  The characters A..Z are required to be uppercase.  No
    >  character may be both uppercase and lowercase.

That's consistent with my proposed revisions.  I think CHAR-LETTER?
ought to be added and CHAR-ALPHABETIC? either dropped entirely or
mentioned as deprecated.  If it is mentioned as deprecated, the
invariant shown above should be stated here.  The corresponding
sentence in the definition of CHAR-UPCASE and CHAR-DOWNCASE should be
dropped.

    > char-cased?
    > char-uncased?

    >  Char-cased? returns #t if its argument is a character which conforms to
    >  "normal" case expectations, (see below) and #f otherwise. [....]

    > Rationale: This allows char-lower-case?, char-upper-case?, and
    > char-alphabetic? to go on meaning the same thing with respect to the
    > 96-character portable character set and meaning the same thing
    > linguists mean when they use these terms.  This will reduce confusion
    > in the long run.  This particular notion of cased and uncased
    > characters is also useful in other parts of the standard for saying
    > exactly which characters case requirements should apply to.  It leaves
    > implementors free to not sweat about what to do with identifiers
    > containing eszett, regardless of what they do with calls to
    > (char-upcase #\eszett).

Among the rationales:  I think this one is false (see above):

    > This particular notion of cased and uncased characters is also
    > useful in other parts of the standard for saying exactly which
    > characters case requirements should apply to.

The other rationales are good reasons to say _something_ but I
don't think two new procedures are needed.  Instead, the possibility
of oddly-cased characters can be explicitly mentioned in the
definitions of CHAR-LOWER-CASE?, CHAR-UPPER-CASE?, and CHAR-LETTER?.

(Additionally, CASED and UNCASED seem like poor names for the classes
of characters they describe.)

    >> With regard to [...] char-upcase and char-downcase

    >> It should say

    >> [....] char-upcase must map a..z to A..Z and
    >> char-downcase must map A..Z to a..z.

    > I would propose instead:

    >  [...] if char is alphabetic and cased, then the result of
    >  char-upcase is upper case and the result of char-downcase is
    >  lower case.

I'm not sure I see any value to the stronger requirement, especially
since CHAR-ALPHABETIC? should be deprecated and there is otherwise no
need to introduce the concept of a "cased" character.  Your
alternative is implied by the definition of CHAR-ALPHABETIC? I gave in
the draft -- but you've earlier convinced me to weaken that
definition.
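
The weaker requirement from the draft (a..z map to A..Z and back) is
easy to state as a check.  This is a sketch with made-up names; it
spells out both alphabets as strings because R5RS does not promise
that CHAR->INTEGER assigns the letters consecutive values.

```scheme
;; Illustrative names only; none of these are proposed procedures.
(define portable-lower "abcdefghijklmnopqrstuvwxyz")
(define portable-upper "ABCDEFGHIJKLMNOPQRSTUVWXYZ")

;; #t when CHAR-UPCASE maps a..z to A..Z and CHAR-DOWNCASE maps
;; A..Z to a..z, position by position.
(define (portable-case-mapping-ok?)
  (let loop ((i 0))
    (or (= i (string-length portable-lower))
        (and (char=? (char-upcase (string-ref portable-lower i))
                     (string-ref portable-upper i))
             (char=? (char-downcase (string-ref portable-upper i))
                     (string-ref portable-lower i))
             (loop (+ i 1))))))

(portable-case-mapping-ok?)
```

Nothing in this check mentions "cased" characters or
CHAR-ALPHABETIC?, which is the point of preferring the weaker
wording.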

    >> The introduction to strings [....] should say:

    >> Some of the procedures that operate on strings ignore the difference
    >> between strings in which upper and lower case variants of the same
    >> character occur in corresponding positions. The versions that ignore
    >> case have ``-ci'' (for ``case insensitive'') embedded in their
    >> names.

    > I would propose instead:

    >  Some of the procedures that operate on strings ignore the difference
    >  between upper and lower case cased characters. The versions that
    >  ignore case in cased characters have ``-ci'' (for ``case
    >  insensitive'') embedded in their names.

I believe that this should be true:

	(char=? #\dotless-i #\U+0131) => #t
	(char-ci=? #\I #\dotless-i) => #t

and that STRING-CI=? is just the string equivalence induced by
CHAR-CI=?.

However, #\dotless-i is not "cased" as you have defined it.  Are you
saying that #\dotless-i and #\I are not CHAR-CI=? or that STRING-CI=?
is not the equivalence induced by CHAR-CI=??  Either way: why in the
world do that?
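
To be concrete about "the string equivalence induced by CHAR-CI=?",
here is a minimal sketch.  STRING-CI-INDUCED=? is a hypothetical
name, and the examples use only portable characters because
#\dotless-i is this message's notation rather than a standard
character name.

```scheme
;; Two strings are equivalent when they have the same length and
;; their characters are CHAR-CI=? position by position.
(define (string-ci-induced=? a b)
  (and (= (string-length a) (string-length b))
       (let loop ((i 0))
         (or (= i (string-length a))
             (and (char-ci=? (string-ref a i) (string-ref b i))
                  (loop (+ i 1)))))))

(string-ci-induced=? "Scheme" "sCHEME")
(string-ci-induced=? "abc" "abd")
```

The first call returns #t and the second #f.  Under this reading,
whatever CHAR-CI=? says about #\I and dotless i carries over to
strings with no separate notion of "cased" characters.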

-t