REPOST: Permitting and Supporting Extended Character Sets: response. bear 10 Feb 2004 02:22 UTC

<Previously posted on SRFI-50>

On Mon, 9 Feb 2004, Tom Lord wrote:

>A while back, tb argued that the case-mapping procedures of R5RS could
>simply be dropped.   There's something to that.

It's true, there's something.  They are, at the very least, library
rather than core procedures.  Still, the idea of dropping them is a
non-starter.

>Why do that?  I'm not convinced we should but the arguments for doing
>so would include:
>
>~ it would remove from R5RS all traces of the naive approach
>  to character case

This is not desirable, unless provided in a standard library.

>~ it would remove from R5RS the culturally biased character
>  class "alphabetic"

Likewise.

>~ it would evade the tricky problem of defining "numeric" usefully
>  yet without cultural bias

FWIW, I do not see how this would be achieved by the proposal nor why
it could not otherwise be achieved.

>~ the basic structure of the CHAR? type, a well-ordered set isomorphic
>  to a subset of the integers, would be retained

Think carefully about your "subset."  It is not a subrange, because
many codepoints are mapped to no character in Unicode (i.e., there are
"gaps" in the set).  Nor is it (necessarily) finite.  If the subset is
neither finite nor a subrange, there is very little value in such an
isomorphic mapping.
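
For instance (a sketch, assuming a Unicode-based CHAR? type; R5RS
itself only requires the mapping to be order-preserving, not these
particular values):

(char->integer #\A)        ; => 65 in ASCII- and Unicode-based implementations
(integer->char 65)         ; => #\A
;; (integer->char #xD800)  ; a surrogate codepoint maps to no character,
;;                         ; so an implementation may reject it

The injection into the integers survives, but its image has holes and
no fixed upper bound.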

>Why not do it?

>
>~ pedagogical reasons -- for the portable character set, the
>  metacircularity procedures can be defined using the dropped
>  procedures

>~ practical reason -- it wouldn't leave enough standard machinery in
>  Scheme to parse simple formats like "whitespace separated fields"

Both of the above points boil down to the fact that the definitions
embodied by these procedures are used in other parts of the standard.
If they were removed, that information would need to be added to
other parts of the standard instead.

>~ practical reason -- implementors will want to provide all of the
>  procedures in the DROP column for years to come, at least.  Useful
>  libraries will continue to rely on them.  It is worthwhile to
>  (continue to) say what they should mean.

And of course, this is a trump as far as I'm concerned.

>But, on to the proposed revisions to the proposed revisions to the
>revised^5 specification:

>    >> It should say:
>
>    >> Returns the name of symbol as a string. [...] will be in the
>    >> implementation's preferred standard case [...]
>    >> will prefer upper case, others lower case. If the symbol was
>    >> returned by string->symbol, [....] string=? to
>    >> the string that was passed to string->symbol. [....]
>
>
>    > I would propose instead:
>
>    >  Returns the name of symbol as a string. [...] all cased
>    >  characters in the identifier (see the definition of char-cased?
>    >  for a precise definition of cased and uncased characters) will
>    >  be in the implementation's preferred standard case [....].  If
>    >  the symbol was returned by string->symbol, the case of the
>    >  characters in the string returned will be the same as the case
>    >  in the string that was passed to string->symbol. [....]
>
>    > Rationale; I think it's simply clearer.  The above wording
>    > specifically permits uncased characters (ie, characters which do not
>    > conform to "normal" expectations of cased characters) to be present
>    > in lowercase in identifiers even if the preferred case is uppercase,
>    > and presumably vice versa.
>
>Huh.  I thought that my wording permitted that already.  I mostly
>dislike your wording.

Actually, the wording you proposed permits *any* character in the
string to be in the non-preferred case, as long as the string contains
one or more characters which are in the preferred case.

>This part:
>
>    >  all cased characters in the identifier [...] will be in the
>    >  implementation's preferred standard case
>
>seems too strong to me.  I'd be willing to accept it if (a) we nail a
>good STRING->SYMBOL-NAME definition for the "Unicode Identifiers"
>draft; (b) prove that the property you named is true for that
>STRING->SYMBOL-NAME and for all future versions of Unicode.

I chose the properties of characters I called "cased" and "uncased"
carefully; the distinctions they make are necessary and sufficient to
allow implementations to detect which characters can safely be
regarded as cased characters in the normal sense, and also admit
character sets which include linguistically uppercase or lowercase
characters which do not have proper case mappings.  If proper case
mappings for such characters are added to the character set, they
become "cased," and identifiers containing them can change form to
comply with the implementation's preferred case; but this does not
happen at the risk of losing distinctions between distinct
identifiers.
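
For concreteness, here is one way the predicates proposed further down
(char-cased? and char-uncased?) might be sketched in terms of the
existing R5RS procedures.  This is only an illustration of the
criterion just described (a cased character has a case at all and is
stable under the case mappings), not a definitive definition:

(define (char-cased? c)
  ;; a lowercase cased character is the downcase of its own upcase;
  ;; an uppercase cased character is the upcase of its own downcase
  (or (and (char-lower-case? c)
           (char=? c (char-downcase (char-upcase c))))
      (and (char-upper-case? c)
           (char=? c (char-upcase (char-downcase c))))))

(define (char-uncased? c)
  (not (char-cased? c)))

Under such a sketch, Hebrew letters (no case at all) and dotless-i (not
stable under the mappings) both come out uncased, while a..z, A..Z and
the ordinary accented Latin letters come out cased.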

>
>This part:
>
>    >  If the symbol was returned by string->symbol, the case of the
>    >  characters in the string returned will be the same as the case
>    >  in the string that was passed to string->symbol.
>
>is too weak.  The two strings must be STRING=?.

I think I'll agree with this point.  The strings must, in fact, be
string=?.

> For example, a
>Unicode STRING->SYMBOL must not canonicalize its argument (and
>STRING=? is a codepoint-wise comparison).

No, string=? is, and should be, a character-wise comparison.  The only
ways to make it into a codepoint-wise comparison are to make
non-canonical strings inexpressible, or to expressly forbid conforming
with the Unicode Consortium's requirement, which says that if the
combining codepoints _within_each_particular_combining_class_
following a base character are the same codepoints in the same
sequence, then the result "ought to be regarded as the same character"
regardless of the sequence of individual codepoints.  IOW, because
Macron and Cedilla are in different combining classes, the sequences
A, Macron, Cedilla and A, Cedilla, Macron ought to be regarded as
equal in a string comparison.

The view of characters as multi-codepoint sequences is the only way I
could find to comply with both this requirement in Unicode and the
R5RS requirement of string=? as a character-by-character comparison.

Further, both requirements are useful.
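
By way of illustration, here is a rough sketch of the comparison that
Unicode requirement implies, treating a character as a base codepoint
plus a list of combining-mark codepoints.  The tiny combining-class
table, the procedure names, and the omission of the blocking behaviour
of class-0 marks are all simplifications of mine:

;; combining classes of a few codepoints, for illustration only; the
;; real values come from the Unicode character database
(define (combining-class cp)
  (cond ((= cp #x0327) 202)      ; COMBINING CEDILLA
        ((= cp #x0304) 230)      ; COMBINING MACRON
        (else 0)))

(define (marks-of-class cls marks)
  ;; the marks of a given combining class, in their original order
  (cond ((null? marks) '())
        ((= (combining-class (car marks)) cls)
         (cons (car marks) (marks-of-class cls (cdr marks))))
        (else (marks-of-class cls (cdr marks)))))

(define (combined-char=? base1 marks1 base2 marks2)
  ;; equal iff the bases match and, within every combining class, the
  ;; same marks appear in the same order; interleaving across classes
  ;; is ignored, as the requirement describes
  (and (= base1 base2)
       (let loop ((cls 0))
         (cond ((> cls 255) #t)
               ((equal? (marks-of-class cls marks1)
                        (marks-of-class cls marks2))
                (loop (+ cls 1)))
               (else #f)))))

;; A, Macron, Cedilla vs. A, Cedilla, Macron: regarded as equal
(combined-char=? #x0041 '(#x0304 #x0327)
                 #x0041 '(#x0327 #x0304))       ; => #t

A string=? built character-by-character on top of something like
combined-char=? satisfies both the Unicode requirement and the R5RS
requirement at once.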

>    > This is not how linguists use the term "alphabetic."  Please do
>    > not propose "alphabetic" as a procedure to use to mean this, as
>    > it will frustrate and confuse people.
>
>It's true that that is not how linguists use the term "alphabetic".
>
>It's also true that not all "letters", in the sense of Unicode, are
>alphabetic characters.   For example, ideographic characters are
>categorized in Unicode as "letters";  syllabaries are classified as
>"letters".

True.  If we are to keep char-alphabetic?, then perhaps we ought also
to propose the addition of char-ideographic? and char-syllabic?; then
char-letter? could be defined as

(define char-letter? (lambda (c)
   (or (char-alphabetic? c)
       (char-ideographic? c)
       (char-syllabic? c))))

Under that system,

1 - all four terms would be consistent with the use linguists make of
    them.

2 - the more general class of 'letter' would be available, and properly
    defined, for Unicode text processing.

3 - the usefulness of primitives for parsing source text and other
    simple syntaxes would be preserved and expanded.

>Worse, a linguistically proper definition of CHAR-ALPHABETIC? would be
>upwards incompatible with R5RS which requires that alphabetic
>characters have upper and lowercase forms (which are themselves
>characters).

I think that it's better and simpler to simply decouple that
requirement from "alphabetic-ness" in R*RS.  The few things that
depend on it can be explicitly defined only on "cased" characters,
whatever you want to call them, with no damage to anything written
using the portable character set.  But in that case we need a
predicate to detect cased characters, and char-alphabetic? ain't it.

>Now, having said all of that, the definition of CHAR-ALPHABETIC? could
>be improved: The possibility of non-alphabetic letters with both
>upper and lowercase forms seems plausible to me (are there any in
>Unicode already?)

Some of the Native American syllabaries ("CANADIAN SYLLABICS ..." in
the Unicode standard) are frequently written using double-sized glyphs
to start sentences; Unicode does not currently recognize these as
separate characters, calling them a presentation form instead.

> So, instead of that definition of CHAR-ALPHABETIC?
>I would agree to:
>
>      CHAR-ALPHABETIC? must be defined in such a way that
>      this is true of all characters:
>
>          (or (not (char-alphabetic? c))
>              (and (char-letter? c)
>                   (char-upper-case? (char-upcase c))
>                   (char-lower-case? (char-downcase c))))
>          => #t

This is nonsense.  Hebrew characters are clearly alphabetic, but just
as clearly have no concept of upper case and lower case.  The property
you are looking for here is whether a character is cased, and using
the word "alphabetic" to refer to that property will lead people astray.

Further, your definition does not capture the full range of what you
need to express when checking for this property; characters such as
dotless-i will be char-alphabetic? according to the definition above
while still capable of causing bugs with char-ci=? and case-blind
identifiers because they are not the preferred lowercase mappings of
their own preferred uppercase mappings.
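
To make the dotless-i case concrete (using this post's #\dotless-i
notation for U+0131, and assuming the usual Unicode simple case
mappings):

(char-upcase #\dotless-i) => #\I
(char-downcase (char-upcase #\dotless-i)) => #\i

The round trip lands on #\i, not on #\dotless-i, so two identifiers
distinguished only by the dot would fold together in an
uppercase-preferring implementation.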

All the Latin alphabetic characters are included in the set of cased
characters, just as they are included in the worldwide set of
alphabetic characters.  What we are doing here is moving to a superset
of the currently defined set, so there is no more upward compatibility
issue in going to one superset than in going to another. If the case
requirements in R5RS are read as applying to _cased_ characters, then
all code presently extant is conforming.  If the case requirements in
R5RS are read as applying to _alphabetic_ characters, then all code
presently extant is conforming.

>That's consistent with my proposed revisions.  I think CHAR-LETTER?
>ought to be added and CHAR-ALPHABETIC? either dropped entirely or
>mentioned as deprecated.  If it is mentioned as deprecated, the
>invariant shown above should be stated here.  The corresponding
>sentence in the definition of CHAR-UPCASE and CHAR-DOWNCASE should be
>dropped.

I think the invariant you're trying to attach to char-alphabetic? does
not belong there.  Past standards writers were looking at a restricted
set of characters in which all alphabetic characters were also cased,
and they made a requirement which is appropriate only to cased
characters; they called the class of characters it should apply to
"alphabetic" only because there were no counterexamples in the set of
characters under consideration.  The requirement is valuable, and we
should keep it, but we need to apply it to the set of characters to
which it properly belongs, and simply accept the fact that not all
alphabetic characters are cased.

>    > char-cased?
>    > char-uncased?
>
>    >  Char-cased? returns #t if its argument is a character which conforms to
>    >  "normal" case expectations, (see below) and #f otherwise. [....]
>
>    > Rationale: This allows char-lower-case?, char-upper-case?, and
>    > char-alphabetic? to go on meaning the same thing with respect to the
>    > 96-character portable character set and meaning the same thing
>    > linguists mean when they use these terms.  This will reduce confusion
>    > in the long run.  This particular notion of cased and uncased
>    > characters is also useful in other parts of the standard for saying
>    > exactly which characters case requirements should apply to.  It leaves
>    > implementors free to not sweat about what to do with identifiers
>    > containing eszett, regardless of what they do with calls to
>    > (char-upcase #\eszett).
>
>Among the rationales:  I think this one is false (see above):
>
>    > This particular notion of cased and uncased characters is also
>    > useful in other parts of the standard for saying exactly which
>    > characters case requirements should apply to.
>
>The other rationales are good reasons to say _something_ but I
>don't think two new procedures are needed.  Instead, the possibility
>of oddly-cased characters can be explicitly mentioned in the
>definitions of CHAR-LOWER-CASE?, CHAR-UPPER-CASE?, and CHAR-LETTER?.

I think it is, in fact, vital.  These two predicates precisely capture
the set of characters to which the case relationships in R5RS can
meaningfully be applied.  That is necessary and sufficient for the
normal meanings of char-ci=?, string-ci=?, etc., to apply, and it
correctly captures the properties needed for case-insensitive
identifiers.
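
Operationally, that means something like the following sketch (the
starred name is mine, to avoid clobbering the standard procedure, and
char-cased? is as sketched earlier in this post):

(define (char-ci=?* a b)
  ;; fold case only when both characters are cased; uncased characters
  ;; are compared exactly, so they can never collide with cased ones
  (if (and (char-cased? a) (char-cased? b))
      (char=? (char-downcase a) (char-downcase b))
      (char=? a b)))

(char-ci=?* #\i #\I)           ; => #t
(char-ci=?* #\I #\dotless-i)   ; => #f, since #\dotless-i is uncased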

>(Additionally, CASED and UNCASED seems like poor names for the classes
>of characters they describe.)

I'm not terribly attached to the names.  Feel free to suggest
alternatives.

>    >> [....] char-upcase must map a..z to A..Z and
>    >> char-downcase must map A..Z to a..z.
>
>    > I would propose instead:
>
>    >  [...] if char is alphabetic and cased, then the result of
>    >  char-upcase is upper case and the result of char-downcase is
>    >  lower case.
>
>I'm not sure I see any value to the stronger requirement, especially
>since CHAR-ALPHABETIC? should be deprecated and there is otherwise no
>need to introduce the concept of a "cased" character.

I think I'm with you on this; alphabetic-ness isn't the important
property here.  It should probably read:

[..] if char is cased, then the result of char-upcase is upper case
and the result of char-downcase is lower case.
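
Stated in the same style as the invariant quoted earlier, and using
char-cased? as sketched above, that amounts to requiring this to be
true of every character c:

(or (not (char-cased? c))
    (and (char-upper-case? (char-upcase c))
         (char-lower-case? (char-downcase c))))
=> #t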

>    >> The introduction to strings [....] should say:
>
>    >> Some of the procedures that operate on strings ignore the difference
>    >> between strings in which upper and lower case variants of the same
>    >> character occur in corresponding positions. The versions that ignore
>    >> case have ``-ci'' (for ``case insensitive'') embedded in their
>    >> names.
>
>    > I would propose instead:
>
>    >  Some of the procedures that operate on strings ignore the difference
>    >  between upper and lower case cased characters. The versions that
>    >  ignore case in cased characters have ``-ci'' (for ``case
>    >  insensitive'') embedded in their names.
>
>I believe that this should be true:
>
>	(char=? #\dotless-i #\U+0131) => #t
>	(char-ci=? #\I #\dotless-i) => #t
>
>and that STRING-CI=? is just the string equivalence induced by
>CHAR-CI=?.

I believe that

(char-ci=? #\I #\dotless-i) => #f

because

(char=? (char-downcase #\I) #\dotless-i) => #f.

>However, #\dotless-i is not "cased" as you have defined it.  Are you
>saying that #\dotless-i and #\I are not CHAR-CI=? or that STRING-CI=?
>is not the equivalence induced by CHAR-CI=??  Either way: why in the
>world do that?

Dotless-i is not cased because it is not stable under case mappings.
It is not the preferred lowercase form of its own preferred uppercase
form.

(char=? #\dotless-i (char-downcase (char-upcase #\dotless-i))) => #f

If you have a system in which #\dotless-i and #\i are both treated as
cased characters whose uppercase is #\I, then two identifiers, one
written using a dotted lowercase i and one written using a dotless i,
can be confused with one another in an implementation whose preferred
case is uppercase. #\dotless-i and #\I therefore ought not be regarded
as char-ci=? in any system which also regards #\i and #\I as
char-ci=?.  It is true that

(char=? (char-upcase #\dotless-i) (char-upcase #\i) #\I) => #t,

But given that you want to require

(char=? (char-downcase #\I) #\dotless-i) => #f and
(char=? (char-downcase #\I) #\i) => #t,

it is clearly unsupportable to choose #\dotless-i over #\i as the
lower case character which is char-ci=? to #\I.

Note: I think that all Scheme code should be read and written using
some standard locale like the "C locale" for portability, and I think
that it should be a locale in which (char-ci=? #\i #\I) => #t.

It is possible, however, that in some locales #\dotless-i would be a
cased character, because it would be the preferred lowercase form of
its own preferred uppercase form.  In those locales, however, #\i
would be a cased character if and only if it had the same reciprocal
relationship with a _different_ upper case character, most likely
U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE.

The result of case-mapping via char-ci=? only on cased characters is
that distinct identifiers written using these characters remain
distinct no matter what the preferred case of the implementation
is. That's the desirable, crucial property that I was trying to
capture with the distinction between cased and uncased characters.

				Bear