Re: Permitting and Supporting Extended Character Sets: response. Tom Lord 09 Feb 2004 22:10 UTC


    > From: bear <xxxxxx@sonic.net>

    > The result of case-mapping via char-ci=? only on cased characters is
    > that distinct identifiers written using these characters remain
    > distinct no matter what the preferred case of the implementation
    > is. That's the desirable, crucial property that I was trying to
    > capture with the distinction between cased and uncased characters.

I don't see why that property is crucial.

Here's where I am on these things:  please have a look at the
"References" section of the "Unicode Identifiers" draft.  The
consortium has made recommendations for case-insensitive programming
languages.   I think we should follow those and I don't think that
they're consistent with what you are advocating.

    [re: symbol->string as applied to symbols returned by READ or
     by a symbol literal in source text]

    > Actually, the wording you proposed permits *any* character in the
    > string to be in the non-preferred case, as long as the string contains
    > one or more characters which are in the preferred case.

Not quite.

Cumulatively, the set of changes I've proposed:

1) require that for symbols with portable names (in which, therefore,
   all non-punctuation characters are latin letters), either all
   the letters are uppercase or they are all lowercase -- just as
   in R5RS.

2) are deliberately vague about the meaning of "preferred case" for
   extended characters.  In this sense, yes -- an implementation can
   include (extended) characters in the opposite of the preferred case
   in the symbol name of a symbol returned by READ.   I believe that
   this permissiveness is essential for Unicode support unless we are
   to require implementations to restrict the lexical syntax of
   identifiers far beyond the restrictions recommended by Unicode.
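
As a rough illustration of point 1, the check below tests whether the
letters in a symbol name are all uppercase or all lowercase.  The name
SYMBOL-NAME-UNIFORM-CASE? is hypothetical; it appears in neither R5RS
nor any of the drafts, and the sketch only makes sense for names
spelled with portable characters.

```scheme
;; Hypothetical helper, for illustration only (not part of R5RS or any draft).
;; Returns #t when the letters of NAME are all uppercase or all lowercase.
(define (symbol-name-uniform-case? name)
  ;; Walk the string, remembering whether an uppercase or a lowercase
  ;; letter has been seen.  Digits and punctuation are ignored.
  (let loop ((i 0) (saw-upper #f) (saw-lower #f))
    (if (= i (string-length name))
        (not (and saw-upper saw-lower))
        (let ((c (string-ref name i)))
          (loop (+ i 1)
                (or saw-upper
                    (and (char-alphabetic? c) (char-upper-case? c)))
                (or saw-lower
                    (and (char-alphabetic? c) (char-lower-case? c))))))))

;; Whatever the preferred case of the implementation, the name of a
;; symbol literal spelled with portable characters should pass:
(symbol-name-uniform-case? (symbol->string 'list->vector))
```

For extended characters, point 2 deliberately makes no such promise.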

    > I chose the properties of characters I called "cased" and "uncased"
    > carefully; the distinctions they make are necessary and sufficient to
    > allow implementations to detect which characters can safely be
    > regarded as cased characters in the normal sense,

I assume that you mean (Scheme) programs, not implementations.

Programs can already detect which characters are naively cased in the
sense of your terms.   That you are able to define your CHAR-CASED in
a few lines of R5RS illustrates that.

Your two procedures would be useful if they were, in fact, needed for
expository purposes elsewhere in the spec -- but the two uses you put
them to were both needless.

    >> For example, a Unicode STRING->SYMBOL must not canonicalize its
    >> argument (and STRING=? is a codepoint-wise comparison).

    > No, string=? is, and should be, a character-wise comparison.  The only
    > ways to make it into a codepoint-wise comparison are to make
    > non-canonical strings inexpressible, or to expressly forbid conforming
    > with the unicode consortium's requirement which says that if the
    > combining codepoints _within_each_particular_combining_class_
    > following a base character are the same codepoints in the same
    > sequence, then the result "ought to be regarded as the same character"
    > regardless of the sequence of individual codepoints.

Since you (elsewhere) agree that:

	(string=? s (symbol->string (string->symbol s))) => #t

this disagreement is immaterial for the "Permitting and Supporting
Extended Character Sets" draft.   Please consider me to have written:

	For example, a Unicode STRING->SYMBOL _may_ wish to not
        canonicalize .....

and my point stands.

You'll want to take up this issue separately, in response to "Scheme
Characters as (Extended) Unicode Codepoints", I think.

    > IOW, because Macron and Cedilla are in different combining
    > classes, the sequences A, Macron, Cedilla and A, Cedilla, Macron
    > ought to be regarded as equal in a string comparison.

Not by STRING=? in a Scheme in which the strings are regarded as
codepoint sequences, since STRING=? is the equivalence relation
induced by CHAR=?.   And, incidentally, although that STRING=? is not
the linguistically sensitive string-equality relation that Unicode
defines, it _is_ a useful procedure to have around for _implementing_
Unicode text processes.
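
To make "the equivalence relation induced by CHAR=?" concrete, here is
a minimal sketch in portable Scheme.  The name CODEPOINT-STRING=? is
made up for this illustration; in a Scheme whose characters are
codepoints, it is meant to behave the way STRING=? would.

```scheme
;; Hypothetical name, for illustration only.
;; Two strings are equal when they have the same length and their
;; characters are pairwise CHAR=?.  No normalization is attempted.
(define (codepoint-string=? a b)
  (and (= (string-length a) (string-length b))
       (let loop ((i 0))
         (or (= i (string-length a))
             (and (char=? (string-ref a i) (string-ref b i))
                  (loop (+ i 1)))))))

(codepoint-string=? "lambda" "lambda")
(codepoint-string=? "lambda" "lambdA")
```

In an implementation where INTEGER->CHAR accepts Unicode codepoints,
the two orderings from the quoted example (#x41 followed by #x304 and
#x327, versus #x41 followed by #x327 and #x304) would compare unequal
under this procedure, because the second characters already differ.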

    > The view of characters as multi-codepoint sequences is the only way I
    > could find to comply with both this requirement in Unicode and the
    > R5RS requirement of string=? as a character-by-character comparison.

A simpler way is to say that STRING=? and the Unicode equivalence
relationship you have in mind are separate procedures.  That some
Unicode text processes are defined in terms of a codepoint-wise
STRING=?-style comparison is one reason why I like the design in
"Scheme Characters as (Extended) Unicode Codepoints".

    >> It's true that that is not how linguists use the term "alphabetic".

    >> It's also true that not all "letters", in the sense of Unicode, are
    >> alphabetic characters.   For example, ideographic characters are
    >> categorized in Unicode as "letters";  syllabaries are classified as
    >> "letters".

    > True.  If we are to keep char-alphabetic?, then perhaps we ought to
    > also propose the addition of char-ideographic? and char-syllabic?;
    > then char-letter?  could be defined as

    > (define char-letter? (lambda (c)
    >    (or (char-alphabetic? c)
    >        (char-ideographic? c)
    >        (char-syllabic? c))))

    > Under that system,

    > 1 - all four terms would be consistent with the use linguists make of
    >     them.

    > 2 - the more general class of 'letter' would be available, and properly
    >     defined, for unicode text processing.

    > 3 - the usefulness of primitives for parsing source text and other
    >     simple syntaxes would be preserved and expanded.

_IF_ it were possible to define CHAR-ALPHABETIC? in a way which was
both linguistically correct _and_ upwards compatible with R5RS then
perhaps that would be almost a good idea.  I say "almost" because
CHAR-IDEOGRAPHIC? and CHAR-SYLLABIC? add bloat and those plus
CHAR-ALPHABETIC? fail to be a complete enumeration of letter
types....

But CHAR-ALPHABETIC? is just a botch.  It can not be rescued.   All of
these character classes belong elsewhere, with different names -- in a
"Linguistic Text Processing" SRFI.

    >> Worse, a linguistically proper definition of CHAR-ALPHABETIC?
    >> would be upwards incompatible with R5RS which requires that
    >> alphabetic characters have upper and lowercase forms (which are
    >> themselves characters).

    > I think that it's better and simpler to simply decouple that
    > requirement from "alphabetic-ness" in R*RS.  The few things that
    > depend on it can be explicitly defined only on "cased" characters,
    > whatever you want to call them, with no damage to anything written
    > using the portable character set.  But in that case we need a
    > predicate to detect cased characters, and char-alphabetic? ain't
    > it.

A predicate to detect "cased" characters can be trivially synthesized
from CHAR-UPCASE, CHAR-DOWNCASE, and CHAR=?.   I see no need for it to
be required by R6RS.
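
One way such a predicate might be synthesized is sketched below.  The
name CHAR-HAS-CASE-VARIANT? is invented here, and this is only one
plausible reading of "cased"; the definition in the other proposal may
differ in detail (for example, by also demanding that the mappings
round-trip).

```scheme
;; Hypothetical name, for illustration only.
;; A character counts as "cased" here if upcasing or downcasing it
;; yields a different character.
(define (char-has-case-variant? c)
  (not (and (char=? c (char-upcase c))
            (char=? c (char-downcase c)))))

(char-has-case-variant? #\a)
(char-has-case-variant? #\7)
```

Only CHAR-UPCASE, CHAR-DOWNCASE, and CHAR=? are used, all of which
R5RS already requires.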

Breaking CHAR-ALPHABETIC? in the way that you propose will not break
correct portable programs whose _input_data_ consists only of portable
characters, but it can break correct portable programs whose input
data includes extended characters.   There is no particular reason to
introduce that breakage.

    >> Now, having said all of that, the definition of CHAR-ALPHABETIC? could
    >> be improved: The possibility of non-alphabetic letters with both
    >> upper and lowercase forms seems plausible to me (are there any in
    >> Unicode already?)

    > Some of the native american syllabaries ("CANADIAN SYLLABICS
    > ..." in the Unicode standard) are frequently written using
    > double-sized glyphs to start sentences; Unicode does not
    > currently recognize these as separate characters, calling them a
    > presentation form instead.

Technical disagreements aside, it's learning things like that (I'm
taking you at your word for now :-) that makes Unicode work so much
fun, isn't it?

Along those lines: one of the Pika hackers recently stumbled across
U+0F33 ("TIBETAN DIGIT HALF ZERO") which is given the numeric value
"-1/2".  There's U+0F32 ("TIBETAN DIGIT HALF NINE") given the numeric
value "17/2" and some other mysteries.  I'm looking forward to
learning the notational system these are from :-)

    >> So, instead of that definition of CHAR-ALPHABETIC?
    >> I would agree to:

    >>      CHAR-ALPHABETIC? must be defined in such a way that
    >>      this is true of all characters:

    >>          (or (not (char-alphabetic? c))
    >>              (and (char-letter? c)
    >>                   (char-upper-case? (char-upcase c))
    >>                   (char-lower-case? (char-downcase c))))
    >>          => #t

    > This is nonsense.  Hebrew characters are clearly alphabetic, but just
    > as clearly have no concept of upper case and lower case.  The property
    > you are looking for here is whether a character is cased, and using
    > the word "alphabetic" to refer to that property will lead people astray.

No.   You are thinking that I am trying to make CHAR-ALPHABETIC?
linguistically useful.   What I'm actually trying to do is to minimize
the degree to which CHAR-ALPHABETIC? is linguistically useless.  The
invariant above is in that spirit.
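
The invariant can be written as a checkable procedure, roughly as
follows.  CHAR-LETTER? is not an R5RS procedure, so this sketch takes
it as an argument; the name ALPHABETIC-INVARIANT-HOLDS? is likewise
made up for the illustration.

```scheme
;; Hypothetical helper, for illustration only.
;; CHAR-LETTER? is passed in because R5RS does not define it.
(define (alphabetic-invariant-holds? c char-letter?)
  (or (not (char-alphabetic? c))
      (and (char-letter? c)
           (char-upper-case? (char-upcase c))
           (char-lower-case? (char-downcase c)))))

;; For portable characters, CHAR-ALPHABETIC? itself can stand in for
;; CHAR-LETTER? as a crude approximation:
(alphabetic-invariant-holds? #\a char-alphabetic?)
(alphabetic-invariant-holds? #\7 char-alphabetic?)
```

Characters for which CHAR-ALPHABETIC? is false satisfy the invariant
trivially, which is the point: it constrains what an implementation
may call alphabetic, not what counts as a letter.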

The requirements in R5RS for CHAR-ALPHABETIC? already make it
linguistic nonsense.   There's no hope for it.  Deprecating it is the
best thing.

    > Further, your definition does not capture the full range of what you
    > need to express when checking for this property; characters such as
    > dotless-i will be char-alphabetic? according to the definition above
    > while still capable of causing bugs with char-ci=? and case-blind
    > identifiers because they are not the preferred lowercase mappings of
    > their own preferred uppercase mappings.

I'm following the letter of the (deprecated, stupid) law.  R5RS does
_not_ require, _even_for_ CHAR-ALPHABETIC? _characters_, that:

	(char=? (char-downcase c) (char-downcase (char-upcase c)))
        => #t

Amazing but true.
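
The property in question can be tested one character at a time.  The
name DOWNCASE-ROUND-TRIPS? below is invented for this sketch.

```scheme
;; Hypothetical name, for illustration only.
;; True when downcasing C gives the same result as downcasing the
;; uppercase form of C.
(define (downcase-round-trips? c)
  (char=? (char-downcase c) (char-downcase (char-upcase c))))

(downcase-round-trips? #\a)
(downcase-round-trips? #\Q)
```

Under Unicode's simple case mappings, dotless i (U+0131) upcases to
"I", which downcases to "i", so an implementation that follows those
mappings would be expected to return #f for that character.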

    > All the latin alphabetic characters are included in the set of
    > cased characters, just as they are included in the worldwide set
    > of alphabetic characters.  What we are doing here is moving to a
    > superset of the currently defined set, so there is no more
    > upward compatibility issue in going to one superset than in
    > going to another. If the case requirements in R5RS are read as
    > applying to _cased_ characters, then all code presently extant
    > is conforming. If the case requirements in R5RS are read as
    > applying to _alphabetic_ characters, then all code presently
    > extant is conforming.

There is no need to introduce the (linguistically random) notion of
"cased character".   With the invariant I gave for CHAR-ALPHABETIC?,
correct, portable R5RS programs remain so.

    > I think the invariant you're trying to attach to char-alphabetic? does
    > not belong there.  Past standards writers have been looking at
    > [...]

Water under the bridge.   CHAR-ALPHABETIC? is broken.

R6RS should not attempt to provide comprehensive facilities for
Unicode text processing.   It should attempt to provide a minimum of
upward compatible character and string facilities which are a useful
_subset_ of Unicode text processing, close in informal meaning to what
the R5RS versions say.   My proposal does that.

    > Past standards writers have been looking at a restricted set of
    > characters in which all alphabetic characters were also cased,
    > and they made a requirement which is appropriate only to cased
    > characters, mistakenly calling the class of characters it should
    > be applied to "alphabetic" because there were no counterexamples
    > in the set of characters under consideration.  The requirement
    > is valuable, and we should keep it, but we need to apply it to
    > the set of characters to which it properly belongs, and simply
    > accept the fact that not all alphabetic characters are cased.

The CHAR-ALPHABETIC? invariant that I gave is consistent with an
implementation that defines it for truly alphabetic characters that
are "cased" in the sense you have been using.  It's consistent with
R5RS.  It's a hopeless cause to try to require more from
CHAR-ALPHABETIC? than that and deprecating CHAR-ALPHABETIC? is
necessary.

    > I believe that

    > (char-ci=? #\I #\dotless-i) => #f

    > Because

    > (char=? (char-downcase #\I) #\dotless-i) => #f.

It's interesting that you're advocating a behavior which is contrary
to Unicode recommendations _and_ not required by R5RS.

    > If you have a system in which #\dotless-i and #\i are both treated as
    > cased characters whose uppercase is #\I, then two identifiers, one
    > written using a dotted lowercase i and one written using a dotless i,
    > can be confused with one another in an implementation whose preferred
    > case is uppercase.

This is a topic for discussion of the draft called "Unicode
Identifiers".

As I say in that draft: I need to do a bit more investigation at the
library but I did look into this specific issue when I wrote the
draft.   _As_I_recall_ (so take it with a grain of salt), the Unicode
recommendations for case-insensitive programming-language identifiers
say that:

	I
and
	i

are the same identifier but that:

	<dotless i>

is a different identifier.   Go figure.

    > #\dotless-i and #\I therefore ought not be regarded
    > as char-ci=? in any system which also regards #\i and #\I as
    > char-ci=?.

In my proposed revisions for R6RS, CHAR-CI=? has no relationship to
identifier equivalence _except_ for identifiers spelled using only
portable characters.   I don't think there is any other choice there
that is consistent with Unicode best practices.

    > It is true that

    > (char=? (char-upcase #\dotless-i) (char-upcase #\i) #\I) => #t,

    > But given that you want to require

    > (char=? (char-downcase #\I) #\dotless-i) => #f and
    > (char=? (char-downcase #\I) #\i) => #t,

    > it is clearly unsupportable to choose #\dotless-i over #\i as the
    > lower case character which is char-ci=? to #\I.

Yup.  And, indeed, that seems to be the recommendation from the
Unicode Consortium.  "Things should be as simple as possible ...."

-t
