Repost: Permitting and Supporting Extended Character Sets: response.

bear 10 Feb 2004 02:15 UTC
<previously posted to srfi-50.  See discussion there.>

On Mon, 9 Feb 2004, Tom Lord wrote:

>    > From: bear <xxxxxx@sonic.net>
>
>    > The result of case-mapping via char-ci=? only on cased characters is
>    > that distinct identifiers written using these characters remain
>    > distinct no matter what the preferred case of the implementation
>    > is. That's the desirable, crucial property that I was trying to
>    > capture with the distinction between cased and uncased characters.
>
>I don't see why that property is crucial.
>
>Here's where I am on these things:  please have a look at the
>"References" section of the "Unicode Identifiers" draft.  The
>consortium has made recommendations for case-insensitive programming
>languages.   I think we should follow those and I don't think that
>they're consistent with what you are advocating.

What I am advocating is, very slightly, a superset of the Unicode
Consortium's recommendations.  A Scheme implementing that property
will be consistent with Unicode's recommendations, but Unicode's
recommendations are not entirely adequate to describe that property.

I believe that the additional restrictions become necessary because
the consortium's recommendations are not entirely adequate for the
needs of programming languages in which identifiers can be manipulated
as strings or result from calculation on strings, and where operations
such as string-ci=? are expected to be able to detect identifiers
which are "the same" identifier.  Scheme is such a language.

>    > I chose the properties of characters I called "cased" and "uncased"
>    > carefully; the distinctions they make are necessary and sufficient to
>    > allow implementations to detect which characters can safely be
>    > regarded as cased characters in the normal sense,
>
>I assume that you mean (Scheme) programs, not implementations.
>
>Programs can already detect which characters are naively cased in the
>sense of your terms.   That you are able to define your CHAR-CASED in
>a few lines of R5RS illustrates that.

While the definitions are implementable in a few lines of R5RS, the
point is that developers who realize they need such a predicate are
likely to implement it as you implemented your version of
char-alphabetic?: without realizing that the "simplest" definition
does not in fact describe characters having a one-to-one
correspondence between lowercase and uppercase forms, and therefore
is not sufficient to protect the portability of their programs.  It
is better to provide this definition in a standard and nail its
meaning down than to let its necessity drive the creation of many
incompatible and/or buggy versions.
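
To make the gap concrete, here is a rough sketch in Python rather than
Scheme, only because Python's built-in case mappings are easy to run.
Both function names are hypothetical, and the stricter test is just one
reading of "reciprocal one-to-one case mapping", not the exact
definition posted earlier.

```python
# Hypothetical names; neither function comes from the drafts or this thread.

def has_case_forms(c: str) -> bool:
    """The 'simplest' test: upcasing and downcasing give different results."""
    return c.upper() != c.lower()


def is_cased(c: str) -> bool:
    """Stricter test: c belongs to a reciprocal one-to-one case pair."""
    up, lo = c.upper(), c.lower()
    if len(up) != 1 or len(lo) != 1:
        return False  # the mapping expands to several characters
    if up == lo:
        return False  # no case distinction at all
    # c must be one member of the pair, and each member must map to the other.
    return c in (up, lo) and up.lower() == lo and lo.upper() == up


# a, A, dotless i, sharp s, Kelvin sign, digit one
for ch in ["a", "A", "\u0131", "\u00df", "\u212a", "1"]:
    print(f"U+{ord(ch):04X} simple={has_case_forms(ch)} strict={is_cased(ch)}")
```

Dotless i, sharp s and the Kelvin sign all pass the simple test and
fail the strict one, which is the kind of disagreement that separately
written versions of the predicate would likely end up with.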

> consider me to have written:
>	For example, a Unicode STRING->SYMBOL _may_ wish to not
>        canonicalize .....
>
>and my point stands.
>
>You'll want to take up this issue separately, in response to "Scheme
>Characters as (Extended) Unicode Codepoints", I think.
>
>
>    > IOW, because Macron and Cedilla are in different combining
>    > classes, the sequences A, Macron, Cedilla and A, Cedilla, Macron
>    > ought to be regarded as equal in a string comparison.
>
>Not by STRING=? in a Scheme in which the strings are regarded as
>codepoint sequences, since STRING=? is the equivalence relation
>induced by CHAR=?.

no...

(char=? #\A:Macron:Cedilla #\A:Cedilla:Macron) => #t

(= (char->integer #\A:Macron:Cedilla)
   (char->integer #\A:Cedilla:Macron)) => #t

(string=? "arf\(U+41:Macron:Cedilla)arf"
          "arf\(U+41:Cedilla:Macron)arf") => #t

string=? is in fact the equivalence relation induced by char=?.

You are, I expect, running into a problem I don't experience because
you prefer a representation that requires individual combining
codepoints to occupy separate, distinguishable locations in a string.
As a result you are setting up a situation in which
autocanonicalization cannot be done transparently.  Implementations
that conform to your proposal will need to take extra steps
(canonicalization, etc.) to conform to the consortium's definition of
string equality.
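
As an illustration of that extra step, here is a small Python sketch.
Python strings are codepoint sequences, so they behave like the
representation you prefer; the helper name canonical_equal is
hypothetical, and NFD is just one of the normalization forms that
would do for this comparison.

```python
import unicodedata

MACRON = "\u0304"   # COMBINING MACRON
CEDILLA = "\u0327"  # COMBINING CEDILLA

# The same base letter with the two marks typed in opposite orders.
s1 = "arf" + "A" + MACRON + CEDILLA + "arf"
s2 = "arf" + "A" + CEDILLA + MACRON + "arf"


def canonical_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition and reordering."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The marks are in different combining classes, so normalization may reorder them.
print(unicodedata.combining(MACRON), unicodedata.combining(CEDILLA))
print(s1 == s2)                 # plain codepoint-by-codepoint comparison
print(canonical_equal(s1, s2))  # comparison after canonicalization
```

The plain comparison sees two different codepoint sequences, while the
normalized comparison treats them as the same text.  A representation
that stores each base-plus-marks unit in canonical form would not need
the separate call.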

> And, incidentally, although that STRING=? is not the linguistically
> sensitive string-equality relation that Unicode defines, it _is_ a
> useful procedure to have around for _implementing_ Unicode text
> processes.

Please humor me by not banning schemes in which string=? can be both.

> _IF_ it were possible to define CHAR-ALPHABETIC? in a way which was
> both linguistically correct _and_ upwards compatible with R5RS then
> perhaps that would be almost a good idea.  I say "almost" because
> CHAR-IDEOGRAPHIC? and CHAR-SYLLABIC? add bloat and those plus
> CHAR-ALPHABETIC? fails to be a complete enumeration of letter
> types....

> But CHAR-ALPHABETIC? is just a botch.  It can not be rescued.  All
> of these character classes belong elsewhere, with different names --
> in a "Linguistic Text Processing" SRFI.

If you don't care to rescue it, then at least try to avoid abusing it
further.  I'd rather drop it altogether than force onto it this
case-mapping property that doesn't belong with it.

Char-alphabetic? is properly, and should be, of exactly the same
stature as char-ideographic?, char-syllabic?, (just remembered this)
char-phonemic?, and maybe a few others.  If any of these doesn't
belong in the standard, then none of them belongs in the standard.
They can be reintroduced as library procedures in a language-handling
library, if and when that becomes necessary.

Maybe char-letter?, completely devoid of case requirements, is in fact
all that the standard needs.

> A predicate to detect "cased" characters can be trivially
> synthesized from CHAR-UPCASE, CHAR-DOWNCASE, and CHAR=?.  I see no
> need for it to be required by R6RS.

It can be trivially synthesized, but more than half of the people who
do it will do it with slightly different semantics if a precise
definition is not given.

> Breaking CHAR-ALPHABETIC? in the way that you propose will not break
> correct portable programs whose _input_data_ consists only of
> portable characters, but it can break correct portable programs
> whose input data includes extended characters.  There is no
> particular reason to introduce that breakage.

Can you give an example?

> You are thinking that I am trying to make CHAR-ALPHABETIC?
> linguistically useful.  What I'm actually trying to do is to
> minimize the degree to which CHAR-ALPHABETIC? is linguistically
> useless.  The invariant above is in that spirit.

> The requirements in R5RS for CHAR-ALPHABETIC? already make it
> linguistic nonsense.  There's no hope for it.  Deprecating it is the
> best thing.

You may be right; I'd prefer to see it excised completely from the
standard rather than preserved with this bizarre case requirement.  I
consider it nonsensical to say "this character fails to behave
according to these expectations for cased characters and therefore we
will call it non-alphabetic even though it is part of an alphabet."

Even if previous editions of the standard presumed that all alphabetic
characters were cased, this is breakage.  You need to identify the set
of characters that behave as previous editions of the standard assumed
"alphabetic" characters behaved, but "alphabetic" is not the right
word to describe those characters. "Char-alphabetic?" should be simply
clarified NOT to be a description of case properties, although there
are no counterexamples to such a reading in the portable character
set.  As a general description of characters having these case
properties, a properly-named predicate should be introduced instead.
This permits "alphabetic" to retain its case semantics over at least
the portable character set (which is all that portable programs have
ever relied on) without abandoning its linguistic meaning.

>    > Further, your definition does not capture the full range of what you
>    > need to express when checking for this property; characters such as
>    > dotless-i will be char-alphabetic? according to the definition above
>    > while still capable of causing bugs with char-ci=? and case-blind
>    > identifiers because they are not the preferred lowercase mappings of
>    > their own preferred uppercase mappings.

>I'm following the letter of the (deprecated, stupid) law.  R5RS does
>_not_ require, _even_for_ CHAR-ALPHABETIC? _characters_, that:

>	(char=? (char-downcase c) (char-downcase (char-upcase c)))
>        => #t

>Amazing but true.

It does not require it explicitly but it depends on it for the correct
reading of identifiers which are not in the implementation's preferred
case.  Amazing but true.

>There is no need to introduce the (linguistically random) notion of
>"cased character".   With the invariant I gave for CHAR-ALPHABETIC?,
>correct, portable R5RS programs remain so.

The invariant you gave for CHAR-ALPHABETIC? is not merely
linguistically random. As applied to an extended character set, it is
linguistically wrong.  It is incorrect. It is false.

Moreover, it allows merging of identifiers which should not be merged,
when those identifiers contain characters which are CHAR-ALPHABETIC?
(your definition) but not CHAR-CASED? (my definition) and whose
case-mapping properties interact badly with the implementation's
preferred case.

Therefore, it does not have the properties you claim for it for all
possible character sets. It happens to have those properties for the
portable character set, but its definition is not adequate to assure
them.  If you desire those properties, you will have to use a
definition like the one I proposed for CHAR-CASED?, whatever you
choose to call it.
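
A small Python sketch of the merging problem, under the assumption
that an implementation folds identifiers with the full Unicode case
mappings (which is what Python's upper and lower do; a given Scheme
may differ).  The two fold functions are hypothetical stand-ins for a
reader's preferred case.

```python
def fold_upper(identifier: str) -> str:
    """Stand-in for a reader whose preferred case is uppercase."""
    return identifier.upper()


def fold_lower(identifier: str) -> str:
    """Stand-in for a reader whose preferred case is lowercase."""
    return identifier.lower()


a = "fix"        # ordinary dotted i
b = "f\u0131x"   # U+0131 LATIN SMALL LETTER DOTLESS I

print(a == b)                          # the source spellings differ
print(fold_lower(a) == fold_lower(b))  # lowercase-preferring reader
print(fold_upper(a) == fold_upper(b))  # uppercase-preferring reader
```

Under the lowercase fold the two identifiers stay distinct; under the
uppercase fold both become the same identifier.  A program using both
names would therefore mean different things on the two
implementations.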

>R6RS should not attempt to provide comprehensive facilities for
>Unicode text processing.   It should attempt to provide a minimum of
>upward compatible character and string facilities which are a useful
>_subset_ of Unicode text processing, close in informal meaning to what
>the R5RS versions say.   My proposal does that.

I do not believe that it does. Setting aside for the moment the fact
that attaching the case invariants to char-alphabetic? is incorrect,
you have not identified the correct set of case invariants needed for
case-insensitive identifiers to remain distinct in correct,
portable programs.

>The CHAR-ALPHABETIC? invariant that I gave is consistent with an
>implementation that defines it for truly alphabetic characters that
>are "cased" in the sense you have been using.  It's consistent with
>R5RS.  It's a hopeless cause to try to require more from
>CHAR-ALPHABETIC? than that and deprecating CHAR-ALPHABETIC? is
>necessary.

The invariant you gave is necessary, but not sufficient.  It
identifies characters which have both lowercase and uppercase
forms, but it does not identify characters which are part of a
reciprocal 1-to-1 case mapping.

				Bear