Re: the discussion so far bear 19 Jul 2005 09:30 UTC


On Tue, 19 Jul 2005, Alex Shinn wrote:

> This seems to me a very disturbing precedent, if R6RS will deliberately
> specify "the wrong thing" just because it cannot figure out "the right
> thing."  In a case like this, I think it would be better not to specify
> anything.

It would certainly be more consistent with the spirit of past
RnRS reports to avoid specifying anything.  The current committee
seems to have a different goal than past committees, though.  If
their goal is making portable code possible rather than waiting
for consensus and specifying only the Right Thing, they will be
making the same kind of compromises that, say, the Common Lisp
standard makes, just in order to specify *something*.

> So far, the following conflicts exist between R5RS (ASCII) and Unicode:

> 1) Case operations.  In ASCII, CHAR-*CASE and STRING-*CASE can be
> defined in terms of one another, and either can be used to define
> CHAR-CI=?  or STRING-CI=?, or these can be provided separately.  In
> ASCII, at the very least STRING-CI=? is required for basic parsing
> of text formats (though in reality it is possible that such parsing
> will be done directly on octet-arrays rather than native strings).
> CHAR-CI=?  and CHAR-*CASE can be used to improve efficiency in
> low-level libraries such as parsers and regex matchers.

This problem is much less severe if your characters are grapheme
clusters.  But thanks to ligatures (which cannot appear in
compatibility-normalized strings) and eszett (which, unfortunately,
can) it is not completely eliminated by moving to grapheme clusters.
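
A quick illustration of the eszett and ligature cases.  This is only
a sketch: it uses Python's built-in string methods (which apply the
full Unicode case mappings) as a stand-in for whatever a Scheme
implementation would provide, and the variable names are made up for
the example.

```python
# Full Unicode case mapping is not one-character-to-one-character.
eszett = "\u00df"      # LATIN SMALL LETTER SHARP S
ligature = "\ufb01"    # LATIN SMALL LIGATURE FI

upper_eszett = eszett.upper()        # one codepoint becomes two
upper_ligature = ligature.upper()    # likewise for the ligature

# Upcasing and then downcasing does not give back the original.
round_trip = upper_eszett.lower()
print(round_trip == eszett)

# Downcasing both sides is not enough for a case-insensitive compare;
# case folding is a separate operation.
same_by_lower = "stra\u00dfe".lower() == "STRASSE".lower()
same_by_fold = "stra\u00dfe".casefold() == "STRASSE".casefold()
print(same_by_lower, same_by_fold)
```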

> In Unicode, the character-level operations have no meaning.  It is
> in fact far, far worse to provide character-level operations that
> "do as much as they can" given the Unicode data for single
> characters, because then you encourage people to use this feature,
> but by definition any program that performs case operations at the
> character level is broken and can never be fixed or extended.  Also,
> STRING-CI=?  cannot be defined in terms of STRING-*CASE, and both
> are locale-specific procedures.

Right; substrings that aren't valid strings, or which combine into
something that isn't the original string, can result when you split
grapheme clusters.  This happens when you take substrings on arbitrary
codepoint boundaries, or do buffered operations on arbitrary codepoint
boundaries, or any of a number of other things.  But these are
problems that go away if your characters are grapheme clusters.
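
To make the failure concrete, here is a small sketch in Python (used
only because its strings are codepoint sequences; nothing here is a
proposed API).  It cuts a decomposed string between a base letter and
its accent, then normalizes the two pieces separately, as a buffered
reader might.

```python
import unicodedata

def nfc(s):
    """Normalize to NFC using the standard unicodedata module."""
    return unicodedata.normalize("NFC", s)

# "resume" with two combining acute accents (8 codepoints).
original = "re\u0301sume\u0301"

# Cut on an arbitrary codepoint boundary: the accent lands in the tail.
head, tail = original[:2], original[2:]

# The tail now begins with a combining mark that has no base character.
tail_starts_with_mark = unicodedata.combining(tail[0]) != 0

# Normalizing the pieces and joining them is not the same as
# normalizing the whole string.
pieces_joined = nfc(head) + nfc(tail)
whole = nfc(original)
print(pieces_joined == whole)
```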

> There are two ways to reconcile these differences.  One is to unify the
> procedures, the other is to create separate procedures.  Unified should
> be something like:

>  CHAR-*CASE, CHAR-CI=?
>    - as in R5RS
>    - folds ASCII *only* (please don't encourage bad code)

I don't believe in this.  If you're going to limit it to ASCII,
then 'ascii' ought to be in its name.
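
For instance, an ASCII-only operation might look like the sketch
below.  The name ascii_char_upcase is hypothetical, and Python is
used only for illustration; the point is that the restriction is
visible at every call site.

```python
def ascii_char_upcase(ch: str) -> str:
    """Upcase one character, folding the 26 ASCII letters only.

    Every other character, including non-ASCII letters, is returned
    unchanged.
    """
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    if "a" <= ch <= "z":
        return chr(ord(ch) - 32)   # 'a'..'z' sit 32 above 'A'..'Z'
    return ch

print(ascii_char_upcase("q"), ascii_char_upcase("\u00e9"))
```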

>  STRING-*CASE, STRING-CI=?
>    - takes optional locale argument (which may be ignored)
>    - guaranteed to work at least on ASCII characters as in R5RS
>    - optionally Unicode aware

The thing is, if underspecified these operations will be
nearly useless.  Portable code will be unable to rely on
them doing any particular thing.

> Normalization.  Implementations should be allowed to internally keep
> all strings normalized, and/or perform normalized comparisons for
> the likes of STRING=?.  Unlike subjective collation, one should
> generally want to treat strings as identical to their normalized
> forms, however in implementing certain low-level libraries (such as
> encoding converters) one may want explicit control of code-point
> equivalence.  Furthermore one may want library procedures to convert
> between different normalized forms, however if the implementation
> always maintains a specific internal normalization, conversions to
> other normalizations by definition cannot return native strings.

It's my opinion that the only way to make normalization transparent to
the programmer and user is to use grapheme-cluster characters instead
of codepoint characters.  Normalization consists in altering codepoint
sequences within grapheme clusters only; if this is your character
unit, then it can be done without disrupting character indexes or
counts, saving everyone a lot of headaches.
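
A rough sketch of why counts and indexes stay put.  The clusters
function below is a hypothetical helper and a deliberate
simplification: it only attaches combining marks to the preceding
base codepoint, whereas real grapheme cluster segmentation (UAX #29)
has more rules.  Python is used purely for illustration.

```python
import unicodedata

def clusters(text):
    """Very rough grapheme-cluster split: each base codepoint plus
    any combining marks that follow it."""
    result = []
    for cp in text:
        if result and unicodedata.combining(cp) != 0:
            result[-1] += cp      # attach the mark to the current cluster
        else:
            result.append(cp)     # start a new cluster
    return result

composed = unicodedata.normalize("NFC", "caf\u00e9")
decomposed = unicodedata.normalize("NFD", "caf\u00e9")

# Codepoint counts differ between the two normalization forms...
print(len(composed), len(decomposed))
# ...but the number of clusters, and the index of each, does not.
print(len(clusters(composed)), len(clusters(decomposed)))
```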

One thing about normalization: ligatures do not exist in
compatibility-normalized (NFKC or NFKD) text, because they have
compatibility decompositions.  If you are using grapheme clusters
normalized that way as characters, then there are no ligatures, and
most of SpecialCasing.txt specifies upcase and downcase operations
that do in fact take one character and return one character (alas,
they are not all reciprocal relationships, or the problem would be
truly simple).  The sole exception is eszett.

As for conversion between different normalized forms, I think that the
Unicode normalization form is properly a property of the port through
which data is read or written.  The port reads codepoints in some
normalization form, and delivers _characters_ represented according to
the abstraction you use internally.  Likewise, it accepts abstract
characters and writes codepoints in some normalization form.  This
introduces a distinction between text ports (which read and write
characters, full stop) and binary ports (which read and write octets).
If you want to read or write characters on a binary port, you *SHOULD*
have to state explicitly what encoding to use.
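
The port idea could be sketched roughly as follows.  Everything here
is an assumption made for illustration: the class name
NormalizingTextPort, its parameters, and the choice to read the whole
stream at once (a real port would have to buffer without splitting a
cluster).  Python's io and unicodedata modules stand in for the
underlying binary port and the normalizer.

```python
import io
import unicodedata

class NormalizingTextPort:
    """Hypothetical text port layered on a binary port.

    The encoding has no default: the caller must name it.  The
    normalization form used on the wire belongs to the port, not to
    the strings that pass through it.
    """

    def __init__(self, binary_port, encoding,
                 external_form="NFC", internal_form="NFC"):
        self._binary = binary_port
        self._encoding = encoding
        self._external = external_form
        self._internal = internal_form

    def read_text(self):
        # Decode octets, then deliver text in the internal form.
        raw = self._binary.read()
        return unicodedata.normalize(self._internal,
                                     raw.decode(self._encoding))

    def write_text(self, text):
        # Convert to the port's external form, then encode to octets.
        external = unicodedata.normalize(self._external, text)
        self._binary.write(external.encode(self._encoding))

# Write through a port that emits NFD, read back through one that
# delivers NFC.
wire = io.BytesIO()
NormalizingTextPort(wire, "utf-8", external_form="NFD").write_text("caf\u00e9")
wire_bytes = wire.getvalue()

in_port = NormalizingTextPort(io.BytesIO(wire_bytes), "utf-8")
round_tripped = in_port.read_text()
print(round_tripped == "caf\u00e9")
```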

> Definition of a character.  Rather than pure code-point iterators,
> one may wish to define operations on sequences of code-points
> representing higher-level linguistic components (e.g. join accent
> marks) or even glyphs (returning ligatures, or consonant plus vowels
> for scripts like Thai).  Fortunately these can be defined in terms
> of Unicode code-points, so they can be relegated to future
> libraries.

Well, I'm going to argue pretty strenuously in favor of grapheme
clusters as characters, for all the reasons mentioned above.

> It should be kept in mind that we will likely always have Schemes
> implementing the full range of Unicode support, from nothing to
> complete (and then some), just as we have variation in the range of
> number tower support.  However, unlike in the number tower, the
> ASCII vs. Unicode distinctions are semantically different, and even
> in an implementation supporting all of Unicode, you may wish to use
> ASCII-specific procedures which ignore or possibly even throw an
> error in the presence of non-ASCII data.  Moreover, there is a very
> significant overhead in the Unicode versions, both in speed and in
> memory usage.  Because using an ASCII-level operation makes clear
> what you are doing (non-linguistic parsing), and because it can be
> significantly more efficient, I'm leaning towards providing separate
> procedures.

I'd tend to agree.  If the encoding is an important part of the
definition of any operation, I'd suggest strongly that the name
of the encoding appear in the name of the operation.  Thus,
char->integer returns "an integer" but if you want to specify
a routine that returns "a Unicode codepoint" I think its name
should be char->unicode.
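
As a sketch of the naming convention only (Python standing in for
Scheme, with char_to_unicode and unicode_to_char as hypothetical
spellings of char->unicode and its inverse):

```python
def char_to_unicode(ch: str) -> int:
    """Return the Unicode codepoint of a one-codepoint character."""
    if len(ch) != 1:
        raise ValueError("expected exactly one codepoint")
    return ord(ch)

def unicode_to_char(codepoint: int) -> str:
    """Inverse of char_to_unicode; rejects values that are not
    Unicode scalar values (out of range, or surrogates)."""
    if not 0 <= codepoint <= 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    return chr(codepoint)

print(hex(char_to_unicode("\u03bb")))
```

The name promises a specific encoding, so callers can rely on the
result being a codepoint rather than "an integer" of unspecified
meaning.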

			Bear