On 7/16/05, Matthew Flatt <xxxxxx@cs.utah.edu> wrote:
>
> I am personally convinced (by this discussion and by past experience) that `string-ci=?' as defined in the SRFI is not what you really want under most circumstances.  But it's often a good approximation.  I think that Scheme needs at least an operation like `string-ci=?' for portable programs, something like it will exist in most implementations, it's simple to implement, and it's consistent with the rest of the proposal -- so it still seems right to me to put it in the SRFI, despite its many flaws.

This seems to me a very disturbing precedent, if R6RS will deliberately specify "the wrong thing" just because it cannot figure out "the right thing."  In a case like this, I think it would be better not to specify anything.

So far, the following conflicts exist between R5RS (ASCII) and Unicode:

1) Case operations.

In ASCII, CHAR-*CASE and STRING-*CASE can be defined in terms of one another, and either can be used to define CHAR-CI=? or STRING-CI=?, or these can be provided separately.  In ASCII, at the very least STRING-CI=? is required for basic parsing of text formats (though in reality it is possible that such parsing will be done directly on octet-arrays rather than native strings).  CHAR-CI=? and CHAR-*CASE can be used to improve efficiency in low-level libraries such as parsers and regex matchers.

In Unicode, the character-level operations have no meaning.  It is in fact far, far worse to provide character-level operations that "do as much as they can" given the Unicode data for single characters, because then you encourage people to use this feature, but by definition any program that performs case operations at the character level is broken and can never be fixed or extended.  Also, STRING-CI=? cannot be defined in terms of STRING-*CASE, and both are locale-specific procedures.

There are two ways to reconcile these differences: one is to unify the procedures, the other is to create separate procedures.  Unified should be something like:

  CHAR-*CASE, CHAR-CI=?
    - as in R5RS
    - folds ASCII *only* (please don't encourage bad code)

  STRING-*CASE, STRING-CI=?
    - takes an optional locale argument (which may be ignored)
    - guaranteed to work at least on ASCII characters as in R5RS
    - optionally Unicode aware

The alternative would be to provide the above procedures as ASCII-centric only, and to provide new procedures for linguistic case operations, such as:

  TEXT-*CASE str [locale]
  TEXT-FOLD-CASE str [locale]     ; faster for multiple comparisons
  TEXT-CI=? str1 str2 [locale]

2) Collation.

This is insanely complicated.  Apart from being highly locale-dependent, there are an infinite number of variations one may want, such as ignoring a leading "The" in English, sorting sequences of digits as numbers, and so on.  In Japanese, collation is an AI-complete problem, and attempts to do it properly could require options such as which dictionary to use, among others.

To unify these one would expect something like

  STRING<? a b [keyword-arguments ...]

or, as separate procedures,

  TEXT<? a b [keyword-arguments ...]

and/or something like

  STRING-COLLATE list-of-strings [keyword-arguments ...]

Here it is very important to be able to sort lexically on code-points for use in efficient and portable search-tree and database algorithms.  Therefore, if STRING<? is overloaded, one should expect a keyword argument such as

  (STRING<? a b 'lexical: #t)

to force this standardized sorting.
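As a rough illustration only, a TEXT<? overloaded this way might dispatch as in the following sketch.  The %CODE-POINT<? and %LOCALE-COMPARE helpers are hypothetical (no standard provides them), as is the convention of scanning the rest-argument list for the keyword:

  ;; Minimal sketch of a TEXT<? accepting a 'lexical: keyword.
  ;; %CODE-POINT<? is assumed to be strict code-point comparison
  ;; (suitable for search trees and databases); %LOCALE-COMPARE is
  ;; assumed to be a locale-sensitive comparison returning a
  ;; negative, zero, or positive number.
  (define (text<? a b . args)
    (let ((lexical? (cond ((memq 'lexical: args) => cadr)
                          (else #f))))
      (if lexical?
          (%code-point<? a b)                   ; portable, stable order
          (negative? (%locale-compare a b)))))  ; linguistic order

  ;; (text<? "apple" "Banana" 'lexical: #t)  => #f  (#\B < #\a in ASCII)
  ;; (text<? "apple" "Banana")               => #t  in an English locale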
3) Normalization.

Implementations should be allowed to keep all strings internally normalized, and/or to perform normalized comparisons for the likes of STRING=?.  Unlike subjective collation, one should generally want to treat strings as identical to their normalized forms; however, in implementing certain low-level libraries (such as encoding converters) one may want explicit control of code-point equivalence.  Furthermore, one may want library procedures to convert between the different normalization forms; however, if the implementation always maintains a specific internal normalization, conversions to other normalizations by definition cannot return native strings.  There is a lot left to discuss and experiment with in this area, but for the time being separating STRING=? from TEXT=? may be worth considering.

4) Definition of a character.

Rather than pure code-point iterators, one may wish to define operations on sequences of code-points representing higher-level linguistic components (e.g. joining accent marks) or even glyphs (returning ligatures, or consonant-plus-vowel clusters for scripts like Thai).  Fortunately these can be defined in terms of Unicode code-points, so they can be relegated to future libraries.

It should be kept in mind that we will likely always have Schemes implementing the full range of Unicode support, from nothing to complete (and then some), just as we have variation in the range of number-tower support.  However, unlike in the number tower, the ASCII vs. Unicode distinctions are semantically different, and even in an implementation supporting all of Unicode, you may wish to use ASCII-specific procedures which ignore non-ASCII data or possibly even throw an error in its presence.  Moreover, there is a very significant overhead in the Unicode versions, both in speed and in memory usage.

Because using an ASCII-level operation makes clear what you are doing (non-linguistic parsing), and because it can be significantly more efficient, I'm leaning towards providing separate procedures.

--
Alex
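P.S.  To make the STRING=? vs. TEXT=? split in 3) concrete, here is a minimal sketch.  The %STRING-NORMALIZE procedure is hypothetical (no standard Scheme provides it); assume it returns a new string in the requested normalization form:

  ;; STRING=? stays strict code-point equality (what an encoding
  ;; converter needs); TEXT=? compares normalized forms, so a
  ;; precomposed character and its decomposed base-plus-combining-mark
  ;; sequence compare as equal.
  (define (text=? a b)
    (string=? (%string-normalize a 'nfc)
              (%string-normalize b 'nfc)))

  ;; e.g. LATIN SMALL LETTER E WITH ACUTE (U+00E9) vs. "e" followed
  ;; by COMBINING ACUTE ACCENT (U+0301):
  ;;   STRING=?  => #f  (different code-point sequences)
  ;;   TEXT=?    => #t  (identical NFC forms)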