Re: Issues with Unicode
bear, 10 May 2006 01:58 UTC

I'm going to try to organize what I have to say and just
make it a bullet list instead of responding to a half-dozen
different things.

Immutable strings - With Unicode and threads, they're the only
viable implementation strategy.  The hairy underpinnings of a
threaded Unicode environment, if you don't want locking to
absolutely thrash, will have to use purely functional structures
and lock-free algorithms.

But the interface is not the implementation.  Once you've
done the legwork for immutable strings, providing
string-set!  and similar is a very short further trip.
Constructing a new string every time a mutation happens
seems like a big chore, but if you're using a good
implementation, you'll be able to do it in O(1) or O(log(n))
with a log base that's pretty large, and reuse 99%+ of the
structure of the extant string.
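
To make that concrete, here is a minimal sketch (my own toy, not
any particular implementation) of a string held as a tree of
small leaf chunks.  A "mutation" copies only the chunk the
character lives in plus the spine above it, and shares every
other chunk with the original; a real system would use a much
wider fan-out than two, which is where the large log base comes
from.

    (define (leaf s)    (list 'leaf s))
    (define (node l r)  (list 'node
                              (+ (tree-length l) (tree-length r))
                              l r))

    (define (tree-length t)
      (if (eq? (car t) 'leaf)
          (string-length (cadr t))
          (cadr t)))

    (define (tree-ref t i)
      (if (eq? (car t) 'leaf)
          (string-ref (cadr t) i)
          (let ((left (caddr t)))
            (if (< i (tree-length left))
                (tree-ref left i)
                (tree-ref (cadddr t) (- i (tree-length left)))))))

    ;; Returns a new tree; the old one is untouched.
    (define (tree-set t i ch)
      (if (eq? (car t) 'leaf)
          (let ((s (string-copy (cadr t))))  ; copy one small chunk
            (string-set! s i ch)
            (leaf s))
          (let ((left (caddr t)) (right (cadddr t)))
            (if (< i (tree-length left))
                (node (tree-set left i ch) right)
                (node left
                      (tree-set right
                                (- i (tree-length left)) ch))))))

    ;; e.g. (tree-ref (tree-set (node (leaf "hel") (leaf "lo")) 4 #\O) 4)
    ;;      => #\O, while the original tree still reads "hello".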

Under the heading of "what is efficient should be very easy
to use", and given that immutable strings *will* be more
efficient in Unicode-enabled multithreaded schemata, we need
a family of string functions that explicitly return new
strings leaving the extant string unchanged rather than
doing any kind of mutation.  But there is plenty of time for
this; we don't have to get them into R6RS, and in fact we
need R6RS' Unicode specs before they can be written.
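
As a strawman for what one of those functions might look like (the
name is mine and purely illustrative; this naive version copies
the whole string, where a structure-sharing representation like
the sketch above would not have to):

    ;; Like string-set!, but returns a fresh string and leaves its
    ;; argument alone.
    (define (string-set str k ch)
      (string-append (substring str 0 k)
                     (string ch)
                     (substring str (+ k 1) (string-length str))))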

Removing string-set! would be way too much of a flag-day for
existing scheme code.  You'd have to spend R7RS implementing
all kinds of functional-string operations that we don't have
now because they're not sensible for mutable array-based
strings, add some you probably missed in R8RS, start
deprecating string-set!  in R9RS, and then you can *think*
about making strings immutable starting with R10RS or R11RS.
It would be far more sensible for a new dialect of lisp to
just not have string-set! than for scheme to try to get rid
of it.

Someone has asserted that indexing by grapheme cluster
implies stepping and is O(n).  This is only true for strings
implemented as C implements them.  It is perfectly feasible
to have characters inside your implementation always
represented by constant-width values, using a subrange of
those values as indexes into a table of interned characters.
You intern any characters whose codes fall into that subrange and
any characters too wide for your fixed-width units.  Then you
index in O(1) and don't worry at all about whether some
characters are big.  Of course, in such an implementation
indexing by codepoint would be problematic... the
double-indexing (both codepoint and grapheme cluster are
indexed) methods I can think of right now are all O(log(n)).
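
Here is a sketch of the bookkeeping, purely illustrative (the
cutoff value, table size, and procedure names are my own choices,
and a real implementation would also deduplicate clusters before
interning a new one):

    ;; Grapheme clusters stored as fixnum codes.  Codes below
    ;; *intern-base* are a single ordinary codepoint; codes at or
    ;; above it index a table of interned clusters (combining
    ;; sequences, or anything too wide for one fixed-width cell).
    (define *intern-base*  #x110000)   ; just past the Unicode range
    (define *intern-table* (make-vector 4096 #f))
    (define *next-intern*  0)

    (define (intern-cluster codepoints)   ; a list of integers
      (vector-set! *intern-table* *next-intern* codepoints)
      (set! *next-intern* (+ *next-intern* 1))
      (+ *intern-base* (- *next-intern* 1)))

    (define (cluster->codepoints code)
      (if (< code *intern-base*)
          (list code)
          (vector-ref *intern-table* (- code *intern-base*))))

    ;; With a string stored as a vector of such codes, indexing by
    ;; grapheme cluster is a single vector-ref: O(1).
    (define (gstring-ref gstr k) (vector-ref gstr k))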

Regarding what ought to be legal as an identifier: I think
control characters, whitespace (properties Zs, Zl, Zp) and
delimiters (properties Ps and Pc) ought not to appear in
identifiers.  I wouldn't be at all upset if a standard also
forbade combining characters; after all, identifiers and
symbol names don't need the full functionality of strings.
I wouldn't be at all upset if a standard also forbade all
characters not yet assigned as of Unicode 4.1.0, with
the implication that this forbidding would be permanent
across Scheme report revisions, even though later Unicode
versions doubtless will come along.
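
Assuming some char-general-category procedure is available that
returns Unicode general-category symbols (Lu, Zs, Ps, and so on -
an assumption on my part, not something the current report gives
you), the whole rule is a short predicate; the last line is the
optional "also forbid combining characters" part:

    (define (identifier-char? ch)
      (let ((cat (char-general-category ch)))
        (not (memq cat '(Cc            ; control characters
                         Zs Zl Zp      ; whitespace
                         Ps Pc         ; the delimiter properties above
                         Mn Mc Me))))) ; combining marks (optional)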

I think you ought to be able to tell in the first three
characters that a symbol is not a number.  Right now the
standard says the first character alone, and I'm okay with
that too.

I think that the standard needs to explicitly specify a
method for partitioning the set of all strings into a few
groups:

   * things an implementation is required to accept as identifiers
   * things an implementation is free to accept as identifiers
   * things an implementation may not accept as numbers or identifiers
      (reserved for non-numeric extensions and unspecified kinds of data)
   * things an implementation is free to accept as numbers
   * things an implementation is required to accept as numbers

   (etc... required and permitted syntaxes for booleans,
characters, and other constant formats.)

The first category would be simple, like the BNF from RNRS.
The second category would be things that, *if* they appear
in a program, *must* be either identifiers or syntax errors.
You shove 'em off here because some folks may want them as
identifiers whilst others hate what the pattern-matching for
them does to their parsers.  (An example would be my
alternate three-character rule for determining that
something isn't a number.)  The third category is for
extensions and unspecified kinds of data; here's where I'd
be using those openings and closings and initial and final
punctuations I wanted kept out of identifiers.  The fourth
category would be for classes of strings reserved for
numeric extensions and numeric constant formats that
implementations aren't required to implement.  And the fifth
would be for numeric formats that all implementations *are*
required to implement.
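
Just to make the shape of the thing concrete, here is a toy
classifier.  The two predicates are stand-ins of my own, not
proposed grammars; the real report would pin those down:

    (define (digit? c)
      (and (char<=? #\0 c) (char<=? c #\9)))

    (define (plain-integer? tok)        ; always a number
      (and (> (string-length tok) 0)
           (let loop ((i 0))
             (or (= i (string-length tok))
                 (and (digit? (string-ref tok i))
                      (loop (+ i 1)))))))

    (define (identifier-initial? tok)   ; checks only the first char
      (and (> (string-length tok) 0)
           (let ((c (string-ref tok 0)))
             (or (char-alphabetic? c)
                 (memv c '(#\! #\$ #\% #\& #\* #\/ #\: #\<
                           #\= #\> #\? #\^ #\_ #\~))))))

    (define (classify-token tok)
      (cond ((plain-integer? tok)       'number/required)
            ((identifier-initial? tok)  'identifier/required)
            (else                       'implementation-defined)))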

Right now we have a minimal set of strings guaranteed to be
legal identifiers (the BNF for symbols in R5RS) and a
very minimal set of numeric syntax that's required to be
accepted (a limited range of integers, with the limit
unspecified). And every implementation makes its own calls
as to how to interpret all other tokens.  21/22 may be an
identifier in a scheme that doesn't support rationals, and
#q2+2i+j+4k may be a number in a scheme that supports
quaternions, and #33.28 may be a boolean in a system with
fuzzy logic extensions.  There is absolutely *nothing* you
can do to define a syntactic class of objects that is
guaranteed to *not* step on number or identifier syntax in
any scheme.  This is a mess; it inhibits innovation and
introduces bugs into code that works fine elsewhere, and the
standard *does* need to help sort it out.