Re: Update, near finalization bear 13 Apr 2005 18:44 UTC


On Mon, 11 Apr 2005, Aubrey Jaffer wrote:

>But this pissing contest should be largely irrelevant to R6RS --
>SRFI-63 is more capable (uniform arrays), better integrated with R5RS
>(specifying vector, string, and EQUAL? behavior), compatible with
>SRFI-58 array syntax, and better designed than SRFI-25.

Actually, I think that specifying string behavior in a document
about arrays is a mistake.

The operations we want to do on strings are in many cases
fundamentally different from the operations that are efficient to do
on arrays.  For example, inserting or deleting text at an arbitrary
point in the string are two of the most fundamental operations in text
editing, and if strings are represented as arrays, they require the
program to touch all subsequent characters in the string every time
one character is inserted or deleted.

The results can be clearly seen in many editors.  Load something even
a little bit hefty like, say, Unidata.txt, into your favorite editor
and, even with a nice fast cpu, inserting or deleting text near the
beginning will lag significantly.

This happens because the authors of text editors, mostly, looked
around for the fundamental abstraction that most closely matched the
data they wanted to work with, and found strings.  Rather than
reinvent the (in this case square) wheel, they blithely abstracted
operations that hide gross inefficiencies and use them to build yet
higher and more grossly inefficient abstractions, because, hey,
strings are "free" in terms of implementation effort, and nobody wants
to reinvent the wheel, even if it's a square wheel.  So strings, which
the authors of the language probably thought about in terms of a
thousand characters or less, are increasingly used to represent
multi-million character documents, and array strings simply don't fit
the requirements.

If you specify that strings must be arrays, you are forbidding
implementors from using more appropriate data structures (such as
balanced trees) to represent them.  You are also strongly implying
that characters must be of a fixed bit width, which forbids
implementors from creating characters at a level of abstraction higher
than individual codepoints. In the case of unicode, codepoints
frequently aren't proper characters at all, but modifier letters or
nonspacing accents that make no semantic sense by themselves.  Thus,
operations like taking or inserting a substring, if it cuts between
codepoints that belong to the same character, can easily create
apparent nonsense at the beginning or end of the substring.  This is
an error, easy to make, that ought not be the default behavior.  Also,
there are many different sequences of unicode codepoints that express
the same actual character, and these are also either needless code
complexity or mistakes waiting to happen, or both, in string equality
comparisons and similar operations.

These are, IMO, errors that people ought to be, at least by default,
protected from making.  These should not be the normal behavior of
strings; And in asserting that strings must be arrays, you are
requiring an implementation that requires these error-prone situations
to be exposed to the programmer.

				Bear