Re: Update, near finalization bear 13 Apr 2005 18:44 UTC
On Mon, 11 Apr 2005, Aubrey Jaffer wrote: >But this pissing contest should be largely irrelevant to R6RS -- >SRFI-63 is more capable (uniform arrays), better integrated with R5RS >(specifying vector, string, and EQUAL? behavior), compatible with >SRFI-58 array syntax, and better designed than SRFI-25. Actually, I think that specifying string behavior in a document about arrays is a mistake. The operations we want to do on strings are in many cases fundamentally different from the operations that are efficient to do on arrays. For example, inserting or deleting text at an arbitrary point in the string are two of the most fundamental operations in text editing, and if strings are represented as arrays, they require the program to touch all subsequent characters in the string every time one character is inserted or deleted. The results can be clearly seen in many editors. Load something even a little bit hefty like, say, Unidata.txt, into your favorite editor and, even with a nice fast cpu, inserting or deleting text near the beginning will lag significantly. This happens because the authors of text editors, mostly, looked around for the fundamental abstraction that most closely matched the data they wanted to work with, and found strings. Rather than reinvent the (in this case square) wheel, they blithely abstracted operations that hide gross inefficiencies and use them to build yet higher and more grossly inefficient abstractions, because, hey, strings are "free" in terms of implementation effort, and nobody wants to reinvent the wheel, even if it's a square wheel. So strings, which the authors of the language probably thought about in terms of a thousand characters or less, are increasingly used to represent multi-million character documents, and array strings simply don't fit the requirements. If you specify that strings must be arrays, you are forbidding implementors from using more appropriate data structures (such as balanced trees) to represent them. You are also strongly implying that characters must be of a fixed bit width, which forbids implementors from creating characters at a level of abstraction higher than individual codepoints. In the case of unicode, codepoints frequently aren't proper characters at all, but modifier letters or nonspacing accents that make no semantic sense by themselves. Thus, operations like taking or inserting a substring, if it cuts between codepoints that belong to the same character, can easily create apparent nonsense at the beginning or end of the substring. This is an error, easy to make, that ought not be the default behavior. Also, there are many different sequences of unicode codepoints that express the same actual character, and these are also either needless code complexity or mistakes waiting to happen, or both, in string equality comparisons and similar operations. These are, IMO, errors that people ought to be, at least by default, protected from making. These should not be the normal behavior of strings; And in asserting that strings must be arrays, you are requiring an implementation that requires these error-prone situations to be exposed to the programmer. Bear