Issues with Unicode
Jonathan S. Shapiro
(23 Apr 2006 08:55 UTC)
Re: Issues with Unicode Taylor R. Campbell (26 Apr 2006 07:50 UTC)
   Date: Tue, 25 Apr 2006 20:27:07 -1000 (HST)
   From: Shiro Kawai <xxxxxx@lava.net>

   Alternative implementations of strings have been discussed in this
   list, and some threads in comp.lang.scheme, I think. I'd like to
   draw attention to one point which hasn't been raised, IIRC. (Maybe
   it is too trivial and everybody knows about it; if so, sorry for
   the noise.)

I can't recall whether this ever came up here, but last year, when
this SRFI was still fresh and under heavy discussion, I wrote up an
alternative proposal for a Unicode-supporting -- although *not*
Unicode-mandating -- string API, where strings are collections of
grapheme clusters indexed by opaque cursors, not character indices,
and whose binary encoding is separated into BLOB->STRING and
STRING->BLOB[!] procedures and abstracted by text codec descriptors.
The text of the document is here:
<http://mumble.net/~campbell/proposals/alt-text.text>. It came out of
many extensive discussions with John Cowan, Jorgen Schaefer, and
probably a number of other persons whom I've forgotten by now. Here
are some of the most important points about it, off the top of my
head:

1. It doesn't require Unicode support, for instance in the Scheme
   system that runs on your doorknob. More seriously, the API simply
   does not specify anything about particular code point mappings or
   text codecs other than that ASCII must be supported.

2. It's high-level. We can sweep things like normalization wholly
   under the rug with it. We needn't mandate a particular internal
   string representation; the API would work just as well with all
   strings as UTF-8 strings internally, as with all strings as UTF-32
   strings internally, as with all strings as pairs of text codec and
   actual storage internally.

3. Further on the point that we can use UTF-8 internally: not only
   does it permit efficient variable-width string representations
   such as UTF-8 -- because strings are indexed by opaque cursors
   which may be stepped as octet indices for constant-time access,
   while natural number indices of characters would require O(n)
   access time --, but higher-level text structures, such as grapheme
   clusters or words or sentences or paragraphs, would require
   explicit stepping like with string cursors anyway, or O(n) access
   times.

4. Strings are immutable. The application of mutability in old
   R5RS-style strings was extremely limited, anyway: you can change
   existing characters, but you can't insert or delete, so you can't,
   say, change a whole _word_ in a string, if the substitute has a
   length different from the original. It just so happened that all
   characters had the same width in all practical implementations, so
   we could swap in new ones as we pleased, but this assumption
   doesn't hold up very well if we want to extend our text-processing
   capabilities beyond that limited world and to higher-level text
   structures such as words and sentences. Also, because strings are
   immutable, we can more safely share storage, and it is not
   unreasonable to mandate the existence of an O(1) STRING-SLICE
   procedure, like SRFI 13's SUBSTRING/SHARED.

There are, of course, still some problems with it. I couldn't think
of a good literal syntax, for instance. However, I think the basic
idea of the proposal is a considerable improvement over the current,
historically motivated, mutable character vector model of strings.

   Some of the fancier implementations might not go well with
   preemptive multithreads; if mutation of string touches more than
   one place of the string objects, it creates a hazard.

While I agree that strings ought to be immutable, as you recommended
afterward, I don't think this is really a very good reason: I can't
imagine why anyone would *want* to share a mutable string between
threads badly enough for synchronization to be the default.

   (It might be convenient to have mutable strings for editor-like
   applications; which also allow length-changing mutation. I'd
   rather think it to be another type of object that can be built on
   top of immutable strings; e.g. a buffer object realized by a
   balanced tree of string segments).

This would definitely be useful. It would also definitely fall
outside the scope of basic Unicode support in R6RS, so I think SRFI 75
shouldn't even try to specify any mutable string data in general.
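
For readers who want something concrete, points 3 and 4 of the proposal
(opaque cursors over UTF-8 storage, and an O(1) slice that shares
storage) can be sketched roughly as follows. This is only an
illustration in Python, not the API from the proposal document: the
names Text, Cursor, next, ref and slice are made up here, and the
cursor steps over code points rather than grapheme clusters, because
grapheme segmentation would need tables that this sketch leaves out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cursor:
    """Opaque position in a Text. Callers are not meant to look inside."""
    offset: int  # octet index into the UTF-8 storage (hidden detail)


class Text:
    """Immutable text kept as UTF-8 octets and addressed by cursors.

    Hypothetical class for illustration only.
    """

    def __init__(self, octets: bytes, lo: int = 0, hi: Optional[int] = None):
        self._octets = octets  # never mutated, so slices may share it
        self._lo = lo
        self._hi = len(octets) if hi is None else hi

    @classmethod
    def from_str(cls, s: str) -> "Text":
        return cls(s.encode("utf-8"))

    def start(self) -> Cursor:
        return Cursor(self._lo)

    def end(self) -> Cursor:
        return Cursor(self._hi)

    def next(self, c: Cursor) -> Cursor:
        """Step one code point forward by skipping continuation octets."""
        if c.offset >= self._hi:
            raise IndexError("cursor is already at the end")
        i = c.offset + 1
        # UTF-8 continuation octets all have the bit pattern 10xxxxxx.
        while i < self._hi and (self._octets[i] & 0xC0) == 0x80:
            i += 1
        return Cursor(i)

    def ref(self, c: Cursor) -> str:
        """Code point at the cursor; no scan from the start is needed."""
        return self._octets[c.offset:self.next(c).offset].decode("utf-8")

    def slice(self, a: Cursor, b: Cursor) -> "Text":
        """O(1) slice: a new Text over the same octets, nothing copied."""
        return Text(self._octets, a.offset, b.offset)

    def to_str(self) -> str:
        return self._octets[self._lo:self._hi].decode("utf-8")


text = Text.from_str("na\u00efve caf\u00e9")
c = text.next(text.next(text.start()))  # two steps from the start
tail = text.slice(c, text.end())        # shares storage with text
print(text.ref(c), tail.to_str())
```

Getting to the Nth character still means stepping N times, but that
cost is paid explicitly by the caller, and any cursor already in hand
is dereferenced without a scan. A natural-number index into the same
UTF-8 storage would have to rescan from the start on every access.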
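
The idea of a mutable buffer layered on immutable strings, raised in
the last quoted paragraph, might look something like the sketch below.
It is a deliberately simplified rope in Python: the names Leaf, Node,
split, concat and Buffer are invented for this illustration, positions
are plain Python string indices rather than cursors, and the tree is
not rebalanced, which a serious implementation would need to do.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Leaf:
    text: str  # one immutable string segment

    @property
    def length(self) -> int:
        return len(self.text)


@dataclass(frozen=True)
class Node:
    left: "Rope"
    right: "Rope"
    length: int  # cached total length of both subtrees


Rope = Union[Leaf, Node]


def concat(a: Rope, b: Rope) -> Rope:
    """Join two ropes without copying their segments."""
    if a.length == 0:
        return b
    if b.length == 0:
        return a
    return Node(a, b, a.length + b.length)


def split(r: Rope, i: int) -> Tuple[Rope, Rope]:
    """Split r at position i, reusing every subtree that is not cut."""
    if isinstance(r, Leaf):
        return Leaf(r.text[:i]), Leaf(r.text[i:])
    if i <= r.left.length:
        a, b = split(r.left, i)
        return a, concat(b, r.right)
    a, b = split(r.right, i - r.left.length)
    return concat(r.left, a), b


def to_str(r: Rope) -> str:
    if isinstance(r, Leaf):
        return r.text
    return to_str(r.left) + to_str(r.right)


class Buffer:
    """Mutable object whose only mutable state is a reference to a rope."""

    def __init__(self, s: str = ""):
        self._rope: Rope = Leaf(s)

    def replace(self, i: int, n: int, s: str) -> None:
        """Replace n characters at position i with s (any length)."""
        head, rest = split(self._rope, i)
        _, tail = split(rest, n)
        self._rope = concat(concat(head, Leaf(s)), tail)

    def __str__(self) -> str:
        return to_str(self._rope)


buf = Buffer("hello world")
buf.replace(0, 5, "goodbye")  # the substitute is longer than the original
print(buf)
```

Each edit builds a new tree and swaps a single reference, so the
segments themselves are never modified. That is the length-changing
mutation that R5RS-style STRING-SET! cannot express, obtained without
making strings themselves mutable.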