Re: Issues with Unicode

Re: Issues with Unicode Taylor R. Campbell 26 Apr 2006 07:48 UTC
   Date: Tue, 25 Apr 2006 20:27:07 -1000 (HST)
   From: Shiro Kawai <xxxxxx@lava.net>

   Alternative implementations of strings have been discussed in
   this list, and some threads in comp.lang.scheme, I think.
   I'd like to draw attention to one point which hasn't been
   raised, IIRC.  (Maybe it is too trivial and everybody knows
   about it; if so, sorry for the noise.)

I can't recall whether this ever came up here, but last year, when
this SRFI was still fresh and under heavy discussion, I wrote up an
alternative proposal for a Unicode-supporting -- although *not*
Unicode-mandating -- string API.  In it, strings are collections of
grapheme clusters indexed by opaque cursors, not character indices,
and binary encoding is separated out into BLOB->STRING and
STRING->BLOB[!] procedures and abstracted by text codec descriptors.
The text of the document is here:

  <http://mumble.net/~campbell/proposals/alt-text.text>.

It came out of many extensive discussions with John Cowan, Jorgen
Schaefer, and probably a number of other persons whom I've forgotten
by now.  Here are some of the most important points about it, off the
top of my head:

 1. It doesn't require Unicode support, for instance in the Scheme
    system that runs on your doorknob.  More seriously, the API simply
    does not specify anything about particular code point mappings or
    text codecs other than that ASCII must be supported.

 2. It's high-level.  We can sweep things like normalization wholly
    under the rug with it.  We needn't mandate a particular internal
    string representation; the API would work equally well whether
    all strings are stored internally as UTF-8, as UTF-32, or as
    pairs of a text codec and the actual storage.

 3. Further on the point that we can use UTF-8 internally: the API
    permits efficient variable-width string representations such as
    UTF-8, because strings are indexed by opaque cursors, which may be
    implemented as octet indices for constant-time access, whereas
    natural-number character indices would require O(n) access time.
    Moreover, higher-level text structures, such as grapheme clusters
    or words or sentences or paragraphs, would require explicit
    stepping, as with string cursors, or O(n) access times anyway.

 4. Strings are immutable.  The application of mutability in old
    R5RS-style strings was extremely limited, anyway: you can change
    existing characters, but you can't insert or delete, so you can't,
    say, change a whole _word_ in a string, if the substitute has a
    length different from the original.  It just so happened that all
    characters had the same width in all practical implementations, so
    we could swap in new ones as we pleased, but this assumption
    doesn't hold up very well if we want to extend our text-processing
    capabilities beyond that limited world and to higher-level text
    structures such as words and sentences.  Also, because strings are
    immutable, we can more safely share storage, and it is not
    unreasonable to mandate the existence of an O(1) STRING-SLICE
    procedure, like SRFI 13's SUBSTRING/SHARED.
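
To make points 3 and 4 concrete, here is a minimal sketch in Python
(not Scheme, and not the API from the proposal) of an immutable string
stored as UTF-8 and indexed by opaque cursors.  All of the names in it
(Cursor, Utf8String, start, end, next, ref, slice) are made up for
illustration.  It also steps over code points rather than grapheme
clusters, which is a simplification; the proposal indexes grapheme
clusters.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cursor:
    """Opaque position in a string; callers should not inspect _offset."""
    _offset: int


class Utf8String:
    """Hypothetical immutable string stored as UTF-8 octets.

    Assumption: cursors step over code points, not grapheme clusters.
    """

    def __init__(self, data: bytes, start: int = 0,
                 end: Optional[int] = None):
        self._data = data            # shared, never mutated
        self._start = start
        self._end = len(data) if end is None else end

    def start(self) -> Cursor:
        return Cursor(self._start)

    def end(self) -> Cursor:
        return Cursor(self._end)

    def next(self, c: Cursor) -> Cursor:
        # Skip the lead octet, then any continuation octets (10xxxxxx).
        i = c._offset + 1
        while i < self._end and (self._data[i] & 0xC0) == 0x80:
            i += 1
        return Cursor(i)

    def ref(self, c: Cursor) -> str:
        # The cursor already is an octet offset, so no scan from the
        # beginning of the string is needed to find the character.
        return self._data[c._offset:self.next(c)._offset].decode("utf-8")

    def slice(self, a: Cursor, b: Cursor) -> "Utf8String":
        # O(1): the slice shares the underlying octets, which is safe
        # only because the string is immutable.
        return Utf8String(self._data, a._offset, b._offset)

    def to_str(self) -> str:
        return self._data[self._start:self._end].decode("utf-8")
```

Stepping a cursor twice from the start of "naïve" lands on "ï" without
any scan from the beginning of the string, and slicing from there to
the end shares storage with the original rather than copying it.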

There are, of course, still some problems with it.  I couldn't think
of a good literal syntax, for instance.  However, I think the basic
idea of the proposal is a considerable improvement over the current,
historically motivated, mutable character vector model of strings.

   Some of the fancier implementations might not go well with
   preemptive multithreads; if mutation of string touches more
   than one place of the string objects, it creates a hazard.

While I agree that strings ought to be immutable, as you recommended
afterward, I don't think this is really a very good reason: I can't
imagine why anyone would *want* to share a mutable string between
threads badly enough for synchronization to be the default.

   (It might be convenient to have mutable strings for editor-like
   applications; which also allow length-changing mutation.  I'd
   rather think it to be another type of object that can be built
   on top of immutable strings; e.g. a buffer object realized by
   a balanced tree of string segments).

This would definitely be useful.  It would also definitely fall
outside the scope of basic Unicode support in R6RS, so I think SRFI 75
shouldn't even try to specify any mutable string data type at all.
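
For illustration only, a buffer of that kind might be sketched roughly
as follows, in Python rather than Scheme.  This is a plain binary tree
of immutable string segments with no rebalancing, so it is simpler
than the balanced tree suggested above, and every name in it (Leaf,
Node, concat, split, insert, delete, to_string) is hypothetical.
Positions are Python string indices, not the cursors of the proposal.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Leaf:
    """One immutable string segment."""
    text: str

    @property
    def length(self) -> int:
        return len(self.text)


@dataclass(frozen=True)
class Node:
    """Concatenation of two sub-buffers; length is cached."""
    left: "Rope"
    right: "Rope"
    length: int


Rope = Union[Leaf, Node]


def concat(a: Rope, b: Rope) -> Rope:
    # Drop empty sides so the tree does not fill up with empty leaves.
    if a.length == 0:
        return b
    if b.length == 0:
        return a
    return Node(a, b, a.length + b.length)


def split(r: Rope, i: int) -> Tuple[Rope, Rope]:
    """Split r into the parts before and after position i."""
    if isinstance(r, Leaf):
        return Leaf(r.text[:i]), Leaf(r.text[i:])
    if i <= r.left.length:
        before, middle = split(r.left, i)
        return before, concat(middle, r.right)
    middle, after = split(r.right, i - r.left.length)
    return concat(r.left, middle), after


def insert(r: Rope, i: int, text: str) -> Rope:
    before, after = split(r, i)
    return concat(concat(before, Leaf(text)), after)


def delete(r: Rope, i: int, j: int) -> Rope:
    """Remove positions i (inclusive) through j (exclusive)."""
    before, rest = split(r, i)
    _, after = split(rest, j - i)
    return concat(before, after)


def to_string(r: Rope) -> str:
    if isinstance(r, Leaf):
        return r.text
    return to_string(r.left) + to_string(r.right)
```

Every operation returns a new buffer and leaves the old one intact, so
length-changing edits (replacing a whole word with a longer one, say)
never mutate a string in place.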