Re: Surrogates and character representation
Per Bothner 24 Jul 2005 18:25 UTC
Tom Emerson wrote:
> Representing strings internally in UTF-8 is a loss though, since you
> lose random access to the string.
Random access to a previously accessed position works just fine - just
use the byte offset.
Random access to a position in a string that has not been previously
accessed is not in itself useful.
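
To make the byte-offset point concrete, here is a minimal Python sketch
(not taken from any Scheme implementation). The helper name `char_at`
and the sample string are made up for illustration. A sequential search
yields a byte offset, and going back to that offset later costs no
rescan of the string.

```python
def char_at(buf: bytes, offset: int) -> str:
    """Decode the single code point that starts at byte `offset` of UTF-8 `buf`."""
    lead = buf[offset]
    # The lead byte alone tells us how many bytes the code point occupies.
    if lead < 0x80:
        length = 1
    elif lead >> 5 == 0b110:
        length = 2
    elif lead >> 4 == 0b1110:
        length = 3
    elif lead >> 3 == 0b11110:
        length = 4
    else:
        # A continuation byte (10xxxxxx) or an invalid lead byte.
        raise ValueError("offset is not at the start of a code point")
    return buf[offset:offset + length].decode("utf-8")


text = "naïve café".encode("utf-8")

# A sequential operation (here, a search) produces a byte offset ...
saved = text.find("é".encode("utf-8"))

# ... and returning to that position later needs no scan from the start.
found = char_at(text, saved)
```

Note that `saved` is a byte offset, not a character index: the two
differ as soon as a multi-byte character precedes the position.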
> For some applications this isn't a big deal, but in general using UTF-8
> as an internal representation is a bad idea.
It's the other way round. Using UTF-8 as an internal representation is
just fine for *applications*. The problem is that certain *APIs* have a
concept of indexing into a string, and unfortunately R5RS is one of
them. In itself indexing of strings is a useless feature, as it can be
replaced by a sequential-access cursor/iterator API - but historically
the Scheme cursor/iterator API uses integers for the "cursor". And
existing code moves the "cursor" forwards by adding 1.
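
A rough Python sketch of the difference, under the assumption that the
string is stored as UTF-8 bytes. The names `advance`, `index_to_offset`
and `code_points` are hypothetical, not part of R5RS or any Scheme
library. Moving a byte-offset cursor forward is cheap, while emulating
an integer character index means walking from the start each time.

```python
def advance(buf: bytes, cursor: int) -> int:
    """Return the byte offset of the code point following the one at `cursor`."""
    cursor += 1
    # Skip continuation bytes, which all have the bit pattern 10xxxxxx.
    while cursor < len(buf) and (buf[cursor] & 0xC0) == 0x80:
        cursor += 1
    return cursor


def index_to_offset(buf: bytes, index: int) -> int:
    """Emulate integer indexing: walk `index` code points from the start."""
    cursor = 0
    for _ in range(index):
        if cursor >= len(buf):
            raise IndexError("string index out of range")
        cursor = advance(buf, cursor)
    if cursor >= len(buf):
        raise IndexError("string index out of range")
    return cursor


def code_points(buf: bytes):
    """Sequential access: yield each code point, moving the cursor once per step."""
    cursor = 0
    while cursor < len(buf):
        nxt = advance(buf, cursor)
        yield buf[cursor:nxt].decode("utf-8")
        cursor = nxt


sample = "aé€😀".encode("utf-8")   # code points of 1, 2, 3 and 4 bytes
chars = list(code_points(sample))
```

With this layout, a loop that calls `index_to_offset(buf, i)` for
i = 0, 1, 2, ... rescans the prefix on every step, so it is quadratic
in the string length, whereas `code_points` visits each byte once. That
is the cost existing code pays when its "cursor" is an integer that it
advances by adding 1.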
--
--Per Bothner
xxxxxx@bothner.com http://per.bothner.com/