Re: encoding strings in memory
Per Bothner 13 Jul 2005 16:21 UTC
xxxxxx@beckman.com wrote:
> 1. Strings will almost certainly have to be represented as arrays of
> 32-bit entities, since string-set! allows one to whack any character.
> This representation wastes memory, since the overwhelmingly common case
> is to use characters only from the Basic Multilingual Plane (0x0000 to
> 0xFFFF). For applications we write, the majority of characters are
> ASCII, even though our software is used around the world. Consequently,
> we use UTF-8 for storing strings, even though we run on Microsoft
> Windows (UTF-16-LE).
We have the same problem in the Java world. Native strings and
characters are 16-bit Unicode. This would be fine 99% of the time.
However, use of characters above 0xFFFF requires using surrogate pairs.
The problem is string-ref and string-set!. Existing Java-String-based
encodings have string-ref return *half* of a surrogate pair. This is no
problem for most applications, if you just want to print or copy
strings. It's not really a problem for intelligent code that deals with
composed characters, since such code needs to work with variable-length
strings anyway.
anyway. It is a problem for intermediate code that does something with
each individual character.
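
To make the surrogate-pair issue concrete, here is a minimal Java sketch. The class and helper names (SurrogateDemo, codeUnitAt, codePointAt) are made up for illustration; only the java.lang.String and java.lang.Character calls are real API. It shows an index-based accessor handing back half of a pair, next to the accessor that returns the whole character.

```java
final class SurrogateDemo {
    // What an index-based string-ref on a Java String sees: one 16-bit code unit.
    static int codeUnitAt(String s, int index) {
        return s.charAt(index);
    }

    // The full character (code point) that starts at the given code-unit index.
    static int codePointAt(String s, int index) {
        return s.codePointAt(index);
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies above 0xFFFF.
        String s = "a" + new String(Character.toChars(0x1D11E)) + "b";
        System.out.println(s.length());                      // 4 code units
        System.out.println(s.codePointCount(0, s.length())); // 3 characters
        System.out.printf("%X%n", codeUnitAt(s, 1));         // D834, a high surrogate
        System.out.printf("%X%n", codePointAt(s, 1));        // 1D11E, the real character
    }
}
```

Code that walks the string one index at a time and treats each result as a character is exactly the "intermediate" code that breaks here.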
Note that even these applications don't actually need a linear mapping
from indexes to characters. I.e. arithmetic on indexes in a string is
never (well, hardly ever) useful or meaningful. All we need is a
"position" magic cookie, similar to stdio's fpos_t.
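
A rough sketch of the "position" idea, again in Java. StringCursor is a hypothetical class, not an existing API; it assumes the underlying representation is a UTF-16 Java String and keeps the code-unit offset private, so callers can only ask for the current character or step forward.

```java
import java.util.ArrayList;
import java.util.List;

final class StringCursor {
    private final String s;
    private int offset; // code-unit offset; hidden so nobody does arithmetic on it

    StringCursor(String s) {
        this.s = s;
        this.offset = 0;
    }

    boolean atEnd() {
        return offset >= s.length();
    }

    // The full character (code point) at the current position.
    int current() {
        return s.codePointAt(offset);
    }

    // Step forward by one character, which is one or two code units.
    void advance() {
        offset += Character.charCount(s.codePointAt(offset));
    }

    // Visit every character by walking positions, never by indexing.
    static List<Integer> codePoints(String s) {
        List<Integer> out = new ArrayList<>();
        for (StringCursor c = new StringCursor(s); !c.atEnd(); c.advance()) {
            out.add(c.current());
        }
        return out;
    }
}
```

Since the cursor never exposes an integer index, the same interface would work unchanged over a UTF-8 or a 32-bit representation.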
One solution is to have multiple "modes". A string may start out in
8-bit mode, and switch to 16-bit mode when a 16-bit character is
inserted, and then switch to 32-bit mode when a still larger character
is inserted. This means the entire string has to be copied when a
single wider character is inserted, but since a string can be widened
at most twice, the amortized cost per character is constant. It also
means that we need 32 bits per character for the entire string, even
if there is only a single character > 0xFFFF.
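
The mode-switching scheme could be sketched roughly as follows. ModalString and its method names are hypothetical, the string is fixed-length for brevity, and the code does no validation of code points; it is only meant to show where the whole-string copy happens.

```java
final class ModalString {
    private byte[] narrow; // 8-bit mode: every character <= 0xFF
    private char[] medium; // 16-bit mode: every character <= 0xFFFF
    private int[] wide;    // 32-bit mode
    private final int length;

    ModalString(int length) {
        this.length = length;
        this.narrow = new byte[length]; // every string starts in 8-bit mode
    }

    int bitsPerChar() {
        return narrow != null ? 8 : medium != null ? 16 : 32;
    }

    // string-ref: always returns a whole character, whatever the mode.
    int ref(int i) {
        if (narrow != null) return narrow[i] & 0xFF;
        if (medium != null) return medium[i];
        return wide[i];
    }

    // string-set!: widens the whole string first if the character does not fit.
    void set(int i, int cp) {
        if (cp > 0xFFFF) widenTo32();
        else if (cp > 0xFF) widenTo16();
        if (narrow != null) narrow[i] = (byte) cp;
        else if (medium != null) medium[i] = (char) cp;
        else wide[i] = cp;
    }

    private void widenTo16() {
        if (narrow == null) return; // already in 16-bit or 32-bit mode
        char[] m = new char[length];
        for (int i = 0; i < length; i++) m[i] = (char) (narrow[i] & 0xFF);
        medium = m;
        narrow = null;
    }

    private void widenTo32() {
        if (wide != null) return; // already in 32-bit mode
        int[] w = new int[length];
        for (int i = 0; i < length; i++) w[i] = ref(i);
        wide = w;
        narrow = null;
        medium = null;
    }
}
```

Setting one character above 0xFFFF in an otherwise ASCII string moves bitsPerChar() from 8 to 32 for every element, which is the memory cost described above.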
> 2. Changing strings to use 32-bit characters will make foreign function
> interfaces difficult, since the major platforms use UTF-16-LE and UTF-8.
> It will also break all existing foreign-function code that relies on
> strings being 8-bit bytes.
The "mode-switching" solution doesn't solve that problem - it makes it
worse.
> It seems to me that keeping char 8-bit and string as an array of 8-bit
> bytes would be the least disruptive change.
But what does string-ref return?
I have an idea; see next message.
--
--Per Bothner
xxxxxx@bothner.com http://per.bothner.com/