Re: encoding strings in memory Per Bothner 13 Jul 2005 16:21 UTC
xxxxxx@beckman.com wrote:
> 1. Strings will almost certainly have to be represented as arrays of
> 32-bit entities, since string-set! allows one to whack any character.
> This representation wastes memory, since the overwhelmingly common case
> is to use characters only from the Basic Multilingual Plane (0x0000 to
> 0xFFFF). For applications we write, the majority of characters are
> ASCII, even though our software is used around the world. Consequently,
> we use UTF-8 for storing strings, even though we run on Microsoft
> Windows (UTF-16-LE).

We have the same problem in the Java world. Native strings and
characters are 16-bit Unicode. This would be fine 99% of the time.
However, use of characters above 0xFFFF requires using surrogate pairs.

The problem is string-ref and string-set!. Existing Java-String-based
encodings have string-ref return *half* of a surrogate pair. This is no
problem for most applications, if you just want to print or copy
strings. It's not really a problem for intelligent code that deals with
composed characters, which needs to work with variable-length strings
anyway. It is a problem for intermediate code that does something with
each individual character.

Note that even these applications don't actually need a linear mapping
from indexes to characters. I.e. arithmetic on indexes in a string is
never (well, hardly ever) useful or meaningful. All we need is a
"position" magic cookie, similar to stdio's fpos_t.

One solution is to have multiple "modes". A string may start out in
8-bit mode, switch to 16-bit mode when a 16-bit character is inserted,
and then switch to 32-bit mode when a still larger character is
inserted. This means the entire string has to be copied when a single
wider character is inserted, but the amortized cost per character is
constant. It also means that we need 32 bits per character for the
entire string, even if there is only a single character > 0xFFFF. (A
rough sketch of such a representation follows after this message.)

> 2. Changing strings to use 32-bit characters will make foreign function
> interfaces difficult, since the major platforms use UTF-16-LE and UTF-8.
> It will also break all existing foreign-function code that relies on
> strings being 8-bit bytes.

The "mode-switching" solution doesn't solve that problem - it makes it
worse.

> It seems to me that keeping char 8-bit and string as an array of 8-bit
> bytes would be the least disruptive change.

But what does char-ref return? I have an idea; see next message.
--
    --Per Bothner
xxxxxx@bothner.com
http://per.bothner.com/
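A minimal Java sketch of the "mode-switching" representation described
above. The FlexString class and its method names are hypothetical, used
only for illustration; they are not part of any existing API or of the
proposal itself.

    // A string buffer that starts in 8-bit mode and widens its storage
    // to 16-bit or 32-bit elements only when a wider character is stored.
    public final class FlexString {
        private byte[] narrow;   // used while every character fits in 8 bits
        private char[] wide;     // used once some character needs 16 bits
        private int[] full;      // used once some character is above 0xFFFF
        private final int length;

        public FlexString(int length) {
            this.length = length;
            this.narrow = new byte[length];
        }

        // Returns the full code point at index, whatever the current mode.
        public int charAt(int index) {
            if (full != null) return full[index];
            if (wide != null) return wide[index];
            return narrow[index] & 0xFF;
        }

        // Stores a code point, widening the whole buffer first if needed.
        public void setCharAt(int index, int codePoint) {
            if (codePoint > 0xFFFF && full == null) {
                widenTo32();
            } else if (codePoint > 0xFF && wide == null && full == null) {
                widenTo16();
            }
            if (full != null) {
                full[index] = codePoint;
            } else if (wide != null) {
                wide[index] = (char) codePoint;
            } else {
                narrow[index] = (byte) codePoint;
            }
        }

        // Each widening copies the entire string once, so a sequence of n
        // stores costs O(n) total: amortized constant per character.
        private void widenTo16() {
            wide = new char[length];
            for (int i = 0; i < length; i++) wide[i] = (char) (narrow[i] & 0xFF);
            narrow = null;
        }

        private void widenTo32() {
            int[] widened = new int[length];
            for (int i = 0; i < length; i++) widened[i] = charAt(i);
            full = widened;
            narrow = null;
            wide = null;
        }
    }

In this sketch charAt returns a whole code point as an int, so the
string-ref problem of handing back half a surrogate pair does not
arise; the costs are the whole-string copy at each widening step and,
as noted above, 32 bits per character for the entire string once any
character is above 0xFFFF.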