Re: encoding strings in memory Per Bothner 13 Jul 2005 16:21 UTC
xxxxxx@beckman.com wrote:
> 1. Strings will almost certainly have to be represented as arrays of
> 32-bit entities, since string-set! allows one to whack any character.
> This representation wastes memory, since the overwhelmingly common case
> is to use characters only from the Basic Multilingual Plane (0x0000 to
> 0xFFFF). For applications we write, the majority of characters are
> ASCII, even though our software is used around the world. Consequently,
> we use UTF-8 for storing strings, even though we run on Microsoft
> Windows (UTF-16-LE).

We have the same problem in the Java world. Native strings and
characters are 16-bit Unicode. This would be fine 99% of the time.
However, use of characters above 0xFFFF requires using surrogate pairs.

The problem is string-ref and string-set!. Existing Java-String-based
encodings have string-ref return *half* of a surrogate pair. This is no
problem for most applications, if you just want to print or copy
strings. It's not really a problem for intelligent code that deals with
composed characters, which needs to work with variable-length strings
anyway. It is a problem for intermediate code that does something with
each individual character.

Note that even these applications don't actually need a linear mapping
from indexes to characters. I.e. arithmetic on indexes in a string is
never (well, hardly ever) useful or meaningful. All we need is a
"position" magic cookie, similar to stdio's fpos_t.

One solution is to have multiple "modes". A string may start out in
8-bit mode, switch to 16-bit mode when a 16-bit character is inserted,
and then switch to 32-bit mode when a still larger character is
inserted. This means the entire string has to be copied when a single
wider character is inserted, but the amortized cost per character is
constant. It also means that we need 32 bits per character for the
entire string, even if there is only a single character > 0xFFFF. (A
rough sketch of such a representation follows after this message.)

> 2. Changing strings to use 32-bit characters will make foreign function
> interfaces difficult, since the major platforms use UTF-16-LE and UTF-8.
> It will also break all existing foreign-function code that relies on
> strings being 8-bit bytes.

The "mode-switching" solution doesn't solve that problem - it makes it
worse.

> It seems to me that keeping char 8-bit and string as an array of 8-bit
> bytes would be the least disruptive change.

But what does char-ref return? I have an idea; see next message.
--
    --Per Bothner
xxxxxx@bothner.com
http://per.bothner.com/
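A minimal Java sketch of the "mode-switching" representation described
above. The FlexString class and its method names are hypothetical, used
only for illustration; they are not part of any existing API or of the
proposal itself.

    // A string buffer that starts in 8-bit mode and widens its storage
    // to 16-bit or 32-bit elements only when a wider character is stored.
    public final class FlexString {
        private byte[] narrow;   // used while every character fits in 8 bits
        private char[] wide;     // used once some character needs 16 bits
        private int[] full;      // used once some character is above 0xFFFF
        private final int length;

        public FlexString(int length) {
            this.length = length;
            this.narrow = new byte[length];
        }

        // Returns the full code point at index, whatever the current mode.
        public int charAt(int index) {
            if (full != null) return full[index];
            if (wide != null) return wide[index];
            return narrow[index] & 0xFF;
        }

        // Stores a code point, widening the whole buffer first if needed.
        public void setCharAt(int index, int codePoint) {
            if (codePoint > 0xFFFF && full == null) {
                widenTo32();
            } else if (codePoint > 0xFF && wide == null && full == null) {
                widenTo16();
            }
            if (full != null) {
                full[index] = codePoint;
            } else if (wide != null) {
                wide[index] = (char) codePoint;
            } else {
                narrow[index] = (byte) codePoint;
            }
        }

        // Each widening copies the entire string once, so a sequence of n
        // stores costs O(n) total: amortized constant per character.
        private void widenTo16() {
            wide = new char[length];
            for (int i = 0; i < length; i++) wide[i] = (char) (narrow[i] & 0xFF);
            narrow = null;
        }

        private void widenTo32() {
            int[] widened = new int[length];
            for (int i = 0; i < length; i++) widened[i] = charAt(i);
            full = widened;
            narrow = null;
            wide = null;
        }
    }

In this sketch charAt returns a whole code point as an int, so the
string-ref problem of handing back half a surrogate pair does not
arise; the costs are the whole-string copy at each widening step and,
as noted above, 32 bits per character for the entire string once any
character is above 0xFFFF.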