Re: character strings versus byte strings
Per Bothner 22 Dec 2003 17:09 UTC
Matthew Flatt wrote:
> * Where "char *" is used for strings (e.g., "expected_explanation" for
> a type error), define it to be an ASCII or Latin-1 encoding (I
> prefer the latter).
No, it should be UTF-8.
> * For Scheme characters, pick a specific encoding, probably one of
> UTF-16, UTF-32, UCS-2, or UCS-4 (but I don't know which is the right
> choice).
Standardizing a specific encoding either forces Scheme implementations
to use that encoding internally or forces expensive conversions.
[Slightly off-topic - I doubt anybody will follow my recommendation.]
But if you're going to pick an encoding, I think UTF-8 is "right" -
except for old APIs. (You can't do random access from a character
number, but there is never any actual need for that. You need
sequential access plus random access to previously seen characters,
which byte offsets give you.)
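
A minimal sketch of that access pattern in Python, assuming the text is
held as well-formed UTF-8 bytes. The names next_char and scan are
hypothetical, not taken from any SRFI or implementation. The scan is
sequential, and the byte offset of any character already seen serves
as a random-access handle back to it:

```python
def next_char(buf, offset):
    """Decode the character starting at byte `offset`.

    Returns (character, offset of the following character).
    Assumes `buf` holds well-formed UTF-8.
    """
    lead = buf[offset]
    if lead < 0x80:                # 0xxxxxxx: 1-byte (ASCII) character
        length = 1
    elif lead >> 5 == 0b110:       # 110xxxxx: 2-byte sequence
        length = 2
    elif lead >> 4 == 0b1110:      # 1110xxxx: 3-byte sequence
        length = 3
    elif lead >> 3 == 0b11110:     # 11110xxx: 4-byte sequence
        length = 4
    else:                          # 10xxxxxx: middle of a character
        raise ValueError("offset %d is not a character boundary" % offset)
    end = offset + length
    return buf[offset:end].decode("utf-8"), end


def scan(buf):
    """Yield (byte_offset, character) pairs in one sequential pass."""
    offset = 0
    while offset < len(buf):
        ch, following = next_char(buf, offset)
        yield offset, ch
        offset = following


text = "naïve €5".encode("utf-8")
seen = {ch: off for off, ch in scan(text)}   # remember where each character was
euro, _ = next_char(text, seen["€"])         # jump straight back by byte offset
```

Nothing here ever needs "the n-th character"; a remembered byte offset
does the job of a character index.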
A perceived problem with using UTF-8 is that you can't replace
a 1-byte character by a 3-byte character. But that is just a
symptom of another problem: a fixed-size mutable "string" is
of little use on its own, and is only useful for implementing
higher-level data structures.
So if I was designing a Scheme dialect for internationalization,
I'd do away with mutable strings. You'd have uniform byte arrays
(for implementation) and "texts". The latter are implemented
using a byte buffer with a gap (as in an Emacs buffer). Constant
strings are a special case of texts.
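
A rough sketch of such a "text" in Python, assuming UTF-8 storage in a
single bytearray with a movable gap. The class name Text, its methods,
and the 16-byte gap growth are all hypothetical choices made for
illustration; this is not how Emacs or any Scheme implements it. It
shows that replacing a 1-byte character by a 3-byte one stops being a
problem once the buffer is allowed to change size:

```python
class Text:
    """Hypothetical 'text': UTF-8 bytes in a buffer with a movable gap."""

    def __init__(self, initial="", gap=16):
        data = initial.encode("utf-8")
        self.buf = bytearray(data) + bytearray(gap)
        self.gap_start = len(data)     # first byte of the gap
        self.gap_end = len(self.buf)   # first byte after the gap

    def _move_gap(self, pos):
        """Slide the gap so that it starts at content byte offset `pos`."""
        if pos < self.gap_start:
            # Shift the bytes between pos and the gap to the gap's far side.
            n = self.gap_start - pos
            self.buf[self.gap_end - n:self.gap_end] = self.buf[pos:self.gap_start]
            self.gap_start -= n
            self.gap_end -= n
        elif pos > self.gap_start:
            # Shift the bytes just after the gap to the gap's near side.
            n = pos - self.gap_start
            self.buf[self.gap_start:self.gap_start + n] = \
                self.buf[self.gap_end:self.gap_end + n]
            self.gap_start += n
            self.gap_end += n

    def replace(self, pos, old_len, new):
        """Replace `old_len` bytes at content byte offset `pos` with `new`."""
        data = new.encode("utf-8")
        self._move_gap(pos + old_len)  # gap now sits right after the old bytes
        self.gap_start -= old_len      # delete them by absorbing them into the gap
        if len(data) > self.gap_end - self.gap_start:
            extra = len(data) + 16     # grow the gap when the new bytes do not fit
            self.buf[self.gap_end:self.gap_end] = bytearray(extra)
            self.gap_end += extra
        self.buf[self.gap_start:self.gap_start + len(data)] = data
        self.gap_start += len(data)

    def __str__(self):
        return (self.buf[:self.gap_start] + self.buf[self.gap_end:]).decode("utf-8")


t = Text("cost: $5")
t.replace(6, 1, "€")   # "$" is 1 byte at offset 6; "€" is 3 bytes
print(t)
```

Edits near the gap cost time proportional to the bytes replaced, not to
the length of the text, which is why editors favor this layout.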
For compatibility with old Scheme code that uses character indexes,
a "string" would be a text with a 1-element index cache to map a
character index to a buffer index.
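
One way the 1-element index cache could work, sketched in Python. For
brevity it sits on top of immutable UTF-8 bytes rather than a gap
buffer, and the names IndexedString, ref and cache are hypothetical.
The cache remembers the (character index, byte offset) pair of the last
lookup and walks forward or backward from there:

```python
class IndexedString:
    """Hypothetical compatibility 'string': UTF-8 bytes plus a 1-entry cache."""

    def __init__(self, s):
        self.data = s.encode("utf-8")
        self.cache = (0, 0)   # (character index, byte offset) of the last lookup

    def _is_continuation(self, off):
        # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
        return self.data[off] & 0xC0 == 0x80

    def _offset(self, index):
        """Map a character index to a byte offset, starting from the cache."""
        if index < 0:
            raise IndexError("string index out of range")
        ci, off = self.cache
        while ci < index:                 # walk forward one character at a time
            if off >= len(self.data):
                raise IndexError("string index out of range")
            off += 1
            while off < len(self.data) and self._is_continuation(off):
                off += 1
            ci += 1
        while ci > index:                 # walk backward one character at a time
            off -= 1
            while self._is_continuation(off):
                off -= 1
            ci -= 1
        if off >= len(self.data):
            raise IndexError("string index out of range")
        self.cache = (ci, off)
        return off

    def ref(self, index):
        """Rough analogue of string-ref: return the character at `index`."""
        start = self._offset(index)
        end = start + 1
        while end < len(self.data) and self._is_continuation(end):
            end += 1
        return self.data[start:end].decode("utf-8")


s = IndexedString("a€é!")
chars = [s.ref(i) for i in range(4)]   # sequential indexing, one step per call
```

Old code that loops over indexes 0, 1, 2, ... then advances the cache
one character per call, so the whole loop stays linear rather than
rescanning from the start each time.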
--
--Per Bothner
xxxxxx@bothner.com http://per.bothner.com/