Re: character strings versus byte strings

character strings versus byte strings Matthew Flatt (22 Dec 2003 14:16 UTC)
Re: character strings versus byte strings Per Bothner (22 Dec 2003 17:09 UTC)
Re: character strings versus byte strings Matthew Flatt (22 Dec 2003 17:23 UTC)
Re: character strings versus byte strings tb@xxxxxx (22 Dec 2003 20:23 UTC)
(missing)
(missing)
Re: character strings versus byte strings Tom Lord (22 Dec 2003 22:36 UTC)
Re: character strings versus byte strings tb@xxxxxx (22 Dec 2003 22:41 UTC)
Re: character strings versus byte strings Shiro Kawai (22 Dec 2003 23:00 UTC)
Re: character strings versus byte strings Michael Sperber (23 Dec 2003 09:36 UTC)

Re: character strings versus byte strings Per Bothner 22 Dec 2003 17:09 UTC

Matthew Flatt wrote:

>  * Where "char *" is used for strings (e.g., "expected_explanation" for
>    a type error), define it to be an ASCII or Latin-1 encoding (I
>    prefer the latter).

No, it should be UTF-8.

>  * For Scheme characters, pick a specific encoding, probably one of
>    UTF-16, UTF-32, UCS-2, or UCS-4 (but I don't know which is the right
>    choice).

Standardizing a specific encoding either forces Scheme implementations
to standardize encodings internally or forces expensive conversions.

[Slightly off-topic - I doubt anybody will follow my recommendation.]

But if you're going to pick an encoding, I think UTF-8 is "right" -
except for old APIs.  (You can't do random access from a character
number, but there is never any actual need for that.  You need
sequential access plus random access to previously seen characters,
which byte offsets give you.)
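
A minimal sketch of that access pattern, in Python. It is an
illustration only and not part of the original proposal; the helper
name next_char and the sample text are assumptions. It scans UTF-8
bytes sequentially, saves a byte offset as a bookmark, and later jumps
straight back to that character without counting characters.

```python
def next_char(buf: bytes, offset: int):
    """Decode the UTF-8 character that starts at byte `offset`.

    Returns (character, byte offset of the following character).
    """
    first = buf[offset]
    if first < 0x80:
        length = 1
    elif first >> 5 == 0b110:
        length = 2
    elif first >> 4 == 0b1110:
        length = 3
    elif first >> 3 == 0b11110:
        length = 4
    else:
        raise ValueError("offset is not at a character boundary")
    return buf[offset:offset + length].decode("utf-8"), offset + length


text = "naïve 日本語".encode("utf-8")

# Sequential scan: remember the byte offset of the first non-ASCII character.
offset = 0
bookmark = None
while offset < len(text):
    ch, following = next_char(text, offset)
    if bookmark is None and ord(ch) > 0x7F:
        bookmark = offset
    offset = following

# Random access to a previously seen character, using only the saved byte offset.
revisited, _ = next_char(text, bookmark)
```

The bookmark is a byte offset, so returning to it costs the same no
matter how many multi-byte characters precede it.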

A perceived problem with using UTF-8 is that you can't replace
a 1-byte character by a 3-byte character.  But that is just a
symptom of another problem:  a fixed-size mutable "string" is
a useless data structure, only useful for implementing higher
level data structures.
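
To make the size problem concrete, here is a small Python sketch (an
illustration under assumed names, not code from any Scheme
implementation). Replacing the 1-byte "$" with the 3-byte "€" changes
the byte length of the buffer, which only works because a Python
bytearray can resize itself; a fixed-size buffer could not absorb the
two extra bytes.

```python
def replace_char(buf: bytearray, byte_offset: int, old: str, new: str) -> int:
    """Replace the character `old` found at `byte_offset` with `new`.

    Returns the change in the buffer's byte length.
    """
    old_bytes = old.encode("utf-8")
    new_bytes = new.encode("utf-8")
    end = byte_offset + len(old_bytes)
    if buf[byte_offset:end] != old_bytes:
        raise ValueError("old character not found at that offset")
    # Slice assignment resizes the bytearray when the lengths differ.
    buf[byte_offset:end] = new_bytes
    return len(new_bytes) - len(old_bytes)


price = bytearray("cost: $5".encode("utf-8"))
delta = replace_char(price, price.index(b"$"), "$", "€")
```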

So if I were designing a Scheme dialect for internationalization,
I'd do away with mutable strings.  You'd have uniform byte arrays
(for implementation) and "texts".  The latter are implemented
using a byte buffer with a gap (as in an Emacs buffer).  Constant
strings are a special case of texts.
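
A rough sketch of such a "text", written in Python for illustration.
The class name GapText, its methods, and the growth policy are all
assumptions made for this example; the message does not specify an
interface. The text is held as UTF-8 bytes in one buffer with a gap
that is moved to the edit position, so a replacement may change the
byte length.

```python
class GapText:
    """UTF-8 text stored in a byte buffer with a movable gap (hypothetical sketch)."""

    def __init__(self, initial: str = "", gap_size: int = 16):
        data = initial.encode("utf-8")
        self._buf = bytearray(data) + bytearray(gap_size)
        self._gap_start = len(data)
        self._gap_end = len(self._buf)

    def __len__(self):
        """Length of the text in bytes, excluding the gap."""
        return len(self._buf) - (self._gap_end - self._gap_start)

    def _move_gap(self, pos: int):
        """Move the gap so that it begins at byte offset `pos` of the text."""
        if pos < self._gap_start:
            n = self._gap_start - pos
            # Shift the n bytes before the gap to the end of the gap.
            self._buf[self._gap_end - n:self._gap_end] = self._buf[pos:self._gap_start]
            self._gap_start -= n
            self._gap_end -= n
        elif pos > self._gap_start:
            n = pos - self._gap_start
            # Shift the n bytes after the gap to the start of the gap.
            self._buf[self._gap_start:self._gap_start + n] = \
                self._buf[self._gap_end:self._gap_end + n]
            self._gap_start += n
            self._gap_end += n

    def _grow_gap(self, needed: int):
        """Enlarge the gap by at least `needed` bytes."""
        extra = max(needed, len(self._buf))
        self._buf[self._gap_end:self._gap_end] = bytearray(extra)
        self._gap_end += extra

    def replace(self, pos: int, old_len: int, new: str):
        """Replace `old_len` bytes at byte offset `pos` with the encoding of `new`."""
        data = new.encode("utf-8")
        self._move_gap(pos)
        self._gap_end += old_len  # the old bytes become part of the gap
        if len(data) > self._gap_end - self._gap_start:
            self._grow_gap(len(data))
        self._buf[self._gap_start:self._gap_start + len(data)] = data
        self._gap_start += len(data)

    def __str__(self):
        return (self._buf[:self._gap_start] + self._buf[self._gap_end:]).decode("utf-8")


t = GapText("cost: $5", gap_size=1)
t.replace(6, 1, "€")      # 1-byte character replaced by a 3-byte character
first_edit = str(t)
t.replace(0, 4, "prix")   # a later edit elsewhere just moves the gap
second_edit = str(t)
```

In this sketch a constant string would simply be a GapText that is
never edited, so its gap never moves.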

For compatibility with old Scheme code that uses character indexes,
a "string" would be a text with a 1-element index cache to map a
character index to a buffer index.
--
	--Per Bothner
xxxxxx@bothner.com   http://per.bothner.com/