Re: Strings/chars Jim Blandy 08 Jan 2004 08:09 UTC

I think it's important to work well with C as it is, and not try to
correct C's problems in the SRFI API.  C doesn't specify a particular
character set.  It distinguishes the character set used in the source
code (the "source character set") from the character set used in the
running program (the "execution character set"), and specifies a
limited set of characters that both must contain (the "basic character
set").
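
To make the distinction concrete, here is a small sketch of why C code
should write 'z' rather than a number like 122: a character constant
takes its value from the execution character set, whatever that is.
The table is the list of graphic characters from C99 section 5.2.1;
the name is_basic_graphic is made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* Letters, digits, and graphic characters that C99 (5.2.1) requires
   in both the source and the execution character set.  The values
   of these characters are whatever the execution character set
   assigns; the code never mentions ASCII or any other set by name. */
static const char basic_graphic[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "!\"#%&'()*+,-./:;<=>?[\\]^_{|}~";

/* Hypothetical helper: is c one of the basic graphic characters? */
static int is_basic_graphic(char c)
{
    /* strchr would also match the terminating '\0', so exclude it. */
    return c != '\0' && strchr(basic_graphic, c) != NULL;
}
```

Under this table, is_basic_graphic('z') is true, while '@' and '$' are
rejected because C99 does not guarantee them.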

The simplest and easiest ways to access Scheme strings' contents and
Scheme characters should return the corresponding strings and
characters in the C program's execution character set.  That is:

- Extracting the first character of the Scheme string "z" had better
  yield the C character 'z'.

- I shouldn't have to mention any character set by name, or do any
  kind of character set conversions at all, to write C code that
  checks whether a given string's contents are "foo".
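
From the C side, the two points above might look something like the
following.  This is only a sketch: scm_string, scm_string_from_c,
scm_string_ref_c, and scm_string_equals_c are hypothetical names, not
taken from any SRFI draft, and the toy struct stands in for a real
Scheme string object whose representation would normally be hidden.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a Scheme string.  For this sketch the
   contents are simply held in the C execution character set. */
typedef struct {
    const char *data;
    size_t length;
} scm_string;

/* Hypothetical constructor, only so the example is self-contained. */
static scm_string scm_string_from_c(const char *s)
{
    scm_string r;
    assert(s != NULL);
    r.data = s;
    r.length = strlen(s);
    return r;
}

/* Hypothetical accessor: the character at index i, as a C char in
   the execution character set. */
static char scm_string_ref_c(scm_string s, size_t i)
{
    assert(i < s.length);
    return s.data[i];
}

/* Hypothetical comparison against an ordinary C string.  No
   character set is named and no conversion is requested. */
static int scm_string_equals_c(scm_string s, const char *lit)
{
    size_t n = strlen(lit);
    return s.length == n && memcmp(s.data, lit, n) == 0;
}
```

With these, scm_string_ref_c(scm_string_from_c("z"), 0) == 'z' holds,
and checking for "foo" is a single call to scm_string_equals_c.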

Implementing this behavior can't be a burden on the Scheme
implementation: it had to get all its data from the outside system
anyway, and that data was almost certainly already in the C program's
execution character set when it arrived.  So if the Scheme system
doesn't actually use the C execution character set itself, it must
already have mechanisms for converting to and from that character set.

The next case to support is the "Scheme execution character set",
where you just return the data in whatever form is cheapest and
easiest for the Scheme system, once you've flattened it out into an
array of characters or wide characters.  You can't assume any
relationship between this form and the C program's execution character
set, of course, but you can at least pass it through without paying
for conversions you don't need, or wondering if the round trip is
going to munge anything.  And, when you don't care about writing code
portable to other Schemes, you can operate on the data directly.
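
A sketch of what that pass-through might look like, assuming the
Scheme system can hand out its contents as a flat array of bytes plus
a size.  The names scm_raw_string, scm_raw_copy, and
scm_raw_round_trip_ok are hypothetical.  The C code treats the bytes
as opaque, so nothing is converted and nothing can be munged.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical: string contents in the Scheme system's own
   representation, flattened into bytes.  C code makes no assumption
   about how these bytes relate to its execution character set. */
typedef struct {
    unsigned char *bytes;
    size_t size;
} scm_raw_string;

/* Hypothetical: copy the contents out (or back in) exactly as
   stored, with no conversion step. */
static scm_raw_string scm_raw_copy(const unsigned char *bytes, size_t size)
{
    scm_raw_string r;
    r.bytes = malloc(size ? size : 1);  /* avoid a zero-size request */
    assert(r.bytes != NULL);
    memcpy(r.bytes, bytes, size);
    r.size = size;
    return r;
}

/* Returns 1 if copying out and then back in preserves every byte,
   including zero bytes and bytes with no meaning in C's character
   set. */
static int scm_raw_round_trip_ok(const unsigned char *bytes, size_t size)
{
    scm_raw_string out = scm_raw_copy(bytes, size);
    scm_raw_string in = scm_raw_copy(out.bytes, out.size);
    int same = in.size == size && memcmp(in.bytes, bytes, size) == 0;
    free(out.bytes);
    free(in.bytes);
    return same;
}
```

Carrying an explicit size rather than relying on a terminating zero is
a design choice of the sketch: the Scheme representation may well
contain zero bytes.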

Only after those two cases are covered should one move on to providing
ways to reliably get UTF-8, UCS-4, or whatever you like.

That's my two cents, anyway.