Strings/chars Shiro Kawai 22 Dec 2003 21:51 UTC
Handling large character sets in a 'proper' way is a difficult
task, but I think we can assume there will be layers (e.g.
octets, codepoints, graphemes, ...), and that the current
level of string/character abstraction may be retained in some way.
So I focus on how to interface with R5RS-level Scheme string/character
objects.

Before going into details, I point out that if we want maximum
portability, we probably should stick to one encoding,
"ucs4 character and utf8 string", and let implementations
handle all the conversion work.  I'll discuss it later.

First, I'll go through the APIs with the assumption that we want
efficient treatment of native Scheme strings.

- A Scheme character may be just an octet, or an immediate
  object that fits in a word, or a multi-word object.
  (In Gauche, a Scheme character is an immediate object, and fits
  in a machine word.)

  Thus SCHEME_STRING_REF should return an implementation-dependent
  type, scheme_char or something.  So should SCHEME_EXTRACT_CHAR.
  SCHEME_ENTER_CHAR and SCHEME_MAKE_STRING would take scheme_char.

  If a Scheme character is a multi-word object, SCHEME_STRING_REF
  may invoke GC.  Or, if a scheme_char can be allocated on the C stack,
  SCHEME_STRING_REF might receive a region to store the result,
    SCHEME_STRING_REF(scheme_value str, long k, scheme_char *ch)
  to avoid unnecessary allocation on such implementations.
  But that is less efficient in implementations that use just
  an octet per character.

- A Scheme string may not be an array of scheme_char objects.
  (In Gauche, a Scheme string uses a multibyte encoding, i.e. each
  character may occupy a different number of bytes.)
  So it is a good question whether SCHEME_EXTRACT_STRING and
  SCHEME_ENTER_STRING should use scheme_char* or char*.
  In wide-character string implementations, scheme_char*
  would be much more efficient; in multibyte implementations,
  char* would be much more efficient.

- The body of a Scheme string may be read-only (it is so in Gauche,
  and it may be shared by many Scheme strings), and/or it may
  consist of chunks of memory.  In such implementations:

  -- SCHEME_STRING_SET may invoke GC, and is potentially
     very inefficient.
  -- To return a mutable (char *) string, SCHEME_EXTRACT_STRING
     may need to allocate memory and copy the content to it.
     Returning (const char *) can be cheaper.

- Preventing SCHEME_ENTER_STRING from creating a string that
  includes a NUL character seems an unnecessary restriction.
  Passing the length as well makes it possible to include NUL characters.
  However, we need to specify whether the "length" is the number
  of octets or the number of characters.

- If the implementation has a sharable string body, it is useful
  to tell SCHEME_ENTER_STRING whether it should copy the content
  or not, so that it can avoid an unnecessary copy.

- SCHEME_GET_IMPORTED_BINDING and SCHEME_DEFINE_EXPORTED_BINDING
  take char*.  An implementation may use internal Scheme strings to
  represent the names of symbols, so we need to specify a
  clear mapping between them.
  The safest way is to limit binding names to ASCII.
  (A bit off-topic, but why do these APIs take char* instead of const char*?)
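
The octet/character distinction raised in the list above can be made
concrete with a minimal sketch, assuming the string body is well-formed
UTF-8.  The names utf8_char_count and utf8_char_offset are hypothetical
helpers for illustration only; they are not part of the proposed API.
The sketch shows why a "length" argument is ambiguous and why finding
the k-th character in a multibyte body needs a linear scan.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the characters (code points) in the first
   `octets` bytes of a UTF-8 buffer.  Assumes well-formed UTF-8, where
   continuation bytes have the bit pattern 10xxxxxx.  Embedded NUL
   octets are counted like any other character. */
static long utf8_char_count(const char *s, long octets)
{
    long i, n = 0;
    assert(s != NULL && octets >= 0);
    for (i = 0; i < octets; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80) n++;
    }
    return n;
}

/* Hypothetical helper: byte offset of the k-th character (0-based),
   or -1 if k is out of range.  This is a linear scan, which is what
   makes a STRING_REF-style access costly on multibyte bodies. */
static long utf8_char_offset(const char *s, long octets, long k)
{
    long i;
    assert(s != NULL && octets >= 0);
    if (k < 0) return -1;
    for (i = 0; i < octets; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            if (k == 0) return i;
            k--;
        }
    }
    return -1;
}
```

For a buffer holding 'a', NUL, U+00E9 (two octets) and 'z', the octet
length is 5 while the character length is 4, so an API taking a single
"length" has to say which one it means.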

For me, native Scheme string representations can vary too much
to have one single efficient C API.   However, if we think of
this SRFI as a way to ease writing portable "bindings" to other existing
C libraries, then we have to convert internal Scheme strings
to C char* in well-known encodings anyway.
If so, the most practical choice of encoding would be ucs4 character
and utf8 string (although it isn't the case in my daily working
environment).

So, suppose we have something like these:

 typedef long ucs4char;
 typedef char utf8string;   /* so that utf8string* is a char* to UTF-8 octets */

And the APIs are:

 ucs4char SCHEME_EXTRACT_CHAR(scheme_value); (may GC)
 scheme_value SCHEME_ENTER_CHAR(ucs4char);  (may GC)

 utf8string* SCHEME_EXTRACT_STRING(scheme_value);   (may GC)
 scheme_value SCHEME_ENTER_STRING(const utf8string*, long); (may GC)
   /* always copy the passed string */

 ucs4char SCHEME_STRING_REF(scheme_value, long);  (may GC)
 void SCHEME_STRING_SET(scheme_value, long, ucs4char); (may GC)
 scheme_value SCHEME_MAKE_STRING(long, ucs4char); (may GC)

Note that SCHEME_EXTRACT_CHAR may invoke GC as well, since
the implementation may need to run a character-code conversion
routine which needs a dynamic buffer.

Furthermore, I think there needs to be a way to extract the
"raw" byte stream of the internal string body; the
above API assumes all Scheme strings are convertible
to Unicode strings, which is not true in reality.
So something like these would help.

 char *SCHEME_EXTRACT_STRING_RAW(scheme_value);  (may GC)
 scheme_value SCHEME_ENTER_STRING_RAW(const char*, long); (may GC)

And also I think it's reasonable to have a read-only reference
(it is debatable whether we should make this the default and make
the mutable reference optional).

 const utf8string* SCHEME_EXTRACT_STRING_CONST(scheme_value); (may GC)
 const char *SCHEME_EXTRACT_STRING_RAW_CONST(scheme_value); (may GC)
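
The cost difference between the const and the mutable variants can be
sketched with a toy model.  Everything here is hypothetical: struct
toy_string and the toy_extract_* functions are illustrations only and
do not reflect Gauche's actual string layout or any real API.  The
sketch assumes a string body that is read-only and possibly shared.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy model of a string whose body is read-only and may
   be shared by several string objects. */
struct toy_string {
    const char *body;   /* shared, read-only octets */
    long octets;        /* size of body in octets */
};

/* Read-only extraction: hands out the body itself, with no
   allocation and no copy. */
static const char *toy_extract_const(const struct toy_string *s)
{
    assert(s != NULL);
    return s->body;
}

/* Mutable extraction: has to allocate and copy so that the shared
   body stays intact.  Returns NULL on allocation failure; the caller
   frees the result. */
static char *toy_extract_copy(const struct toy_string *s)
{
    char *p;
    assert(s != NULL && s->octets >= 0);
    p = malloc((size_t)s->octets + 1);
    if (p == NULL) return NULL;
    memcpy(p, s->body, (size_t)s->octets);
    p[s->octets] = '\0';
    return p;
}
```

In a real implementation the copy would come from the collector rather
than malloc, which is where the "(may GC)" annotation on the mutable
variant comes from in this model.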

--shiro