constant-time access to variable-width encodings

Show/hide message thread

constant-time access to variable-width encodings Per Bothner (13 Jul 2005 18:13 UTC)

Re: constant-time access to variable-width encodings Ray Blaak (13 Jul 2005 18:48 UTC)

Re: constant-time access to variable-width encodings Shiro Kawai (13 Jul 2005 20:16 UTC)

Re: constant-time access to variable-width encodings Per Bothner (13 Jul 2005 20:36 UTC)

Re: constant-time access to variable-width encodings Shiro Kawai (13 Jul 2005 23:07 UTC)

Re: constant-time access to variable-width encodings bear (14 Jul 2005 00:23 UTC)

Re: constant-time access to variable-width encodings Per Bothner (14 Jul 2005 00:39 UTC)

Re: constant-time access to variable-width encodings bear (14 Jul 2005 01:52 UTC)

Re: constant-time access to variable-width encodings Thomas Bushnell BSG (14 Jul 2005 07:18 UTC)

Re: constant-time access to variable-width encodings Thomas Bushnell BSG (14 Jul 2005 07:16 UTC)

constant-time access to variable-width encodings Per Bothner 13 Jul 2005 18:12 UTC

Here is an idea for an implementation strategy for using UTF-8 or UTF-16
encoding of strings without breaking constant-time string-ref.
Obviously R6RS should not require or assume this implementation, but it
would be nice if it could be written to *allow* it.

Assume a special character:
#\partial - an incomplete character.
We can define it as U+D800 (start of the high surrogates area), since
that is never a valid Unicode character.

The goal is to allow implementations to use plain 8-bit UTF-8 or 16-bit
UTF-16 string encodings, while still allowing constant-time string-ref.
  Both of these encoding have the nice property that it is trivial to
detect whether a stored (8-bit or 16-bit) character is a complete
character or whether it is part of a multibyte encoding or surrogate pair.

The proposal is to allow string-ref to return #\partial for some indexes
representing non-initial bytes or low-surrogate values.  Assume a string
using UTF-8:
"Rød" (Norwegian for "red") - i.e. {#\R, #\xF8, #\d}.
The UTF-8 representation is: {#x52, #xc3, #xb8, #x64 }.
(string-ref "Rød" 0) => #\R
(string-ref "Rød" 1) => #\ø
(string-ref "Rød" 2) => #\partial
(string-ref "Rød" 3) => #\d
(string-length "Rød") => 4 ;; Not 3!

I.e. the complete character value is returned for the index of its first
byte/half, and #\partial is returned for subsequence indexes.

The character #\partial is generally ignored.  Specifically, it is
ignored when printing or by string-set! or the (string char ...)
function.  The character routines also generally "ignore" it:
(char-upcase #\partial) => #\partial
...
(char-alphabetic? #\partial) => #f
...

The string-length function returns the "allocated" length, which is the
same as the number of character *including* any #\partial characters.
Thus existing code generally needs no change.  There is seldom a need to
test explicitly for #\partial - it is treated like a zero-width
"filler", and user code can treat it as such.  That only difference from
a normal (zero-width) character is that it is never explicitly stored in
a string.  But that's an application detail.

This brings us to string-set! and other side-effecting string
procedures.  The obvious problem is: what happens if you replace a
1-byte character with a multibyte character or vice versa?  In that case
you may have to widen or narrow the string.  That may seem expensive,
but in practice is unlikely to be an issue.  Random access of strings is
not something people generally do.  Most of the time people copy a
string or fill it in left-to-right, which means that "replacing" an
existing character isn't a issue.  However, it does mean that a string
may need a variable-size buffer.  But that is needed anyway.

Note that mutable fixed-width strings really make no sense: most strings
are immutable, once constructed.  If you do need to mutate a string, a
fixed-length string is useless.  A fixed-size mutable buffer only makes
sense because it is easy to implement, not because it is useful.

So let's make (mutable) strings variable-length.  The implementation is
trivial:  Each string object contains a pointer to a u8 or u16 buffer,
plus a current length, plus a buffer size (which might be stored with
the buffer).

(Shared substrings are a possibility in this model, but I won't discuss
them further.)

The preferred way to construct a string is now this function:
(string-append! string char-or-string ...)
   Append (in place) each char-or-string to the end of the string.
   If an argument is the #\partial character it is ignored.

This is a cheap constant-time (on average) operation.  But note that
appending a character may change (string-length string) by an
implementation-defined amount:   If the character requires multiple
buffer (u8 or u16) positions, it may increase the string-length by more
than 1, and if it is #\partial it doesn't change the length.  However,
appending a string always causes string-length to increase by the
string-length of the added strings.

It is also reasonable to provide:
(string-replace! string start end replacement-string)
   Replace (in place) (substring start end) by replacement-string.

Now we can implement string-set! in terms of string-replace!:
(define (string-set! string k char)
   (let ((end (start-of-next-char string k)))
      (string-replace! string k end (make-string 1 char))))
where (start-of-next-char string k) is the index of the next real
(non-#\partial) character whose index is > k, or (string-length string)
if there is no such character.

Note that (substring string start end) is undefined if (string-ref
string start) *or* (string-ref string end) is #\partial.

Note that (make-string k char) creates k copies of char, so the
resulting string-length may be different from k.  If char if #\partial
then the resulting string-length is 0.

This may seem a radical proposal, but it actually doesn't change/break
many R5RS idioms/code.
--
	--Per Bothner
xxxxxx@bothner.com   http://per.bothner.com/