Fundamental operations on strings

Show/hide message thread
Fundamental operations on strings Shiro Kawai (13 Jul 2005 19:42 UTC)
Re: Fundamental operations on strings bear (14 Jul 2005 00:03 UTC)
Fundamental operations on strings Shiro Kawai 13 Jul 2005 19:42 UTC
The current draft says:

   Like in R5RS, a Scheme string is a fixed-length array
   of Scheme characters.

What R5RS says is "Strings are sequences of characters."  Although
having make-string, string-ref and string-set! as primitives
suggests R5RS authors had an array model in their mind, a string
can be implemented differently.
If strings are meant to manipulate texts, as stated in this srfi,
such alternative implementation can even be better than the simple
"fixed-length mutable array of characters" view.

In text, a character's meaning is strongly associated with other
characters that surrounds it, but not so much with its absolute
position.  Common operations on text are searching, substring
extraction, substring replacment and concatenation.   String-ref
does provide a basic building block for them, but I don't see how
useful string-set! can be, except when you stick to fixed-length
mutable array model, with allocating a fresh string by make-string
and fill it with string-set!.

Restricting string implementation to this mutable array model
bothers me in a couple of points.

- Except the "allocate and fill-in" operation, a fixed-length
  mutation provided by string-set! has little practical value.
  In applications, length-changing subustring replacement
  is far common, and replacing a single character is just a
  special case of it.

- If we restrict our idea of string to an array of characters,
  string type becomes merely a special type of a vector.
  Then, why do we need to have a whole bunch of string
  operations, apart from vector operations, in the language
  we want to keep concise?  What justifies to have a distinct
  string type is that its nature is so different from vectors.

If possible, I'd like to see strings become either:

a) A mutable sequence of characters with length-changing
   mutation operator (e.g. string-replace!) as a core procedure.
   The current string-set! is just a special case of
   string-replace!, and becomes a library procedure.

b) An immutable sequence of characters.  make-string becomes
   rather useless, so the fundamental string constructor should
   be list->string and vector->string (though, in practice, string
   ports would be used far often).  This allows very efficient
   substring extraction (which is, far more common than mutation).
   Mutable "text" library can be built on top of this fundamental
   string library, which will include length-changing mutation.

I personally see immutable strings cleaner.  It breaks backward
compatibility, yes, but do you let compatibility drag you
to leave warts in design?  Besides, if you concern about
existing code, count the use of string-set!, less the cases
where it is used for sequential string construction, which is
better be replaced for string ports.

Even if people think these changes are too radical and want to
keep compatibility, I'd still like R6RS doesn't push "fixed-length
mutable  array model" too much, allowing implementators to use
their imaginations to come up with better representations.

--shiro