Re: Octet vs Char (Re: strings draft)

Show/hide message thread

strings draft Tom Lord (22 Jan 2004 04:58 UTC)

Re: strings draft Shiro Kawai (22 Jan 2004 09:46 UTC)

Re: strings draft Tom Lord (22 Jan 2004 17:32 UTC)

Re: strings draft Shiro Kawai (23 Jan 2004 05:03 UTC)

Re: strings draft Tom Lord (24 Jan 2004 00:31 UTC)

Re: strings draft Matthew Dempsky (24 Jan 2004 03:00 UTC)

Re: strings draft Shiro Kawai (24 Jan 2004 03:27 UTC)

Re: strings draft Tom Lord (24 Jan 2004 04:18 UTC)

Re: strings draft Shiro Kawai (24 Jan 2004 04:49 UTC)

Re: strings draft Tom Lord (24 Jan 2004 18:47 UTC)

Re: strings draft Shiro Kawai (24 Jan 2004 22:16 UTC)

Octet vs Char (Re: strings draft) Shiro Kawai (26 Jan 2004 09:58 UTC)

Strings, one last detail. bear (30 Jan 2004 21:12 UTC)

Re: Strings, one last detail. Shiro Kawai (30 Jan 2004 21:43 UTC)

Re: Strings, one last detail. Tom Lord (31 Jan 2004 00:13 UTC)

Re: Strings, one last detail. bear (31 Jan 2004 20:26 UTC)

Re: Strings, one last detail. Tom Lord (31 Jan 2004 20:42 UTC)

Re: Strings, one last detail. bear (01 Feb 2004 02:29 UTC)

Re: Strings, one last detail. Tom Lord (01 Feb 2004 02:44 UTC)

Re: Strings, one last detail. bear (01 Feb 2004 07:53 UTC)

Re: Octet vs Char (Re: strings draft) bear (26 Jan 2004 19:04 UTC)

Re: Octet vs Char (Re: strings draft) Matthew Dempsky (26 Jan 2004 20:12 UTC)

Re: Octet vs Char (Re: strings draft) Matthew Dempsky (26 Jan 2004 20:40 UTC)

Re: Octet vs Char (Re: strings draft) Ken Dickey (27 Jan 2004 04:33 UTC)

Re: Octet vs Char Shiro Kawai (27 Jan 2004 05:12 UTC)

Re: Octet vs Char Tom Lord (27 Jan 2004 05:23 UTC)

Re: Octet vs Char bear (27 Jan 2004 08:35 UTC)

Re: Octet vs Char (Re: strings draft) bear (27 Jan 2004 08:33 UTC)

Re: Octet vs Char (Re: strings draft) Ken Dickey (27 Jan 2004 15:43 UTC)

Re: Octet vs Char (Re: strings draft) bear (27 Jan 2004 19:06 UTC)

Re: Octet vs Char Shiro Kawai (26 Jan 2004 23:39 UTC)

Re: strings draft bear (22 Jan 2004 19:05 UTC)

Re: strings draft Tom Lord (23 Jan 2004 01:53 UTC)

READ-OCTET (Re: strings draft) Shiro Kawai (23 Jan 2004 06:01 UTC)

Re: strings draft bear (23 Jan 2004 07:04 UTC)

Re: strings draft bear (23 Jan 2004 07:20 UTC)

Re: strings draft Tom Lord (24 Jan 2004 00:02 UTC)

Re: strings draft Alex Shinn (26 Jan 2004 01:59 UTC)

Re: strings draft Tom Lord (26 Jan 2004 02:22 UTC)

Re: strings draft bear (26 Jan 2004 02:35 UTC)

Re: strings draft Tom Lord (26 Jan 2004 02:48 UTC)

Re: strings draft Alex Shinn (26 Jan 2004 03:00 UTC)

Re: strings draft Tom Lord (26 Jan 2004 03:14 UTC)

Re: strings draft Shiro Kawai (26 Jan 2004 04:57 UTC)

Re: strings draft Alex Shinn (26 Jan 2004 04:58 UTC)

Re: strings draft tb@xxxxxx (23 Jan 2004 18:48 UTC)

Re: strings draft bear (24 Jan 2004 02:21 UTC)

Re: strings draft tb@xxxxxx (23 Jan 2004 02:10 UTC)

Re: strings draft Tom Lord (23 Jan 2004 02:29 UTC)

Re: strings draft tb@xxxxxx (23 Jan 2004 02:44 UTC)

Re: strings draft Tom Lord (23 Jan 2004 02:53 UTC)

Re: strings draft tb@xxxxxx (23 Jan 2004 03:04 UTC)

Re: strings draft Tom Lord (23 Jan 2004 03:16 UTC)

Re: strings draft tb@xxxxxx (23 Jan 2004 03:42 UTC)

Re: strings draft Alex Shinn (23 Jan 2004 02:35 UTC)

Re: strings draft tb@xxxxxx (23 Jan 2004 02:42 UTC)

Re: strings draft Tom Lord (23 Jan 2004 02:49 UTC)

Re: strings draft Alex Shinn (23 Jan 2004 02:58 UTC)

Re: strings draft tb@xxxxxx (23 Jan 2004 03:13 UTC)

Re: strings draft Alex Shinn (23 Jan 2004 03:19 UTC)

Re: strings draft Bradd W. Szonye (23 Jan 2004 19:31 UTC)

Re: strings draft Alex Shinn (26 Jan 2004 02:22 UTC)

Re: strings draft Bradd W. Szonye (06 Feb 2004 23:30 UTC)

Re: strings draft Bradd W. Szonye (06 Feb 2004 23:33 UTC)

Re: strings draft Alex Shinn (09 Feb 2004 01:45 UTC)

specifying source encoding (Re: strings draft) Shiro Kawai (09 Feb 2004 02:51 UTC)

Re: strings draft Bradd W. Szonye (09 Feb 2004 03:39 UTC)

Re: strings draft tb@xxxxxx (23 Jan 2004 03:12 UTC)

Re: strings draft Alex Shinn (23 Jan 2004 03:28 UTC)

Re: strings draft tb@xxxxxx (23 Jan 2004 03:44 UTC)

Parsing Scheme [was Re: strings draft] Ken Dickey (23 Jan 2004 17:02 UTC)

Re: Parsing Scheme [was Re: strings draft] bear (23 Jan 2004 17:56 UTC)

Re: Parsing Scheme [was Re: strings draft] tb@xxxxxx (23 Jan 2004 18:50 UTC)

Re: Parsing Scheme [was Re: strings draft] Per Bothner (23 Jan 2004 18:56 UTC)

Re: Parsing Scheme [was Re: strings draft] Tom Lord (23 Jan 2004 20:26 UTC)

Re: Parsing Scheme [was Re: strings draft] Per Bothner (23 Jan 2004 20:57 UTC)

Re: Parsing Scheme [was Re: strings draft] Tom Lord (23 Jan 2004 21:44 UTC)

Re: Parsing Scheme [was Re: strings draft] Tom Lord (23 Jan 2004 20:07 UTC)

Re: Parsing Scheme [was Re: strings draft] tb@xxxxxx (23 Jan 2004 21:22 UTC)

Re: Parsing Scheme [was Re: strings draft] Tom Lord (23 Jan 2004 22:38 UTC)

Re: Parsing Scheme [was Re: strings draft] tb@xxxxxx (24 Jan 2004 06:48 UTC)

Re: Parsing Scheme [was Re: strings draft] Tom Lord (24 Jan 2004 18:41 UTC)

Re: Parsing Scheme [was Re: strings draft] tb@xxxxxx (24 Jan 2004 19:34 UTC)

Re: Parsing Scheme [was Re: strings draft] Tom Lord (24 Jan 2004 21:48 UTC)

Re: Parsing Scheme [was Re: strings draft] Ken Dickey (23 Jan 2004 21:47 UTC)

Re: Parsing Scheme [was Re: strings draft] Tom Lord (23 Jan 2004 23:22 UTC)

Re: Parsing Scheme [was Re: strings draft] Ken Dickey (25 Jan 2004 01:03 UTC)

Re: Parsing Scheme [was Re: strings draft] Tom Lord (25 Jan 2004 03:01 UTC)

Re: strings draft Matthew Dempsky (25 Jan 2004 06:59 UTC)

Re: strings draft Tom Lord (25 Jan 2004 07:16 UTC)

Re: strings draft Matthew Dempsky (26 Jan 2004 23:52 UTC)

Re: strings draft Tom Lord (27 Jan 2004 00:30 UTC)

Re: Octet vs Char (Re: strings draft) bear 27 Jan 2004 08:32 UTC

On Mon, 26 Jan 2004, Ken Dickey wrote:

>Well color me dumb, but I don't see why getting O(1) is such a big deal.
>
>Let's say an implementation reads UTF-8 chars.  A uni-string could be
>represented as a tuple.  The required part is a byte-vector of characters
>that fit into 8 bits.  If characters have 16 bit representation, use a
>second, 16-bit vector and a map of which chars are the long ones.  If 16 bit,
>index into the 2nd array, else into the first.  The 2nd array could be
>shortened by using the byte-vector to hold indexes into the 16-bit array in
>the mapped/unused bytevector slots.  In the "worst case" (storage wise), we
>only need the 16-bit array.  The mapping could be a bit-vector or whatever
>makes sense in a particular implementation.
>
>It is true that there are more code paths in this scheme (sic), but each kind
>of representation "knows that" and this still makes refs & sets ~ O(1).
>
>What am I missing here?

O(1) reference or character setting comes at the expense of O(n) insertions,
deletions, and non-identical-sized replacements.

EG, if I change "the" to "a" at the beginning of a long string, and
I've represented it as a vector to get O(1) reference time, the rest of
the string has to be copied to move it two character spaces in memory.

This is no big deal, on the same order as a function call overhead, for
strings of 250 characters or less.  But it starts to be a very big deal
when the string is the size of a large novel, around  2 million
characters.

Likewise, when the application decides it wants to make a copy of a
string and change just the copy, the naive implementation using strings
made of byte vectors copies the whole thing to a new vector allocated
for a new variable, then copies it again within that vector to move the
bulk of the new string 2 spaces.  A more sophisticated implementation
still copies the whole thing at least once.

Implementing ropes allows the system instead to copy just the root node
of the string's Btree (constant-size unit) in order to copy a string,
and then copy tree nodes down to the substring where "the" will be
changed to "a", make a copy of just that substring, and shift the bulk
of just that copy 2 characters.  If the strings are large, this saves
you a bunch of CPU and a bunch of memory, since the bulk of the structure
- all the tree nodes except a single root-to-leaf path and all the
substrings but the one changed - is still shared between the two strings.

Below, I'm comparing ropes to the simplest type of vector string,
in which all characters are equal width.  You've offered a
technique to get similar performance with strings having unequal-
width characters (such as UTF-8 encoding) and I think it'd work,
but here are still compelling reasons in at least some applications
to choose ropes.

Here's how the operations break down (use a constant-width font):
==========================================================
                     "Rope" string     |  vector string
indexed reference;                     |
                        O(log n)       |  O(1)
      -------------------------------------------------
indexed mutation                       |
                        O(log n)       |   O(1)
      -------------------------------------------------
string-copy                            |
                        O(1),          |  O(n),
                        Omega(1)       |  omega(n)
      -------------------------------------------------
unequal-size mutation                  |
                       O(log n)        |  O(n)
                       Omega(log n) *1 |  Omega(1)
      -------------------------------------------------

*1; Unequal-size mutation uses Omega(log n) memory only if the data
    is shared with another string (ie, it uses part of the memory
    spaces that has already been saved in a string-copy operation).
    If the data is unshared, as with vector strings, it's Omega(1).
===========================================================

As you can see, vector strings win on indexed reference and indexed
mutation, but rope strings win everywhere else. It is possible to use
more memory doing an unequal-size mutation, but that's only in the
case that the vector string would already have used even more memory
in a string-copy operation.

Order(1) indexed reference and indexed single-character mutation is
nice, but the time and space taken by the other operations starts
becoming untenable when the size of the strings gets sufficiently
large.

If you need to know which is best for your application, you may
want to instrument your code and use a profiler to find out just
how much time you're spending copying strings.

					Bear