Re: Octet vs Char (Re: strings draft)

Re: Octet vs Char (Re: strings draft) bear 27 Jan 2004 19:06 UTC

On Tue, 27 Jan 2004, Ken Dickey wrote:

>On Tuesday 27 January 2004 09:32 am, bear wrote:
>> On Mon, 26 Jan 2004, Ken Dickey wrote:
>> >Well color me dumb, but I don't see why getting O(1) is such a big deal.
>...
>> O(1) reference or character setting comes at the expense of O(n)
>> insertions, deletions, and non-identical-sized replacements.
>>
>> EG, if I change "the" to "a" at the beginning of a long string, and
>> I've represented it as a vector to get O(1) reference time, the rest of
>> the string has to be copied to move it two character spaces in memory.

>I was puzzled by the ropes discussion here because it seemed to be orthogonal
>to the Unicode discussion.  I now see that it's because it _is_ orthogonal to
>the Unicode discussion.

The only thing that Unicode has to do with it is that Unicode
makes non-identical-sized replacements more likely, and makes
it more likely that the programmer will not realize that a given
operation involves non-identical-sized replacements.  Replacing
one codepoint with another may wind up being a replacement of a
character that takes 1 octet of UTF-8 to express with a character
that takes 3 octets of UTF-8 to express, or vice versa.  This sort
of thing is amenable to your proposed approach of indexed fallback
into another vector.
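
A minimal sketch of the octet-width point, written in Python purely for
illustration. The helper name replace_char_utf8 is hypothetical and is not
part of any proposal in this thread; it only shows that swapping one
codepoint for another can shift every later octet in a UTF-8 buffer.

```python
def replace_char_utf8(buf: bytearray, char_index: int, new_char: str) -> int:
    """Replace the codepoint at char_index in a UTF-8 buffer, in place.

    Returns how many octets the tail of the buffer had to move
    (positive = grew, negative = shrank, 0 = same width).
    """
    text = buf.decode("utf-8")
    # Octet offset of the target codepoint: width of everything before it.
    start = len(text[:char_index].encode("utf-8"))
    old_len = len(text[char_index].encode("utf-8"))
    new_bytes = new_char.encode("utf-8")
    # Slice assignment with a different length moves the rest of the buffer.
    buf[start:start + old_len] = new_bytes
    return len(new_bytes) - old_len


buf = bytearray("a rope".encode("utf-8"))
# "a" takes 1 octet in UTF-8; U+20AC EURO SIGN takes 3.
shift = replace_char_utf8(buf, 0, "\u20ac")
```

Here the string is still six codepoints long, but the buffer grows from six
octets to eight, so everything after the first character has moved.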

But replacing a character with a combining sequence of multiple
codepoints, or vice versa, is also likely; in fact the Unicode
Consortium's canonicalization algorithms do this all the time.
In this case you're looking at things like replacing

U+212B ANGSTROM SIGN
with
U+0041 LATIN CAPITAL LETTER A, U+030A COMBINING RING ABOVE

and if your implementation treats the former as one character
and the latter as two characters, which most do, you wind up
with the same need to copy the rest of the string that changing
"a" to "the" caused in ASCII strings.  This is not amenable to
your proposed approach of indexed fallback into another vector.
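
The ANGSTROM SIGN case can be checked with Python's standard unicodedata
module. This is only an illustration of the canonical forms involved; it
assumes an implementation that counts one codepoint as one character.

```python
import unicodedata

angstrom = "\u212b"  # ANGSTROM SIGN

# Canonical decomposition (NFD): A followed by COMBINING RING ABOVE.
decomposed = unicodedata.normalize("NFD", angstrom)

# Canonical composition (NFC): LATIN CAPITAL LETTER A WITH RING ABOVE.
composed = unicodedata.normalize("NFC", angstrom)

# In a longer string, decomposing changes the length, so every
# codepoint after the sign ends up at a different index.
before = "x\u212by"
after = unicodedata.normalize("NFD", before)
```

After decomposition the "y" sits at index 3 instead of index 2, which is
the same shift that changing "a" to "the" produces in an ASCII string.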

What this means is that, while on an absolute level Unicode and
rope representation are orthogonal issues, Unicode has patterns
of likely use that rely heavily on the most expensive operations
of vector representations.

And of course both came up here because the first draft of the
FFI SRFI wanted a C pointer to a mutable memory area containing
the internal representation of a Scheme string, and has to know
this kind of "detail" to even make sense of what it finds there.

As a result of the discussions here, I'm now considering
adding more types of string values, each with its own read
syntax and conversions.  For example:

#,(Latin-1-vector "hello world")
 would be an octet vector where each octet is a Latin-1
 character.  This would make binary I/O using string-like
 constructions possible and give C programs the kind of
 FFI value they wanted. No characters outside Latin-1
 would be allowed, of course.
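
A rough Python analogue of such a value, as a sketch only. The function
name latin1_vector is hypothetical and merely mirrors the read syntax
above; the Scheme design itself is not specified here.

```python
def latin1_vector(s: str) -> bytes:
    """Return one octet per character.

    Raises UnicodeEncodeError if s contains any character outside
    Latin-1 (U+0000..U+00FF).
    """
    return s.encode("latin-1")


octets = latin1_vector("hello world")

# A character outside Latin-1 is rejected rather than widened.
try:
    latin1_vector("\u20ac")  # EURO SIGN is not in Latin-1
    rejected = False
except UnicodeEncodeError:
    rejected = True
```

Because every character is exactly one octet, indexing and in-place
replacement stay O(1), and the buffer can be handed to C as-is.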

#,(UTF32-vector "hello world")
 would be a "string" indexed by Unicode codepoint rather
 than by character.  Handy for FFI, and also allows people
 to create invalid or non-canonical combining sequences,
 assign values that aren't even mapped codepoints to arbitrary
 locations, or do other linguistically wrong operations.
 However, converting it to a regular string would canonicalize
 it, and would fail if it contained non-characters.
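
A sketch of that conversion behaviour in Python. The names utf32_vector
and to_string are hypothetical, and two details are assumptions that the
text above does not settle: "canonicalize" is taken to mean NFC, and
"non-characters" is taken to mean Unicode noncharacters, surrogates, and
values beyond U+10FFFF.

```python
import unicodedata
from array import array


def utf32_vector(s: str) -> array:
    """Store a string as a mutable vector of codepoints (at least 32 bits each)."""
    return array("L", (ord(c) for c in s))


def is_invalid(cp: int) -> bool:
    """True for values assumed here to be unconvertible to a regular string."""
    return (
        cp > 0x10FFFF                    # beyond the Unicode codespace
        or 0xD800 <= cp <= 0xDFFF        # surrogates
        or 0xFDD0 <= cp <= 0xFDEF        # noncharacter block
        or (cp & 0xFFFE) == 0xFFFE       # U+nFFFE and U+nFFFF in every plane
    )


def to_string(vec: array) -> str:
    """Convert to a regular string, canonicalizing (NFC assumed) or failing."""
    for cp in vec:
        if is_invalid(cp):
            raise ValueError(f"cannot convert codepoint {cp:#x}")
    return unicodedata.normalize("NFC", "".join(chr(cp) for cp in vec))


vec = utf32_vector("A\u030a")   # two codepoints, non-canonical under NFC
text = to_string(vec)           # composes to U+00C5

vec[1] = 0xFFFF                 # the vector itself accepts any value
try:
    to_string(vec)
    failed = False
except ValueError:
    failed = True
```

The vector permits any assignment, including linguistically wrong ones;
only the conversion back to a regular string enforces validity.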

				Bear