Re: Surrogates and character representation

Show/hide message thread

Re: the "Unicode Background" section Thomas Lord (22 Jul 2005 03:28 UTC)

Surrogates and character representation Tom Emerson (22 Jul 2005 03:55 UTC)

Re: Surrogates and character representation John.Cowan (22 Jul 2005 04:09 UTC)

Re: Surrogates and character representation Tom Emerson (22 Jul 2005 04:26 UTC)

Re: Surrogates and character representation Thomas Bushnell BSG (23 Jul 2005 07:19 UTC)

Re: Surrogates and character representation Tom Emerson (23 Jul 2005 17:38 UTC)

Re: Surrogates and character representation John.Cowan (24 Jul 2005 05:37 UTC)

Re: Surrogates and character representation Shiro Kawai (24 Jul 2005 08:15 UTC)

Re: Surrogates and character representation Tom Emerson (24 Jul 2005 13:25 UTC)

Re: Surrogates and character representation Alan Watson (24 Jul 2005 17:32 UTC)

Re: Surrogates and character representation Tom Emerson (24 Jul 2005 17:54 UTC)

Re: Surrogates and character representation Alan Watson (24 Jul 2005 18:15 UTC)

Re: Surrogates and character representation Tom Emerson (24 Jul 2005 20:18 UTC)

Re: Surrogates and character representation Per Bothner (24 Jul 2005 18:25 UTC)

Re: Surrogates and character representation John.Cowan (24 Jul 2005 23:02 UTC)

Re: Surrogates and character representation Per Bothner (24 Jul 2005 23:26 UTC)

Re: Surrogates and character representation Alan Watson (25 Jul 2005 17:24 UTC)

Re: Surrogates and character representation bear (27 Jul 2005 16:16 UTC)

Re: Surrogates and character representation John.Cowan (24 Jul 2005 22:12 UTC)

Re: Surrogates and character representation Ken Dickey (24 Jul 2005 09:35 UTC)

Re: Surrogates and character representation Michael Sperber (24 Jul 2005 11:47 UTC)

Re: the "Unicode Background" section Matthew Flatt (22 Jul 2005 04:30 UTC)

Re: the "Unicode Background" section Alex Shinn (22 Jul 2005 05:42 UTC)

Re: the "Unicode Background" section bear (22 Jul 2005 15:45 UTC)

Re: the "Unicode Background" section Tom Emerson (22 Jul 2005 15:56 UTC)

Re: Surrogates and character representation bear 27 Jul 2005 16:16 UTC

On Mon, 25 Jul 2005, Alan Watson wrote:

>Files actually provide a fairly close analogy to the commonest means of
>representing Unicode strings.
>
>Imagine a file system that implements files as streams of bytes. Now
>imagine that you want to read the Nth *line*. The only way to do this is
>to read through the file until you have encounted N-1 newlines. This is
>like finding the Nth character when using UTF-8 for strings.
>
>Now imagine a file system that implements files as enumerated
>random-access records and uses exactly one record for each line. You can
>directly read the Nth line. This is like finding the Nth character when
>using UCS-32 for strings.
>
>Now imagine a file system that implements files as enumerated
>random-access records and uses one or more record for each line. This is
>like using UTF-16 for strings.

FWIW, I'm representing strings as trees, where each
branch of the tree records how many units are contained
within the leaves depending from that branch.  Initially
my only units were grapheme clusters, but now I keep
track of newlines and codepoints as well.

So, it's possible to navigate the triple-indexed tree by
any of the three indexes (codepints, graphemes, or lines)
and get to the right leaf.  Of course, having gotten
there, you may need to count from the beginning of the
leaf buffer to get to exactly where you wanted to be.
The result is that it's not actual constant-time random
access, but it's log(n) access in any of the reasonable
indexes. One important thing is that the size of the leaf
buffers is strictly limited to 1020 codepoints or less.
You don't have to start counting from the beginning of
the string.  Another is that different leaves can use
different fixed-width representations:  Latin1, BMP,
or UTF-32 depending on what is the widest codepoint in
that leaf's buffer.  In practice, 99% or more of every
long string is represented in ascii/Latin1.

I think a lot of people won't want to do this, but my
point is that it is possible to strike a good compromise
and multi-index your strings so that you can have fast
access in any of several representations.

				Bear