Re: Surrogates and character representation Thomas Lord 27 Jul 2005 18:32 UTC

I'm surprised that nobody in the threads about
constant-time access to Unicode strings has
mentioned adaptive encoding forms.

My plan (and stalled code) works that way.  If a
string contains only codepoints in 0..255, store it as bytes.
If all of them fall in 0..ffff, use 16 bits per codepoint;
otherwise, use 32.

All access to a given codepoint position is O(1) that way.
Some mutations are worst-case linear in the length of the
string (the whole string may have to be re-encoded in a
wider form) but can be expected to be O(1).
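
A minimal sketch of the scheme described above, in Python.  This is
not the stalled code mentioned in the post: the class name
AdaptiveString, the helper _typecode_for, and the use of the array
module are all assumptions made for illustration.  It picks the
narrowest fixed-width form that holds every codepoint, indexes in
O(1), and re-encodes the whole string only when a store needs a wider
form.  It never narrows again after a mutation.

```python
from array import array

# array typecodes, narrowest first.  'L' is guaranteed to hold at least
# 32 bits; on some platforms it is wider, so "32" below is nominal.
_ORDER = "BHL"
_NOMINAL_BITS = {"B": 8, "H": 16, "L": 32}


def _typecode_for(cp):
    """Return the narrowest typecode able to hold codepoint cp."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode codepoint: %r" % (cp,))
    if cp <= 0xFF:
        return "B"
    if cp <= 0xFFFF:
        return "H"
    return "L"


class AdaptiveString:
    """Hypothetical fixed-width string whose width adapts to its contents."""

    def __init__(self, codepoints=()):
        cps = list(codepoints)
        self._units = array(_typecode_for(max(cps, default=0)), cps)

    @property
    def nominal_bits(self):
        return _NOMINAL_BITS[self._units.typecode]

    def __len__(self):
        return len(self._units)

    def __getitem__(self, i):
        # One unit per codepoint, so indexing is a plain array lookup.
        return self._units[i]

    def __setitem__(self, i, cp):
        needed = _typecode_for(cp)
        if _ORDER.index(needed) > _ORDER.index(self._units.typecode):
            # Worst case: copy every unit into a wider array, O(n).
            self._units = array(needed, self._units.tolist())
        self._units[i] = cp  # common case: O(1) store


s = AdaptiveString(map(ord, "cafe"))
width_before = s.nominal_bits
s[1] = 0x3B1          # GREEK SMALL LETTER ALPHA forces the 16-bit form
width_after = s.nominal_bits
```

Only the store that first exceeds the current width pays the linear
cost; later stores of codepoints that already fit are plain array
writes.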

Some strings need to be converted before being passed to
functions provided by the native environment.  On
the other hand, linguistic text is likely to be space
efficient.

This technique internally uses non-standard and
restricted encoding forms.  Surrogates are never
used in this representation as a stand-in for a
wider character and so there is no difficulty
handling unpaired surrogates.  (Even concatenating
one string ending in a high surrogate with one
beginning with a low surrogate produces the desirable
result: a string with two adjacent unpaired surrogates.)
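
The concatenation case can be sketched as follows, again in Python
and again as an illustration only.  The helper name concat and the
use of array typecodes for the three widths are assumptions, not part
of the original post.  The last lines contrast this with genuine
UTF-16, where the same two code units placed side by side are read as
one surrogate pair.

```python
from array import array

_ORDER = "BHL"  # 8-bit, 16-bit, and (at least) 32-bit forms


def concat(a, b):
    """Concatenate two fixed-width codepoint arrays (hypothetical helper).

    The result uses the wider of the two input forms.  Units are copied
    as-is, so surrogates are never combined into one character.
    """
    wide = max(a.typecode, b.typecode, key=_ORDER.index)
    return array(wide, a.tolist() + b.tolist())


high = array("H", [0xD83D])   # a string ending in a high surrogate
low = array("H", [0xDE00])    # a string beginning with a low surrogate
joined = concat(high, low)    # still two separate codepoints

# In real UTF-16 the same two code units form a surrogate pair.
utf16_text = (b"\x3d\xd8" + b"\x00\xde").decode("utf-16-le")
```

Here joined has length 2 and holds D83D and DE00 as distinct
codepoints, while utf16_text has length 1.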

-t