Re: Surrogates and character representation John.Cowan 27 Jul 2005 19:54 UTC

Thomas Lord scripsit:

> My plan (and stalled code) works that way.  If a
> string contains only codepoints in 0..255, store it as bytes.
> 0..ffff, use 16-bits, otherwise, use 32.
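
The width selection quoted above could be sketched roughly as follows.
This is only an illustration, not the stalled code itself: the function
name storage_width is hypothetical, and Python strings stand in for
sequences of code points.

```python
def storage_width(text: str) -> int:
    """Return the bytes per character needed to store `text`: 1, 2, or 4.

    Hypothetical helper, assuming the rule quoted above: the highest
    code point in the string decides the width of every element.
    """
    highest = max(map(ord, text), default=0)  # an empty string fits in bytes
    if highest <= 0xFF:
        return 1      # all code points in 0..255: store as bytes
    if highest <= 0xFFFF:
        return 2      # all code points in 0..ffff: 16-bit units
    return 4          # anything higher: 32-bit units
```

A string is widened as a whole, so a single character above U+FFFF
makes every element of that string 32 bits wide.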

This is a plausible design.  If you are willing to pay more time to save
some more space, you could have multiple flavors of single-byte strings
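based on the dynamic windows of SCSU (the Standard Compression Scheme
for Unicode).  Keep a single overhead byte T with each single-byte
string that indicates the meaning of the byte range 80-FF (all values
below are hexadecimal, and x stands for the value of T):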

Value of T      Unicode offset  Comment
01..67          x*80            half-blocks from U+0080 to U+3380
68..A7          x*80+AC00       half-blocks from U+E000 to U+FF80
F9              00C0            Latin-1 letters + half of Latin Extended-A
FA              0250            IPA Extensions
FB              0370            Greek
FC              0530            Armenian
FD              3040            Hiragana
FE              30A0            Katakana
FF              FF60            Halfwidth Katakana

So your byte strings (range U+0000..U+00FF) would have a T byte of 01.
Of course there is no requirement to implement this entire scheme;
you can cherry-pick particular T values that make sense.
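
As a rough sketch of how the table might be applied, the following
turns a T byte into a window offset and decodes a single-byte string
with it.  The names FIXED_OFFSETS, window_offset and decode are
hypothetical, and the sketch assumes that bytes 00-7F always stand for
U+0000..U+007F, which the scheme above implies but does not state.

```python
# Fixed windows from the table: T value -> Unicode offset.
FIXED_OFFSETS = {
    0xF9: 0x00C0,  # Latin-1 letters + half of Latin Extended-A
    0xFA: 0x0250,  # IPA Extensions
    0xFB: 0x0370,  # Greek
    0xFC: 0x0530,  # Armenian
    0xFD: 0x3040,  # Hiragana
    0xFE: 0x30A0,  # Katakana
    0xFF: 0xFF60,  # Halfwidth Katakana
}


def window_offset(t: int) -> int:
    """Return the code point that byte 80 maps to for a given T byte."""
    if 0x01 <= t <= 0x67:
        return t * 0x80             # half-blocks U+0080..U+3380
    if 0x68 <= t <= 0xA7:
        return t * 0x80 + 0xAC00    # half-blocks U+E000..U+FF80
    if t in FIXED_OFFSETS:
        return FIXED_OFFSETS[t]
    raise ValueError(f"T value {t:#04x} is not in the table")


def decode(t: int, data: bytes) -> str:
    """Decode a single-byte string whose high half is selected by T."""
    offset = window_offset(t)
    # Assumption: bytes 00-7F are plain ASCII; bytes 80-FF index the window.
    return "".join(chr(b if b < 0x80 else offset + (b - 0x80)) for b in data)
```

With T = 01 the offset is 0080, so decode reduces to plain Latin-1;
with T = FB the byte C1 comes out as U+03B1 (Greek small alpha).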

--
As you read this, I don't want you to feel      John Cowan
sorry for me, because, I believe everyone       xxxxxx@reutershealth.com
will die someday.                               http://www.reutershealth.com
        --From a Nigerian-type scam spam        http://www.ccil.org/~cowan