Re: Surrogates and character representation
John.Cowan 27 Jul 2005 19:54 UTC
Thomas Lord scripsit:
> My plan (and stalled code) works that way. If a
> string contains only codepoints in 0..255, store it as bytes.
> 0..ffff, use 16-bits, otherwise, use 32.
This is a plausible design. If you are willing to pay more time to save
some more space, you could have multiple flavors of single-byte strings
based on SCSU dynamic windows. Keep a single overhead byte T with each
single-byte string that indicates the meaning of the byte range 80-FF:
Value of T (hex)   Unicode offset (hex)   Comment
01..67             T*80                   half-blocks from U+0080 to U+3380
68..A7             T*80+AC00              half-blocks from U+E000 to U+FF80
F9                 00C0                   Latin-1 letters + half of Latin Extended-A
FA                 0250                   IPA Extensions
FB                 0370                   Greek
FC                 0530                   Armenian
FD                 3040                   Hiragana
FE                 30A0                   Katakana
FF                 FF60                   Halfwidth Katakana
So your byte strings (range U+0000..U+00FF) would have a T byte of 01.
Of course there is no requirement to implement this entire scheme;
you can cherry-pick particular T values that make sense.
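
A rough sketch of how the T byte could be interpreted, in Python purely
for illustration. The function names are hypothetical, and I am assuming
that bytes 00-7F always mean U+0000..U+007F while bytes 80-FF are mapped
through the window selected by T. The offsets are the ones in the table
above; T values not listed there are treated as unassigned.

```python
# Fixed offsets for the special T values from the table (all hex).
SPECIAL_OFFSETS = {
    0xF9: 0x00C0,  # Latin-1 letters + half of Latin Extended-A
    0xFA: 0x0250,  # IPA Extensions
    0xFB: 0x0370,  # Greek
    0xFC: 0x0530,  # Armenian
    0xFD: 0x3040,  # Hiragana
    0xFE: 0x30A0,  # Katakana
    0xFF: 0xFF60,  # Halfwidth Katakana
}


def window_offset(t):
    """Return the code point that byte 0x80 stands for under tag byte t."""
    if 0x01 <= t <= 0x67:
        return t * 0x80
    if 0x68 <= t <= 0xA7:
        return t * 0x80 + 0xAC00
    if t in SPECIAL_OFFSETS:
        return SPECIAL_OFFSETS[t]
    raise ValueError(f"unassigned T value: {t:#04x}")


def decode(t, data):
    """Expand a single-byte string into a Python str.

    Assumption: bytes below 0x80 are taken as U+0000..U+007F unchanged.
    """
    base = window_offset(t)
    return "".join(
        chr(b) if b < 0x80 else chr(base + (b - 0x80)) for b in data
    )


def encode(t, text):
    """Return the single-byte form of text under t, or None if it does not fit."""
    base = window_offset(t)
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)
        elif base <= cp < base + 0x80:
            out.append(0x80 + (cp - base))
        else:
            return None  # caller falls back to 16-bit or 32-bit storage
    return bytes(out)
```

With T = 01 the window starts at U+0080, so encode(0x01, ...) accepts
exactly the strings that fit in U+0000..U+00FF; with T = FB a string of
ASCII plus Greek letters still fits in one byte per character.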
--
As you read this, I don't want you to feel      John Cowan
sorry for me, because, I believe everyone       xxxxxx@reutershealth.com
will die someday.                               http://www.reutershealth.com
        --From a Nigerian-type scam spam        http://www.ccil.org/~cowan