Hi Will,


> Note also that many texts are quite short, which is why some
> systems have been able to get by with UTF-8 or UTF-16.  The
> binary logarithm of 32 is 5, putting O(lg n) representations
> at a five-fold disadvantage, and they can't make up for that
> in simplicity because the O(1) algorithms are already pretty
> simple.

Well, if Taylan's explanation is correct, and your subpart size is greater than 32, you could end up doing a linear scan of the entire 32-codepoint string when it is stored in a UTF-8 backend, because the whole short string fits inside a single subpart.  That would be 32 operations to the logarithmic's 5...
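
To make concrete what I mean by a linear scan, here is a rough Python sketch (my own toy code and made-up names, not anyone's actual backend): finding codepoint n in a UTF-8 buffer means walking every byte before it, so indexing near the end of a 32-codepoint string costs on the order of 32 steps.

# Toy sketch only -- not taken from any real implementation.
def utf8_codepoint_offset(data: bytes, n: int) -> int:
    """Byte offset of the n-th codepoint (0-based) in `data`."""
    seen = 0
    for offset, byte in enumerate(data):
        # Bytes of the form 10xxxxxx are continuation bytes;
        # every other byte starts a new codepoint.
        if (byte & 0xC0) != 0x80:
            if seen == n:
                return offset
            seen += 1
    raise IndexError("codepoint index out of range")

data = ("αβγδ" * 8).encode("utf-8")      # a 32-codepoint string
print(utf8_codepoint_offset(data, 31))   # walks nearly the whole buffer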

Although I suppose if your system dynamically selects the subpart size so that the 32-codepoint string gets subparts of 4 codepoints each, you'd be scanning at most 4 codepoints of UTF-8 per access and still beating the logarithmic algorithm.  But dynamic selection of subpart size would reduce structure sharing, I'd think.
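
Roughly what I have in mind, again only a sketch with a subpart layout I'm guessing at: record the byte offset of every 4th codepoint up front, so an access is one table lookup plus a UTF-8 scan within a single 4-codepoint subpart.

# Toy sketch only -- the "subpart" layout is my guess, not a real backend.
def build_subpart_index(data: bytes, k: int) -> list:
    """Byte offsets of codepoints 0, k, 2k, ... in `data`."""
    offsets, seen = [], 0
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:        # leading byte => new codepoint
            if seen % k == 0:
                offsets.append(offset)
            seen += 1
    return offsets

def indexed_codepoint_offset(data: bytes, index: list, k: int, n: int) -> int:
    """Byte offset of codepoint n: one lookup, then a short scan."""
    offset = index[n // k]               # jump to the subpart's start
    remaining = n % k                    # at most k - 1 codepoints to skip
    while remaining > 0:
        offset += 1
        if (data[offset] & 0xC0) != 0x80:
            remaining -= 1
    return offset

data = ("αβγδ" * 8).encode("utf-8")      # 32 codepoints
idx = build_subpart_index(data, 4)       # subparts of 4 codepoints each
print(indexed_codepoint_offset(data, idx, 4, 31))  # lookup + short scan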

Sincerely,
AmkG