Re: Surrogates and character representation
Alan Watson 29 Jul 2005 15:33 UTC
bear wrote:
>>(1) Are your "random" accesses into your corpus linguistics strings
>>really random, do they have significant locality, or could they be
>>arranged to have significant locality?
>
>
> Speaking for myself, I would say they are as close to random as
> makes no difference.
Thanks for your answer.
I think I'm convinced that plain UTF-8 is a losing representation for
this application. Or, generalizing, this application really needs
strings that have constant-time random access and not just linear-time
traversal.
If I wanted to rescue UTF-8 (because I really really really want to keep
conversion to UTF-8 as a constant-time operation), I could maintain a
vector of byte offsets to every Nth character.
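A minimal sketch of that offset-vector idea, in Python purely for illustration. The class name `IndexedUtf8`, its methods, and the default stride are hypothetical and not taken from any SRFI or implementation. It assumes the string is immutable and already valid UTF-8.

```python
class IndexedUtf8:
    """UTF-8 bytes plus a vector of byte offsets to every Nth character.

    Hypothetical illustration: names and the default stride are assumptions.
    """

    def __init__(self, text, stride=4):
        self.data = text.encode("utf-8")   # the underlying UTF-8 bytes
        self.stride = stride               # N: one offset per N characters
        self.offsets = []                  # byte offset of characters 0, N, 2N, ...
        count = 0
        for pos, byte in enumerate(self.data):
            # Any byte that is not a continuation byte (10xxxxxx) starts a character.
            if byte & 0xC0 != 0x80:
                if count % stride == 0:
                    self.offsets.append(pos)
                count += 1
        self.length = count                # length in characters, not bytes

    def _next(self, pos):
        """Return the byte offset of the character following the one at pos."""
        pos += 1
        while pos < len(self.data) and self.data[pos] & 0xC0 == 0x80:
            pos += 1
        return pos

    def char_at(self, index):
        """Return the character at a character index.

        Jumps to the nearest recorded offset, then steps forward over at
        most stride - 1 characters, so the cost is bounded by the stride.
        """
        if not 0 <= index < self.length:
            raise IndexError(index)
        pos = self.offsets[index // self.stride]
        for _ in range(index % self.stride):
            pos = self._next(pos)
        return self.data[pos:self._next(pos)].decode("utf-8")

    def to_utf8(self):
        """Conversion to UTF-8 just hands back the stored bytes."""
        return self.data
```

For a fixed N, each lookup touches a bounded number of bytes regardless of string length, and the offset vector costs one integer per N characters. A smaller N means faster lookups and a larger vector.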
Regards,
Alan
--
Dr Alan Watson
Centro de Radioastronomía y Astrofísica
Universidad Astronómico Nacional de México