Re: Surrogates and character representation Alan Watson 29 Jul 2005 15:33 UTC
bear wrote:
>>(1) Are your "random" accesses into your corpus linguistics strings
>>really random, do they have significant locality, or could they be
>>arranged to have significant locality?
>
> Speaking for myself, I would say they are as close to random as
> makes no difference.

Thanks for your answer.

I think I'm convinced that representing strings in plain UTF-8 is a
losing representation for this application. Or, generalizing, this
application really needs strings that have constant-time random access
and not just linear-time traversal.

If I wanted to rescue UTF-8 (because I really really really want to keep
conversion to UTF-8 as a constant-time operation), I could maintain a
vector of byte offsets to every Nth character.

Regards,

Alan
--
Dr Alan Watson
Centro de Radioastronomía y Astrofísica
Universidad Astronómico Nacional de México
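
P.S. To make the offset-vector idea concrete, here is a minimal sketch in
Python. It is only an illustration under stated assumptions: the class name
IndexedUtf8, its method names, and the default stride of 64 are all
hypothetical and come from nowhere in this thread. The string is kept as
UTF-8 bytes, and a side table records the byte offset of every Nth
character (code point).

```python
class IndexedUtf8:
    """UTF-8 bytes plus a table of byte offsets to every Nth character.

    Hypothetical illustration: the names and the default stride are
    assumptions made for this sketch.
    """

    def __init__(self, text, stride=64):
        if stride < 1:
            raise ValueError("stride must be at least 1")
        self.data = text.encode("utf-8")
        self.stride = stride
        self.offsets = []  # offsets[k] is the byte offset of character k * stride
        self.length = 0    # number of characters (code points)
        for pos, byte in enumerate(self.data):
            # Bytes of the form 10xxxxxx are continuation bytes; any other
            # byte starts a new character.
            if byte & 0xC0 != 0x80:
                if self.length % stride == 0:
                    self.offsets.append(pos)
                self.length += 1

    @staticmethod
    def _width(lead):
        """Length in bytes of the character whose lead byte is `lead`."""
        if lead < 0x80:
            return 1
        if lead < 0xE0:
            return 2
        if lead < 0xF0:
            return 3
        return 4

    def char_at(self, index):
        """Return the character at `index`, scanning at most stride - 1 characters."""
        if not 0 <= index < self.length:
            raise IndexError(index)
        # Jump to the nearest recorded offset at or before the target...
        pos = self.offsets[index // self.stride]
        # ...then step forward one character at a time.
        for _ in range(index % self.stride):
            pos += self._width(self.data[pos])
        end = pos + self._width(self.data[pos])
        return self.data[pos:end].decode("utf-8")

    def to_utf8(self):
        """The stored bytes already are UTF-8, so no conversion work is needed."""
        return self.data
```

With a fixed N, a lookup touches at most N - 1 characters after the jump, so
access is bounded by a constant, at a cost of one stored offset per N
characters. Conversion to UTF-8 just hands back the existing buffer.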