Re: Surrogates and character representation
Alan Watson 28 Jul 2005 17:02 UTC
Hi again,
The application of character indexes into a corpus is very interesting.
Thanks for bringing it up.
However, I wonder how bad UTF-8 really is. For example, if I want to
extract all of the prepositions, I can sort the character index ranges
and then make a single pass through the string. This is linear in the
string length, which is not as nice as random accesses to a UTF-32
vector, but isn't obviously a killer. (Especially when one thinks about
memory cache hierarchies and their effect on random accesses.)
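
To make the single-pass idea concrete, here is a minimal sketch in Python.
It is an illustration under assumptions, not anything from an existing
implementation: the function name extract_ranges and the representation of
ranges as (start, end) pairs of character indexes (end exclusive) are made
up for the example, and the input is assumed to be valid UTF-8.

```python
def extract_ranges(utf8, ranges):
    """Extract substrings given as (start, end) character-index pairs
    from a UTF-8 byte string, using one forward pass over the bytes.

    Hypothetical helper for illustration; assumes valid UTF-8 input.
    """
    # Sort every range boundary so that the scan never has to back up.
    boundaries = sorted({index for pair in ranges for index in pair})

    offsets = {}      # character index -> byte offset
    char_pos = 0      # character index of the scan position
    byte_pos = 0      # byte offset of the scan position
    n = len(utf8)

    for target in boundaries:
        while char_pos < target and byte_pos < n:
            byte_pos += 1
            # Skip continuation bytes, which have the form 10xxxxxx.
            while byte_pos < n and (utf8[byte_pos] & 0xC0) == 0x80:
                byte_pos += 1
            char_pos += 1
        offsets[target] = byte_pos

    # Results come back in the order the ranges were given.
    return [utf8[offsets[s]:offsets[e]].decode("utf-8") for s, e in ranges]
```

The scan touches each byte at most once, so the cost is the string length
plus the cost of sorting the boundaries, however many ranges are requested.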
There is a difference between using character indexes into UTF-8 with
locality (i.e., scanning forwards or backwards through a string or using
something like Boyer-Moore, which has a fair bit of locality) and real
random access. If the implementation caches the last character-to-byte
index conversion, the former can often be linear whereas the latter is
quadratic (string length times the number of accesses).
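
As a rough sketch of what caching the last conversion might look like,
consider the following Python. The class name Utf8Cursor, its methods, and
the steps counter are all hypothetical and exist only to illustrate the
idea; the input is assumed to be valid UTF-8.

```python
class Utf8Cursor:
    """Map character indexes to byte offsets in a UTF-8 byte string,
    remembering the most recent conversion (hypothetical sketch)."""

    def __init__(self, data):
        self.data = data
        self.char_pos = 0   # character index of the cached position
        self.byte_pos = 0   # byte offset of the cached position
        self.steps = 0      # bytes stepped over so far, for illustration

    def _is_continuation(self, i):
        # Continuation bytes have the form 10xxxxxx.
        return (self.data[i] & 0xC0) == 0x80

    def byte_offset(self, char_index):
        """Return the byte offset of the given character index, walking
        forwards or backwards from the cached position."""
        if char_index < 0:
            raise IndexError("negative character index")
        n = len(self.data)
        while self.char_pos < char_index and self.byte_pos < n:
            self.byte_pos += 1
            self.steps += 1
            while self.byte_pos < n and self._is_continuation(self.byte_pos):
                self.byte_pos += 1
                self.steps += 1
            self.char_pos += 1
        while self.char_pos > char_index:
            self.byte_pos -= 1
            self.steps += 1
            while self._is_continuation(self.byte_pos):
                self.byte_pos -= 1
                self.steps += 1
            self.char_pos -= 1
        return self.byte_pos

    def char_at(self, char_index):
        """Return the character at char_index as a one-character string."""
        start = self.byte_offset(char_index)
        end = start + 1
        while end < len(self.data) and self._is_continuation(end):
            end += 1
        return self.data[start:end].decode("utf-8")
```

With this arrangement, each lookup costs time proportional to the distance
from the previous lookup. Stepping through the indexes in order therefore
examines each byte about once, while accesses scattered across the string
can each cost up to the full string length.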
So, two questions:
(1) Are your "random" accesses into your corpus linguistics strings
really random, do they have significant locality, or could they be
arranged to have significant locality?
(2) Could you live with linear complexity to extract classes of substrings?
Regards,
Alan
--
Dr Alan Watson
Centro de Radioastronomía y Astrofísica
Universidad Nacional Autónoma de México