Re: Surrogates and character representation Alan Watson 28 Jul 2005 17:02 UTC

Hi again,

The application of character indexes into a corpus is very interesting.
Thanks for bringing it up.

However, I wonder how bad UTF-8 really is. For example, if I want to
extract all of the prepositions, I can sort the character index ranges
and then make a single pass through the string. Apart from the sort,
this is linear in the string length, which is not as nice as random
accesses to a UTF-32 vector, but isn't obviously a killer. (Especially
when one thinks about memory cache hierarchies and their effect on
random accesses.)
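
A minimal sketch of that single-pass extraction, in Python for illustration only. The function names `char_to_byte_offsets` and `extract_ranges` are hypothetical, and the sketch assumes the input is valid UTF-8 and that ranges are (start, end) character indexes with an exclusive end.

```python
def char_to_byte_offsets(data: bytes, char_indexes):
    """Map character indexes to byte offsets in UTF-8 `data`
    with one forward scan over the bytes."""
    wanted = sorted(set(char_indexes))   # sort once, then scan once
    offsets = {}
    w = 0          # position in `wanted`
    char_pos = 0   # index of the character starting at the current byte
    for byte_pos, b in enumerate(data):
        if (b & 0xC0) == 0x80:
            continue                     # continuation byte, not a character start
        if w < len(wanted) and wanted[w] == char_pos:
            offsets[char_pos] = byte_pos
            w += 1
        char_pos += 1
    # An index equal to the character count maps to the end of the data.
    if w < len(wanted) and wanted[w] == char_pos:
        offsets[char_pos] = len(data)
        w += 1
    if w < len(wanted):
        raise IndexError("character index out of range")
    return offsets


def extract_ranges(data: bytes, ranges):
    """Return the substring for each (start, end) character range,
    in the order the ranges were given."""
    boundaries = [i for r in ranges for i in r]
    offsets = char_to_byte_offsets(data, boundaries)
    return [data[offsets[s]:offsets[e]].decode("utf-8") for s, e in ranges]
```

The cost is one sort of the requested indexes plus one scan of the bytes, however many ranges are requested.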

There is a difference between using character indexes into UTF-8 with
locality (i.e., scanning forwards or backwards through a string or using
something like Boyer-Moore, which has a fair bit of locality) and real
random access. If the implementation caches the last character-to-byte
index conversion, the former can often be linear overall, whereas the
latter is quadratic (string length times the number of accesses).
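
A rough sketch of such a cache, again in Python for illustration. The class `CachedUtf8Index` and its `bytes_scanned` counter are hypothetical names, not part of any existing implementation, and the data is assumed to be valid UTF-8.

```python
class CachedUtf8Index:
    """Character-indexed access to UTF-8 bytes that remembers the last
    character-to-byte conversion and resumes scanning from there."""

    def __init__(self, data: bytes):
        self.data = data
        self.char_pos = 0       # cached character index
        self.byte_pos = 0       # byte offset of that character
        self.bytes_scanned = 0  # total bytes walked, for illustration

    def byte_offset(self, char_index: int) -> int:
        if char_index < 0:
            raise IndexError("negative character index")
        data = self.data
        c, b = self.char_pos, self.byte_pos
        if char_index < c - char_index:
            c, b = 0, 0         # the start is closer than the cached position
        start = b
        while c < char_index:   # walk forward one character at a time
            if b >= len(data):
                raise IndexError("character index out of range")
            b += 1
            while b < len(data) and (data[b] & 0xC0) == 0x80:
                b += 1          # skip continuation bytes
            c += 1
        while c > char_index:   # walk backward one character at a time
            b -= 1
            while (data[b] & 0xC0) == 0x80:
                b -= 1
            c -= 1
        self.bytes_scanned += abs(b - start)
        self.char_pos, self.byte_pos = c, b
        return b

    def char_at(self, char_index: int) -> str:
        start = self.byte_offset(char_index)
        end = self.byte_offset(char_index + 1)
        return self.data[start:end].decode("utf-8")
```

With sequential or nearby indexes each call walks only a few bytes from the cached position, so a full scan stays linear. With widely scattered indexes each call may walk a large part of the string.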

So, two questions:

(1) Are your "random" accesses into your corpus linguistics strings
really random, do they have significant locality, or could they be
arranged to have significant locality?

(2) Could you live with linear complexity to extract classes of substrings?

Regards,

Alan
--
Dr Alan Watson
Centro de Radioastronomía y Astrofísica
Universidad Astronómico Nacional de México