Email list hosting service & mailing list manager

Re: Surrogates and character representation William D Clinger (27 Jul 2005 15:16 UTC)
Re: Surrogates and character representation Tom Emerson (27 Jul 2005 15:54 UTC)
Re: Surrogates and character representation Alex Shinn (28 Jul 2005 01:54 UTC)
Re: Surrogates and character representation Tom Emerson (28 Jul 2005 03:08 UTC)
Re: Surrogates and character representation Alex Shinn (28 Jul 2005 03:16 UTC)
Re: Surrogates and character representation Tom Emerson (28 Jul 2005 03:21 UTC)
Re: Surrogates and character representation Per Bothner (28 Jul 2005 03:43 UTC)
Re: Surrogates and character representation Tom Emerson (28 Jul 2005 03:59 UTC)
Re: Surrogates and character representation bear (28 Jul 2005 08:24 UTC)
Re: Surrogates and character representation Shiro Kawai (28 Jul 2005 10:06 UTC)
Re: Surrogates and character representation Per Bothner (28 Jul 2005 15:34 UTC)
Re: Surrogates and character representation Tom Emerson (28 Jul 2005 16:48 UTC)
Re: Surrogates and character representation Alan Watson (28 Jul 2005 17:03 UTC)
Re: Surrogates and character representation bear (28 Jul 2005 22:36 UTC)
Re: Surrogates and character representation Alan Watson (29 Jul 2005 15:34 UTC)
Re: Surrogates and character representation John.Cowan (27 Jul 2005 16:16 UTC)
Re: Surrogates and character representation Per Bothner (28 Jul 2005 00:06 UTC)
Re: Surrogates and character representation John Cowan (28 Jul 2005 05:35 UTC)
Re: Surrogates and character representation Alan Watson (27 Jul 2005 17:47 UTC)
Re: Surrogates and character representation Alex Shinn (28 Jul 2005 01:46 UTC)

Re: Surrogates and character representation Alex Shinn 28 Jul 2005 01:45 UTC

On 7/28/05, Alan Watson <xxxxxx@astrosmo.unam.mx> wrote:
> William D Clinger wrote:
> > Per Bothner wrote:
> >  > Random accesses to a position in a string that has not
> >  > been previously accessed is not in itself useful.
> >
> > Untrue.  The Boyer-Moore algorithm for fast string
> > searching uses random accesses to positions that have
> > not been previously accessed [1].
>
> Yes, but I think you can implement this for UTF-8 or UTF-16 strings
> using offsets to the underlying bytes or shorts. I don't think that you
> need character offsets.

True.  In fact any existing Boyer-Moore search that works at
the 8-bit byte level will work unmodified on UTF-8 strings.

Furthermore, the true cost of BM includes a preprocessing step
of O(m+sigma) in both time and space, where m is the pattern
length and sigma is the size of the alphabet.  For the byte-level
UTF-8 search, sigma is 256, but for UTF-32 sigma is 2^21 = 2097152.
For patterns that are going to be repeatedly searched, it is reasonable
to keep the table for UTF-8 in memory, but the table for UTF-32,
if using a SRFI-4 u32vector for offsets, would incur an extra 8MB of
memory per pattern.  This isn't reasonable - in practice you'll want
to perform the preprocessing step every time for UTF-32.  UTF-8 wins
hands down for string searching.

--
Alex