Re: Surrogates and character representation
Alex Shinn 28 Jul 2005 01:45 UTC
On 7/28/05, Alan Watson <xxxxxx@astrosmo.unam.mx> wrote:
> William D Clinger wrote:
> > Per Bothner wrote:
> > > Random accesses to a position in a string that has not
> > > been previously accessed is not in itself useful.
> >
> > Untrue. The Boyer-Moore algorithm for fast string
> > searching uses random accesses to positions that have
> > not been previously accessed [1].
>
> Yes, but I think you can implement this for UTF-8 or UTF-16 strings
> using offsets to the underlying bytes or shorts. I don't think that you
> need character offsets.
True. In fact any existing Boyer-Moore search that works at
the 8-bit byte level will work unmodified on UTF-8 strings.
Furthermore, the true cost of BM includes a preprocessing step
of O(m+sigma) in both time and space, where m is the pattern
length and sigma is the size of the alphabet. For the byte-level
UTF-8 search, sigma is 256, but for UTF-32 sigma is 2^21 = 2097152.
For patterns that are going to be repeatedly searched, it is reasonable
to keep the table for UTF-8 in memory, but the table for UTF-32,
if using a SRFI-4 u32vector for offsets, would incur an extra 8MB of
memory per pattern. This isn't reasonable - in practice you'll want
to perform the preprocessing step every time for UTF-32. UTF-8 wins
hands down for string searching.
--
Alex