Email list hosting service & mailing list manager

freshman-level Boyer-Moore fast string search William D Clinger (29 Jul 2005 05:31 UTC)
Re: freshman-level Boyer-Moore fast string search Per Bothner (29 Jul 2005 07:13 UTC)

Re: freshman-level Boyer-Moore fast string search Per Bothner 29 Jul 2005 07:13 UTC

William D Clinger wrote:
> With random access to the characters of s, the operation
> described by the previous paragraph runs in O(lg m) time.
> With random access to the bytes of s represented in UTF-8,
> however, the previous paragraph takes O(m) time.  The problem
> with UTF-8 is that we can't access s[i+m-1] without looking
> at all of the bytes that encode the characters that precede
> it.  We can look at the byte with index i+m-1, and we might
> even be able to calculate the character whose UTF-8 encoding
> includes the byte at byte index i+m-1 (I don't know Unicode
> well enough to know whether this is possible), but knowing
> that character doesn't help.  For the FLBM algorithm to work,
> we have to know the *character* with index i+m-1.

But a FLBM using UTF-8 strings would naturally index and compare
*bytes*, not characters.  I.e. the obvious algorithm would
let m be the number of *bytes* in the UTF-8 encoding of s0;
n the total number of bytes in the strings to be searched,
and the i in the results <s,i> would be byte offsets.

The pre-processing would be to generate a table, such that:
For each byte that occurs within s0, the table contains the index of the
last occurrence of that byte within s0.

So the question is whether this is still true: "In the best case,
however, the FLBM algorithm runs in O(n/m lg m) time.  This is also the
usual case in practice."  I suspect so.  Of course n and m are larger,
since they now measure bytes rather than characters, but these are
constant factors that will be dwarfed by other issues (such as cache
misses).
--
	--Per Bothner
xxxxxx@bothner.com   http://per.bothner.com/