Re: freshman-level Boyer-Moore fast string search
Per Bothner 29 Jul 2005 07:13 UTC
William D Clinger wrote:
> With random access to the characters of s, the operation
> described by the previous paragraph runs in O(lg m) time.
> With random access to the bytes of s represented in UTF-8,
> however, the previous paragraph takes O(m) time. The problem
> with UTF-8 is that we can't access s[i+m-1] without looking
> at all of the bytes that encode the characters that precede
> it. We can look at the byte with index i+m-1, and we might
> even be able to calculate the character whose UTF-8 encoding
> includes the byte at byte index i+m-1 (I don't know Unicode
> well enough to know whether this is possible), but knowing
> that character doesn't help. For the FLBM algorithm to work,
> we have to know the *character* with index i+m-1.
But a FLBM using UTF-8 strings would naturally index and compare
*bytes*, not characters. I.e. the obvious algorithm would
let m be the number of *bytes* in the UTF-8 encoding of s0;
n the total number of bytes in the strings to be searched,
and the i in the results <s,i> would be byte offsets.
The pre-processing would be to generate a table, such that:
For each byte that occurs within s0, the table contains the index of the
last occurrence of that byte within s0.
So the question is whether this is still true: "In the best case,
however, the FLBM algorithm runs in O(n/m lg m) time. This is also the
usual case in practice." I suspect so. Of course n and m are larger,
since they now measure bytes rather than characters, but these are
constant factors that will be dwarfed by other issues (such as cache
misses).
--
--Per Bothner
xxxxxx@bothner.com http://per.bothner.com/