Re: freshman-level Boyer-Moore fast string search
William D Clinger 29 Jul 2005 08:33 UTC
Alex Shinn wrote:
> Well, UTF-8 is in theory 1-4 bytes, and in practice you almost never
> see 4-byte codepoints, and 1 byte codepoints are extremely common.
Sorry, I was thinking UTF-8 is 1-6 bytes, but that's out of date.
> Is the three-stage assumption for binary trees?
No. Stage 1: If the Unicode scalar value is less than 256, then
use it as an index into an array. Stage 2: If it's greater than 255,
then look it up in a hash table or balanced search tree. (A third
stage isn't necessary, but I thought it might suggest to you how
dividing a 21-bit scalar value into three 7-bit chunks can simulate,
with no real loss of efficiency apart from memory considerations,
the UTF-8 encoding that you thought was faster.)
> You can't just dismiss a lg(m) factor.
I'm not. I put that factor into my description of the FLBM algorithm
just to show you how easy it is to do better than O(m+Sigma) time and
space during the preprocessing phase. Please remember that the whole
point of that algorithm is to make the basic ideas easy enough for
freshmen to understand. In a real implementation of the real
Boyer-Moore algorithm, I would use something at least as fast as the
two-stage lookup described above.
> Out of curiosity, what string representations does SRFI-75 penalize
> which you consider to be poor?
Suppose each string s is represented by a vector of 2^21 elements,
where element i consists of a list of numbers, in IEEE double
precision, that represent the indexes within s at which the
character c appears, where c is the Unicode scalar value f(i),
where f is represented by a global association list that maps
scalar values to indexes (i.e. f-inverse). SRFI-75 allows that
representation, yet penalizes it. I also consider it to be a
poor representation.
Will