Re: freshman-level Boyer-Moore fast string search

Re: freshman-level Boyer-Moore fast string search William D Clinger 29 Jul 2005 08:33 UTC
Alex Shinn wrote:
 > Well, UTF-8 is in theory 1-4 bytes, and in practice you almost never
 > see 4-byte codepoints, and 1 byte codepoints are extremely common.

Sorry, I was thinking UTF-8 is 1-6 bytes, but that's out of date.

 > Is the three-stage assumption for binary trees?

No.  Stage 1: If the Unicode scalar value is less than 256, then
use it as an index into an array.  Stage 2: If it's greater than 255,
then look it up in a hash table or balanced search tree.  (A third
stage isn't necessary, but I thought it might suggest to you how
dividing a 21-bit scalar value into three 7-bit chunks can simulate,
with no real loss of efficiency apart from memory considerations,
the UTF-8 encoding that you thought was faster.)

 > You can't just dismiss a lg(m) factor.

I'm not.  I put that factor into my description of the FLBM algorithm
just to show you how easy it is to do better than O(m+Sigma) time and
space during the preprocessing phase.  Please remember that the whole
point of that algorithm is to make the basic ideas easy enough for
freshmen to understand.  In a real implementation of the real
Boyer-Moore algorithm, I would use something at least as fast as the
two-stage lookup described above.

 > Out of curiosity, what string representations does SRFI-75 penalize
 > which you consider to be poor?

Suppose each string s is represented by a vector of 2^21 elements,
where element i consists of a list of numbers, in IEEE double
precision, that represent the indexes within s at which the
character c appears, where c is the Unicode scalar value f(i),
where f is represented by a global association list that maps
scalar values to indexes (i.e. f-inverse).  SRFI-75 allows that
representation, yet penalizes it.  I also consider it to be a
poor representation.

Will