The purpose of this note is to show how random access
to the characters of a string improves the efficiency
of string searching beyond what is possible with random
access to the bytes of a UTF-8 string.
This can be demonstrated using the Boyer-Moore algorithm
for fast string searching, but that algorithm is fairly
hard to analyze. For this note, I use a simplified form
of the Boyer-Moore algorithm, which I call the Freshman-
Level Boyer-Moore (FLBM) algorithm because it is simple
enough to be explained in freshman-level courses. The
difference between the full Boyer-Moore algorithm and
the FLBM algorithm is that the automaton used in the FLBM
algorithm returns to its start state following every random
access into the string being searched.
The FLBM algorithm
------------------
Given: a string s0 of length m for which we are to search,
and a set of strings to be searched for occurrences of s0
as a substring. Let n be the total length of the strings
to be searched.
Returns: a set of pairs of the form <s,i>, where s is one
of the strings that was searched, i is an index into s, and
s0 is a substring of s beginning at i.
Preprocessing
-------------
The FLBM algorithm begins by pre-processing s0 to compute
the following table. For each character that occurs within
s0, the table contains the index of the last occurrence of
that character within s0.
Using red-black search trees, this table can be constructed
in O(m lg m) time, can be searched in O(lg m) time, and
occupies O(m) space. Hence the preprocessing step takes
O(m lg m) time and O(m) space.
Initialization
--------------
To search for s0 within a string s, set the initial value
of i to 0.
Main step
---------
If i+m-1 is less than the number of characters within s,
let s[i+m-1] be the (i+m-1)th character of s. If s[i+m-1]
does not occur within s0, then set i to i+m. Otherwise let
k be the index of the last occurrence of s[i+m-1] within s0,
and set i to i+m-1-k.
With random access to the characters of s, the operation
described by the previous paragraph runs in O(lg m) time.
With random access to the bytes of s represented in UTF-8,
however, the previous paragraph takes O(m) time. The problem
with UTF-8 is that we can't access s[i+m-1] without looking
at all of the bytes that encode the characters that precede
it. We can look at the byte with index i+m-1, and we might
even be able to calculate the character whose UTF-8 encoding
includes the byte at byte index i+m-1 (I don't know Unicode
well enough to know whether this is possible), but knowing
that character doesn't help. For the FLBM algorithm to work,
we have to know the *character* with index i+m-1. Without
knowledge of that character, we cannot use our table to
compute the new value of i. Furthermore we cannot fall back
on preprocessing the UTF-8 representation of s0 and using
that table instead, because the computation of the new value
for i would depend upon knowing that all characters of s are
encoded using some fixed number of bytes, which is not true
of UTF-8.
Now to finish the explanation of the algorithm. If k=m-1,
then setting i to i+m-1-k did not change the value of i.
In that case, and in that case only, we use a naive string
comparison to see whether s0 is a substring of s starting
at index i. If it is, then we have found a match; whether
we have found a match or not, we set i to i+1 and continue.
Then repeat the main step of the algorithm. Stop when i+m-1
is no longer a valid character index into s.
Analysis
--------
I will assume m < n. (If this assumption is not true, then the
problem is vacuous: s0 cannot be a substring of any string to
be searched.)
In the worst case, the FLBM algorithm requires O(mn lg m) time.
In the best case, however, the FLBM algorithm runs in O(n/m lg m)
time. This is also the usual case in practice.
With UTF-8 encoding, the FLBM algorithm requires O(mn lg m) time
in the worst case, and O(n lg m) time in the best case. In the
usual case, FLBM/UTF-8 takes O(n lg m) time. This usual-case
performance for UTF-8 is asymptotically worse than for encodings
that allow random access to characters.
The FLBM algorithm requires O(m) space regardless of the
encoding used for strings.
Will