Re: Surrogates and character representation
William D Clinger 29 Jul 2005 08:33 UTC
> But I'm puzzled, since I think Alex's point is this: A Boyer-Moore
> implementation that works on 8-bits characters (e.g. Latin-1) will
> work unchanged on UTF-8 characters.
Alex was right about that, and I was wrong. (I was wrong
for several reasons, which varied over time. At one point
I thought "unchanged" meant the Scheme code would remain
unchanged while STRING-REF was changed from an O(1) to
an O(n) operation, or vice versa. What he really meant
was that the C code would remain unchanged while the data
was changed from UCS-4 to UTF-8, or something like that.)
> Naively, one would think it
> would have the same performance characteristics....
It does, roughly speaking. Alex, however, claimed that the
Boyer-Moore algorithm actually performs *better* on UTF-8,
because he thought the preprocessing phase requires time
and space proportional to the size of the alphabet. That
was what I understood to be his main point, and it was the
reason I felt I had to respond to him. He was wrong about
that.
> > As Per Bothner agreed, that can degrade the asympototic
> > complexity of the algorithm from O(n/m) to O(n).
>
> I did?
No, you didn't. I misunderstood someone else's post; on top
of that, I saw your name in the cc: field and thought it was
the from: field. Sorry.
Will