Re: Surrogates and character representation

Show/hide message thread

Re: Surrogates and character representation William D Clinger (29 Jul 2005 05:30 UTC)

Re: Surrogates and character representation Per Bothner (29 Jul 2005 05:56 UTC)

Re: Surrogates and character representation William D Clinger (29 Jul 2005 08:33 UTC)

Re: Surrogates and character representation Alex Shinn (29 Jul 2005 14:09 UTC)

Re: Surrogates and character representation Per Bothner 29 Jul 2005 05:56 UTC

William D Clinger wrote:
> Alex Shinn quoting me:
>  > > I acknowledged that the algorithm will still work.  My point
>  > > is that its asymptotic complexity may be degraded.
>  >
>  > I'm sorry, perhaps I'm just misunderstanding, but if the exact
>  > same algorithm, in fact the exact same code, can be used,
>  > how is the asymptotic complexity affected?
>
> The different representations for the string being searched
> (UTF-8 vs UTF-32) change the problem.  The main difference
> is that UTF-32 admits random access of characters, while
> UTF-8 does not.  UTF-8 admits random access of bytes, but
> there is no way to convert a byte value obtained by random
> access into a UTF-8 string into the character offset on
> which the Boyer-Moore algorithm depends,

I don't claim to understand Boyer-Moore, beyond what I gather
is the key insight: if you're searching for "ABC" and character
N is "D" there is no point in checking characters N+1 or N+2.
(Hence the non-obvious part of building the appropriate delta
tables before you start.)

But I'm puzzled, since I think Alex's point is this:  A Boyer-Moore
implementation that works on 8-bits characters (e.g. Latin-1) will
work unchanged on UTF-8 characters.  Naively, one would think it
would have the same performance characteristics.  I guess that
statistics of multi-byte characters might throw off the "delta"
tables so they delta will tend to be smaller.  (However, don't
go into details for my sake: I suspect t would take me too much
effort to delve into it.)

> As Per Bothner agreed, that can degrade the asympototic
> complexity of the algorithm from O(n/m) to O(n).

I did?
--
	--Per Bothner
xxxxxx@bothner.com   http://per.bothner.com/