Re: Surrogates and character representation
William D Clinger
(29 Jul 2005 05:30 UTC)
|
Re: Surrogates and character representation Per Bothner (29 Jul 2005 05:56 UTC)
|
Re: Surrogates and character representation
William D Clinger
(29 Jul 2005 08:33 UTC)
|
Re: Surrogates and character representation
Alex Shinn
(29 Jul 2005 14:09 UTC)
|
Re: Surrogates and character representation Per Bothner 29 Jul 2005 05:56 UTC
William D Clinger wrote: > Alex Shinn quoting me: > > > I acknowledged that the algorithm will still work. My point > > > is that its asymptotic complexity may be degraded. > > > > I'm sorry, perhaps I'm just misunderstanding, but if the exact > > same algorithm, in fact the exact same code, can be used, > > how is the asymptotic complexity affected? > > The different representations for the string being searched > (UTF-8 vs UTF-32) change the problem. The main difference > is that UTF-32 admits random access of characters, while > UTF-8 does not. UTF-8 admits random access of bytes, but > there is no way to convert a byte value obtained by random > access into a UTF-8 string into the character offset on > which the Boyer-Moore algorithm depends, I don't claim to understand Boyer-Moore, beyond what I gather is the key insight: if you're searching for "ABC" and character N is "D" there is no point in checking characters N+1 or N+2. (Hence the non-obvious part of building the appropriate delta tables before you start.) But I'm puzzled, since I think Alex's point is this: A Boyer-Moore implementation that works on 8-bits characters (e.g. Latin-1) will work unchanged on UTF-8 characters. Naively, one would think it would have the same performance characteristics. I guess that statistics of multi-byte characters might throw off the "delta" tables so they delta will tend to be smaller. (However, don't go into details for my sake: I suspect t would take me too much effort to delve into it.) > As Per Bothner agreed, that can degrade the asympototic > complexity of the algorithm from O(n/m) to O(n). I did? -- --Per Bothner xxxxxx@bothner.com http://per.bothner.com/