Re: constant-time access to variable-width encodings bear (14 Jul 2005 00:23 UTC)
On Wed, 13 Jul 2005, Per Bothner wrote:

> Shiro Kawai wrote:
>> I feel a bit uncomfortable, though, with the fact that indexes
>> and string-length differ among different implementations, or
>> even in the same implementations with different character
>> encodings.

> I can see an issue if you try to write that out using one
> implementation, and read it back in with another.  Not sure how
> important that is.

Actually, it's supposed to be a non-problem for unicode-compliant
applications, because the unicode string equivalence algorithm is
*required* to treat strings as equivalent regardless of how the
graphemes within them are encoded.

Speaking of which, the current draft of the SRFI is not
unicode-compliant in that its string=? predicate does not detect
strings which are "canonically equivalent" according to the Unicode
Consortium's required string equivalence algorithm.  They define
strings as equal if they contain a sequence of graphemes which are
equivalent, and you're defining strings as equal if they contain a
sequence of codepoints which are equivalent.

Aaaand, this is yet another problem that goes away if you embrace
glyph=character instead of codepoint=character.

With Unicode, you *CANNOT* make assumptions about how strings are
represented.  Two strings which are "equal" under unicode's required
equivalence predicates may be of different lengths and have not a
single codepoint in common, and the differences are purely
representation artifacts.

If you embrace glyph=character then at least a given string will
portably be a fixed number of characters, and a unicode-aware char=?
predicate can bury representation artifacts below the level of notice
of the programmer or user.

                                Bear
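
To make the point about representation artifacts concrete, here is a minimal sketch in Python (not Scheme, and not anything proposed in the SRFI draft). It compares a precomposed and a decomposed spelling of the same accented letter, first codepoint by codepoint and then after canonical normalization. The helper names `codepoint_equal`, `canonically_equal`, and `approx_grapheme_count` are hypothetical, chosen only for this illustration, and the grapheme count is a rough approximation that merely skips combining marks rather than implementing full Unicode text segmentation.

```python
import unicodedata


def codepoint_equal(a: str, b: str) -> bool:
    """Compare two strings codepoint by codepoint (codepoint=character view)."""
    return a == b


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD).

    Strings that normalize to the same NFD form are canonically equivalent.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def approx_grapheme_count(s: str) -> int:
    """Rough stand-in for a grapheme count: skip combining marks.

    This is an assumption made for illustration only; real grapheme
    segmentation has more rules than this.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one codepoint
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT, two codepoints

# Different lengths and no codepoint in common ...
print(len(precomposed), len(decomposed))            # 1 2
print(set(precomposed) & set(decomposed))           # set()

# ... unequal by codepoints, yet canonically equivalent.
print(codepoint_equal(precomposed, decomposed))     # False
print(canonically_equal(precomposed, decomposed))   # True

# Counting by (approximate) graphemes gives the same length for both.
print(approx_grapheme_count(precomposed), approx_grapheme_count(decomposed))  # 1 1
```

A string=? built on the first comparison reports these two spellings as different; one built on the second reports them as equal, which is the behaviour canonical equivalence describes.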