Re: constant-time access to variable-width encodings
Per Bothner 13 Jul 2005 20:35 UTC
Shiro Kawai wrote:
> I feel a bit uncomfortable, though, with the fact that indexes and
> string-length differ among different implementations, or even in the
> same implementations with different character encodings.
I'm assuming a single character encoding per implementation: either
UTF-8, UTF-16, or a plain array of 20-bit characters. Supporting
general character encodings is problematic, since you cannot always tell
if a byte is an initial or subsequent (partial) character.
In explaining/specifying my proposal it might be useful to add:
(define (char-representation-size ch)
;; Implementations will do this more efficiently!
(string-length (make-string 1 ch)))
> It makes a datastructure that holds a string and its indexes
non-portable, for example.
I can see an issue if you try to write that out using one
implementation, and read it back in with another. Not sure how
important that is.
> I'd agree the proposal if it introduces a different means of
> indexing, other than character count used for string-ref. Call it
> 'offset' for now. string-offset-ref, substring-offset etc. would
> provide offset-based operation, while string-ref, substring etc.
> work on character-based op.
That might be reasonable. But ...
> Though it may be too cumbersome for
> core language.
Well, the complication is that existing code will be less efficient, and
people have a choice between using string-ref (portable to R5RS but
potentially slow) and string-offset-ref (portable to R6RS only but fast).
An alternative idea is to have a cache that maps the most recent (char
index, offset) mapping. One problem is that even an immutable string
now requires a mutable cache, with possible synchronization issues.
> And this is too much variable-length-character centric
> API, which fixed-length character implementation or other
> implementations (such as tree of segments) wouldn't care much.
Not sure your point. Certainly a more complex data structure is
appropriate for (say) a text editor, especially once you support
character "attributes".
--
--Per Bothner
xxxxxx@bothner.com http://per.bothner.com/