constant-time access to variable-width encodings
Per Bothner
(13 Jul 2005 18:13 UTC)
|
Re: constant-time access to variable-width encodings
Ray Blaak
(13 Jul 2005 18:48 UTC)
|
Re: constant-time access to variable-width encodings
Shiro Kawai
(13 Jul 2005 20:16 UTC)
|
Re: constant-time access to variable-width encodings Per Bothner (13 Jul 2005 20:36 UTC)
|
Re: constant-time access to variable-width encodings
Shiro Kawai
(13 Jul 2005 23:07 UTC)
|
Re: constant-time access to variable-width encodings
bear
(14 Jul 2005 00:23 UTC)
|
Re: constant-time access to variable-width encodings
Per Bothner
(14 Jul 2005 00:39 UTC)
|
Re: constant-time access to variable-width encodings
bear
(14 Jul 2005 01:52 UTC)
|
Re: constant-time access to variable-width encodings
Thomas Bushnell BSG
(14 Jul 2005 07:18 UTC)
|
Re: constant-time access to variable-width encodings
Thomas Bushnell BSG
(14 Jul 2005 07:16 UTC)
|
Shiro Kawai wrote: > I feel a bit uncomfortable, though, with the fact that indexes and > string-length differ among different implementations, or even in the > same implementations with different character encodings. I'm assuming a single character encoding per implementation: either UTF-8, UTF-16, or a plain array of 20-bit characters. Supporting general character encodings is problematic, since you cannot always tell if a byte is an initial or subsequent (partial) character. In explaining/specifying my proposal it might be useful to add: (define (char-representation-size ch) ;; Implementations will do this more efficiently! (string-length (make-string 1 ch))) > It makes a datastructure that holds a string and its indexes non-portable, for example. I can see an issue if you try to write that out using one implementation, and read it back in with another. Not sure how important that is. > I'd agree the proposal if it introduces a different means of > indexing, other than character count used for string-ref. Call it > 'offset' for now. string-offset-ref, substring-offset etc. would > provide offset-based operation, while string-ref, substring etc. > work on character-based op. That might be reasonable. But ... > Though it may be too cumbersome for > core language. Well, the complication is that existing code will be less efficient, and people have a choice between using string-ref (portable to R5RS but potentially slow) and string-offset-ref (portable to R6RS only but fast). An alternative idea is to have a cache that maps the most recent (char index, offset) mapping. One problem is that even an immutable string now requires a mutable cache, with possible synchronization issues. > And this is too much variable-length-character centric > API, which fixed-length character implementation or other > implementations (such as tree of segments) wouldn't care much. Not sure your point. Certainly a more complex data structure is appropriate for (say) a text editor, especially once you support character "attributes". -- --Per Bothner xxxxxx@bothner.com http://per.bothner.com/