Email list hosting service & mailing list manager

constant-time access to variable-width encodings Per Bothner (13 Jul 2005 18:13 UTC)
Re: constant-time access to variable-width encodings Ray Blaak (13 Jul 2005 18:48 UTC)
Re: constant-time access to variable-width encodings Shiro Kawai (13 Jul 2005 20:16 UTC)
Re: constant-time access to variable-width encodings Per Bothner (13 Jul 2005 20:36 UTC)
Re: constant-time access to variable-width encodings Shiro Kawai (13 Jul 2005 23:07 UTC)
Re: constant-time access to variable-width encodings Per Bothner (14 Jul 2005 00:39 UTC)
Re: constant-time access to variable-width encodings Thomas Bushnell BSG (14 Jul 2005 07:18 UTC)
Re: constant-time access to variable-width encodings Thomas Bushnell BSG (14 Jul 2005 07:16 UTC)

Re: constant-time access to variable-width encodings Per Bothner 13 Jul 2005 20:35 UTC

Shiro Kawai wrote:
> I feel a bit uncomfortable, though, with the fact that indexes and
> string-length differ among different implementations, or even in the
> same implementations with different character encodings.

I'm assuming a single character encoding per implementation: either
UTF-8, UTF-16, or a plain array of 20-bit characters.  Supporting
general character encodings is problematic, since you cannot always tell
if a byte is an initial or subsequent (partial) character.

In explaining/specifying my proposal it might be useful to add:
(define (char-representation-size ch)
   ;; Implementations will do this more efficiently!
   (string-length (make-string 1 ch)))

 > It makes a datastructure that holds a string and its indexes
non-portable, for example.

I can see an issue if you try to write that out using one
implementation, and read it back in with another.  Not sure how
important that is.

> I'd agree the proposal if it introduces a different means of
> indexing, other than character count used for string-ref.  Call it
> 'offset' for now.  string-offset-ref, substring-offset etc. would
> provide offset-based operation, while string-ref, substring etc.
> work on character-based op.

That might be reasonable.  But ...

> Though it may be too cumbersome for
> core language.

Well, the complication is that existing code will be less efficient, and
people have a choice between using string-ref (portable to R5RS but
potentially slow) and string-offset-ref (portable to R6RS only but fast).

An alternative idea is to have a cache that maps the most recent (char
index, offset) mapping.  One problem is that even an immutable string
now requires a mutable cache, with possible synchronization issues.

>  And this is too much variable-length-character centric
> API, which fixed-length character implementation or other
> implementations (such as tree of segments) wouldn't care much.

Not sure your point.  Certainly a more complex data structure is
appropriate for (say) a text editor, especially once you support
character "attributes".
--
	--Per Bothner
xxxxxx@bothner.com   http://per.bothner.com/