Re: New draft of SRFI 130: Cursor-based string library

Show/hide message thread

New draft of SRFI 130: Cursor-based string library Arthur A. Gleckler (14 May 2016 16:07 UTC)

Re: New draft of SRFI 130: Cursor-based string library Alex Shinn (14 May 2016 22:44 UTC)

Re: New draft of SRFI 130: Cursor-based string library William D Clinger (21 May 2016 06:53 UTC)

Re: New draft of SRFI 130: Cursor-based string library Alex Shinn (21 May 2016 16:38 UTC)

Re: New draft of SRFI 130: Cursor-based string library John Cowan (21 May 2016 17:01 UTC)

Re: New draft of SRFI 130: Cursor-based string library William D Clinger (21 May 2016 17:36 UTC)

Re: New draft of SRFI 130: Cursor-based string library John Cowan (22 May 2016 04:23 UTC)

Re: New draft of SRFI 130: Cursor-based string library William D Clinger (21 May 2016 17:23 UTC)

Re: New draft of SRFI 130: Cursor-based string library John Cowan (22 May 2016 06:38 UTC)

Re: New draft of SRFI 130: Cursor-based string library Alex Shinn (23 May 2016 02:49 UTC)

Re: New draft of SRFI 130: Cursor-based string library John Cowan (23 May 2016 03:50 UTC)

Re: New draft of SRFI 130: Cursor-based string library William D Clinger (23 May 2016 04:30 UTC)

Re: New draft of SRFI 130: Cursor-based string library Alex Shinn (23 May 2016 04:56 UTC)

Re: New draft of SRFI 130: Cursor-based string library John Cowan (23 May 2016 13:19 UTC)

Re: New draft of SRFI 130: Cursor-based string library William D Clinger (23 May 2016 15:45 UTC)

Re: New draft of SRFI 130: Cursor-based string library John Cowan (23 May 2016 16:52 UTC)

Re: New draft of SRFI 130: Cursor-based string library William D Clinger (23 May 2016 18:01 UTC)

Re: New draft of SRFI 130: Cursor-based string library John Cowan (23 May 2016 20:32 UTC)

Re: New draft of SRFI 130: Cursor-based string library William D Clinger 23 May 2016 04:30 UTC

I have updated my test program for SRFI 130, and have made
several corrections to one of its sample implementations.
I will issue a pull request after I have repaired errors it
found within the other sample implementation.

Alex Shinn quoting me:

> > As for whether we really need SRFI 130, that's a different
> > question.  I think the better approach is to define a new
> > data type of immutable texts, using strings only for those
> > few use cases that really need mutation.  Immutable texts
> > can combine O(1)-time random access with the spatial efficiency
> > associated with UTF-8.
>
> Could you elaborate?  I haven't seen any proposals that
> could guarantee O(1)-time random access and still allow
> UTF-8 or UTF-16.  I've seen proposals that optimize for
> the ASCII-only case, but this isn't very encouraging to
> people who work with multi-lingual text.

John's proposal doesn't require O(1)-time random access,
but it's pretty straightforward to implement his proposal
with O(1)-time random access (by character index as with
Scheme's traditional strings).  It's just a matter of
pre-computing an index table that maps every character
index that's a multiple of (for example) 80 to the
corresponding bytevector index of a UTF-8 representation.
That yields an O(1) limit on the number of characters that
need to be scanned per reference, with space overhead of
three or four words per text object plus at most 10% the
size of a UTF-8 representation.  (For texts of less than
80 characters, that overhead reduces to at most two words
per text object with 0% overhead for the index table,
because you don't need the index table for texts shorter
than 80 characters.  You can also share a single table
amongst all ASCII texts of reasonable size.)

I implemented and tested this over a year ago.  It's in a
lot better shape than the sample implementations for SRFI
130.

It isn't as fast as using four bytes per character, but
it can be a lot faster than using naive UTF-8 or UTF-16,
and it's a lot easier to use than cursors.  As evidence
for its ease of use, I'll note than none of the sample
implementations for SRFI 130 actually work if cursors
are distinct from indexes, while my implementations of
John's proposal provide the performance benefits that
have been claimed for cursors while continuing to use
character indexes.

Part of that performance advantage comes from hiding all
manipulation of cursor-like things behind the abstraction,
where they can be implemented in terms of low-level
operations for which performant systems will generate
inline code.  Because SRFI 130 exposes cursors within
application code, programs that use SRFI 130 will make
many more close-coded procedure calls than programs that
stick to operations provided by (scheme base) and other
basic libraries.  (Implementations are of course free to
generate inline code for SRFI 130 procedures, but that's
unlikely to happen because the basic operations accept
both indexes and cursors, often as optional arguments so
there's a combinatorial explosion of possibilities, and
that degree of ad hoc polymorphism does not lend itself
to generation of inline code.)

The third paragraph of SRFI 130's rationale is somewhat
misleading because it encourages programmers to think
SRFI 130 will greatly improve the performance of string
processing in implementations that support full Unicode.
In reality, programs that use SRFI 130 will run *slower*
in the more performant systems than they would run using
Scheme's traditional index-based operations for Unicode.

It sounds as though John's focus on shared substrings
(in his spans proposal) may have obscured the more
important benefit of combining space efficiency with
fast access and traversal.

Will