Re: New draft of SRFI 130: Cursor-based string library

Show/hide message thread

New draft of SRFI 130: Cursor-based string library Arthur A. Gleckler (14 May 2016 16:07 UTC)

Re: New draft of SRFI 130: Cursor-based string library Alex Shinn (14 May 2016 22:44 UTC)

Re: New draft of SRFI 130: Cursor-based string library William D Clinger (21 May 2016 06:53 UTC)

Re: New draft of SRFI 130: Cursor-based string library Alex Shinn (21 May 2016 16:38 UTC)

Re: New draft of SRFI 130: Cursor-based string library John Cowan (21 May 2016 17:01 UTC)

Re: New draft of SRFI 130: Cursor-based string library William D Clinger (21 May 2016 17:36 UTC)

Re: New draft of SRFI 130: Cursor-based string library John Cowan (22 May 2016 04:23 UTC)

Re: New draft of SRFI 130: Cursor-based string library William D Clinger (21 May 2016 17:23 UTC)

Re: New draft of SRFI 130: Cursor-based string library John Cowan (22 May 2016 06:38 UTC)

Re: New draft of SRFI 130: Cursor-based string library Alex Shinn (23 May 2016 02:49 UTC)

Re: New draft of SRFI 130: Cursor-based string library John Cowan (23 May 2016 03:50 UTC)

Re: New draft of SRFI 130: Cursor-based string library William D Clinger (23 May 2016 04:30 UTC)

Re: New draft of SRFI 130: Cursor-based string library Alex Shinn (23 May 2016 04:56 UTC)

Re: New draft of SRFI 130: Cursor-based string library John Cowan (23 May 2016 13:19 UTC)

Re: New draft of SRFI 130: Cursor-based string library William D Clinger (23 May 2016 15:45 UTC)

Re: New draft of SRFI 130: Cursor-based string library John Cowan (23 May 2016 16:52 UTC)

Re: New draft of SRFI 130: Cursor-based string library William D Clinger (23 May 2016 18:01 UTC)

Re: New draft of SRFI 130: Cursor-based string library John Cowan (23 May 2016 20:32 UTC)

Re: New draft of SRFI 130: Cursor-based string library William D Clinger 23 May 2016 15:45 UTC

John Cowan wrote:

> > It's just a matter of pre-computing an index table that maps every
> > character index that's a multiple of (for example) 80 to the
> > corresponding bytevector index of a UTF-8 representation.
>
> I don't see how this can be plausibly implemented on top of a Scheme's
> conventional strings: they would need to be used with an altogether
> disjoint data type,

Yes, texts would be a new and disjoint data type.  That's why I called
them texts instead of strings.

> and providing both strings and spans was immediately
> rejected by the community.

If Scheme programmers are rejecting all new data types, then they'll
reject the cursors of SRFI 130.  Although cursors might be the same
as indexes, they might also be different.  The only correct way for
application programs to use SRFI 130 is to treat cursors as a new
disjoint data type, which is a major reason why it's so hard to use
SRFI 130.  If you're doing your development in a system that makes
cursors the same as indexes, you can't find your mistakes through
testing in that system.

Once the community understands the advantages of immutable texts with
O(1) random access via character indexes, and how well they interoperate
with mutable strings (because you can use the same indexes into each),
I think the community would be willing to allow someone like you or me
to write a new SRFI specifying immutable texts.  If that is not allowed,
then the SRFI process has changed quite a bit.

> The tables would have to be created and
> garbage collected under the covers at appropriate times, and it's not
> clear when those are, nor how to maintain the necessary relationship
> given that most Schemes don't have weak tables.

You're inventing an imaginary problem.  The tables would be packaged
with the texts and garbage-collected under the same rules.  As proof
your imagined problem is imaginary, you don't even need tables: you
can instead represent every text by a vector of bytevectors in which
every bytevector except possibly the last contains exactly 80 characters
(or whatever size you like; it's a simple time/space tradeoff).  That
vector and its bytevector elements would never be exposed outside the
abstraction, so standard garbage collection would work as well as for
any other objects in Scheme.

In particular, it would work *better* than it works now for long strings
represented in UTF-8 or UTF-16.  To understand why would require some
understanding of why objects that encapsulate smaller, possibly shared
objects perform better and are more space-efficient in garbage-collected
systems than larger and more monolithic objects.  From your comment above,
I infer your belief is the opposite of that fact.

> Alternatively, one can use single-byte strings for the Latin-1
> repertoire, double-byte for the BMP repertoire, and quadruple-byte for
> the full repertoire.  This is indexable very quickly.  The only trouble
> with it is that it's hard to interchange strings with the surrounding C
> or Java or C# environment, which is why people have favored a UTF lately
> despite its inefficiency for pure Scheme operations.

Yes, conversions between formats would be necessary, just as conversions
between multiple formats are necessary in programs that use both Java
and C (for example), or in Java programs that run on Linux systems where
Unicode files tend to use UTF-8, or in C programs that run on Windows
systems where Unicode files tend to use something approximating UTF-16.

> > I'll note than none of the sample implementations for SRFI 130
> > actually work if cursors are distinct from indexes,
>
> The foof implementation is straight from Chibi with the addition of
> foof-shim, which provides cursors-as-indexes.  If you take that out and
> provide native cursors (which in Chibi are specially tagged immediates)
> it does work.

It works now that I've fixed it.  The foof implementation didn't work
at all before yesterday, when I fixed numerous bugs, including at least
two bugs in which cursors were being passed to R7RS substring, which
means those parts of the foof implementation couldn't possibly work if
cursors are disjoint from indexes.

The foof implementation was clearly not well-tested.  Its .sld file,
for example, contained two exports of the same name while failing to
export several of the names whose export is required by SRFI 130.  It
also contained numerous bugs that might not have been bugs in some
previous revision of SRFI 130, but are definitely bugs with respect
to the current revision.  I suspect you yourself have done little testing
of the foof implementation, and that what testing you may have done has
been in non-R7RS systems where such bugs are easily overlooked.

> > the basic operations accept both indexes and cursors, often
> > as optional arguments so there's a combinatorial explosion of
> > possibilities,
>
> It's an error to mix cursors and indexes in the same call, so there are
> really only two possibilities.

SRFI 130 says it's "an error unless start and end are both cursors or both
indexes", but SRFI 130 doesn't say the four optional parameters of (say)
string-prefix-length have to be all cursors or all indexes.  That may of
course just be an oversight in SRFI 130.  In any case, the combinatorial
problem has more to do with testing than with inlining.  Having just two
possibilities for (say) string-cursor-next is already enough complexity
to discourage inlining of that and similar procedures, which is why using
SRFI 130 instead of character indexes will probably make programs run
slower instead of faster in the more performant implementations of R7RS
and R6RS.

I don't have any problem with the adoption of SRFI 130 as is, but I do
have a problem with promoting it as though its use of cursors is going
to make string processing run faster.

Will