John Cowan wrote: > > It's just a matter of pre-computing an index table that maps every > > character index that's a multiple of (for example) 80 to the > > corresponding bytevector index of a UTF-8 representation. > > I don't see how this can be plausibly implemented on top of a Scheme's > conventional strings: they would need to be used with an altogether > disjoint data type, Yes, texts would be a new and disjoint data type. That's why I called them texts instead of strings. > and providing both strings and spans was immediately > rejected by the community. If Scheme programmers are rejecting all new data types, then they'll reject the cursors of SRFI 130. Although cursors might be the same as indexes, they might also be different. The only correct way for application programs to use SRFI 130 is to treat cursors as a new disjoint data type, which is a major reason why it's so hard to use SRFI 130. If you're doing your development in a system that makes cursors the same as indexes, you can't find your mistakes through testing in that system. Once the community understands the advantages of immutable texts with O(1) random access via character indexes, and how well they interoperate with mutable strings (because you can use the same indexes into each), I think the community would be willing to allow someone like you or me to write a new SRFI specifying immutable texts. If that is not allowed, then the SRFI process has changed quite a bit. > The tables would have to be created and > garbage collected under the covers at appropriate times, and it's not > clear when those are, nor how to maintain the necessary relationship > given that most Schemes don't have weak tables. You're inventing an imaginary problem. The tables would be packaged with the texts and garbage-collected under the same rules. As proof your imagined problem is imaginary, you don't even need tables: you can instead represent every text by a vector of bytevectors in which every bytevector except possibly the last contains exactly 80 characters (or whatever size you like; it's a simple time/space tradeoff). That vector and its bytevector elements would never be exposed outside the abstraction, so standard garbage collection would work as well as for any other objects in Scheme. In particular, it would work *better* than it works now for long strings represented in UTF-8 or UTF-16. To understand why would require some understanding of why objects that encapsulate smaller, possibly shared objects perform better and are more space-efficient in garbage-collected systems than larger and more monolithic objects. From your comment above, I infer your belief is the opposite of that fact. > Alternatively, one can use single-byte strings for the Latin-1 > repertoire, double-byte for the BMP repertoire, and quadruple-byte for > the full repertoire. This is indexable very quickly. The only trouble > with it is that it's hard to interchange strings with the surrounding C > or Java or C# environment, which is why people have favored a UTF lately > despite its inefficiency for pure Scheme operations. Yes, conversions between formats would be necessary, just as conversions between multiple formats are necessary in programs that use both Java and C (for example), or in Java programs that run on Linux systems where Unicode files tend to use UTF-8, or in C programs that run on Windows systems where Unicode files tend to use something approximating UTF-16. > > I'll note than none of the sample implementations for SRFI 130 > > actually work if cursors are distinct from indexes, > > The foof implementation is straight from Chibi with the addition of > foof-shim, which provides cursors-as-indexes. If you take that out and > provide native cursors (which in Chibi are specially tagged immediates) > it does work. It works now that I've fixed it. The foof implementation didn't work at all before yesterday, when I fixed numerous bugs, including at least two bugs in which cursors were being passed to R7RS substring, which means those parts of the foof implementation couldn't possibly work if cursors are disjoint from indexes. The foof implementation was clearly not well-tested. Its .sld file, for example, contained two exports of the same name while failing to export several of the names whose export is required by SRFI 130. It also contained numerous bugs that might not have been bugs in some previous revision of SRFI 130, but are definitely bugs with respect to the current revision. I suspect you yourself have done little testing of the foof implementation, and that what testing you may have done has been in non-R7RS systems where such bugs are easily overlooked. > > the basic operations accept both indexes and cursors, often > > as optional arguments so there's a combinatorial explosion of > > possibilities, > > It's an error to mix cursors and indexes in the same call, so there are > really only two possibilities. SRFI 130 says it's "an error unless start and end are both cursors or both indexes", but SRFI 130 doesn't say the four optional parameters of (say) string-prefix-length have to be all cursors or all indexes. That may of course just be an oversight in SRFI 130. In any case, the combinatorial problem has more to do with testing than with inlining. Having just two possibilities for (say) string-cursor-next is already enough complexity to discourage inlining of that and similar procedures, which is why using SRFI 130 instead of character indexes will probably make programs run slower instead of faster in the more performant implementations of R7RS and R6RS. I don't have any problem with the adoption of SRFI 130 as is, but I do have a problem with promoting it as though its use of cursors is going to make string processing run faster. Will