Re: New draft of SRFI 130: Cursor-based string library William D Clinger 23 May 2016 04:30 UTC
I have updated my test program for SRFI 130, and have made several corrections to one of its sample implementations. I will issue a pull request after I have repaired errors it found within the other sample implementation. Alex Shinn quoting me: > > As for whether we really need SRFI 130, that's a different > > question. I think the better approach is to define a new > > data type of immutable texts, using strings only for those > > few use cases that really need mutation. Immutable texts > > can combine O(1)-time random access with the spatial efficiency > > associated with UTF-8. > > Could you elaborate? I haven't seen any proposals that > could guarantee O(1)-time random access and still allow > UTF-8 or UTF-16. I've seen proposals that optimize for > the ASCII-only case, but this isn't very encouraging to > people who work with multi-lingual text. John's proposal doesn't require O(1)-time random access, but it's pretty straightforward to implement his proposal with O(1)-time random access (by character index as with Scheme's traditional strings). It's just a matter of pre-computing an index table that maps every character index that's a multiple of (for example) 80 to the corresponding bytevector index of a UTF-8 representation. That yields an O(1) limit on the number of characters that need to be scanned per reference, with space overhead of three or four words per text object plus at most 10% the size of a UTF-8 representation. (For texts of less than 80 characters, that overhead reduces to at most two words per text object with 0% overhead for the index table, because you don't need the index table for texts shorter than 80 characters. You can also share a single table amongst all ASCII texts of reasonable size.) I implemented and tested this over a year ago. It's in a lot better shape than the sample implementations for SRFI 130. It isn't as fast as using four bytes per character, but it can be a lot faster than using naive UTF-8 or UTF-16, and it's a lot easier to use than cursors. As evidence for its ease of use, I'll note than none of the sample implementations for SRFI 130 actually work if cursors are distinct from indexes, while my implementations of John's proposal provide the performance benefits that have been claimed for cursors while continuing to use character indexes. Part of that performance advantage comes from hiding all manipulation of cursor-like things behind the abstraction, where they can be implemented in terms of low-level operations for which performant systems will generate inline code. Because SRFI 130 exposes cursors within application code, programs that use SRFI 130 will make many more close-coded procedure calls than programs that stick to operations provided by (scheme base) and other basic libraries. (Implementations are of course free to generate inline code for SRFI 130 procedures, but that's unlikely to happen because the basic operations accept both indexes and cursors, often as optional arguments so there's a combinatorial explosion of possibilities, and that degree of ad hoc polymorphism does not lend itself to generation of inline code.) The third paragraph of SRFI 130's rationale is somewhat misleading because it encourages programmers to think SRFI 130 will greatly improve the performance of string processing in implementations that support full Unicode. In reality, programs that use SRFI 130 will run *slower* in the more performant systems than they would run using Scheme's traditional index-based operations for Unicode. It sounds as though John's focus on shared substrings (in his spans proposal) may have obscured the more important benefit of combining space efficiency with fast access and traversal. Will