Hi all,

Ropes are an O(log n) random-access, O(log n) concatenation structure for string representation.  I assume you already know about ropes.

Consider a rope implementation where the following constructors exist (note that we probably need a "length" field and some way of tracking balance for cat nodes, but let's elide those for now).

    (cat r1 r2) - The concatenation of ropes r1 and r2
    (ascii byte ...) - A rope composed of ASCII characters, represented as a bytevector with 1 byte per character
    (bmp two-bytes ...) - A rope composed of codepoints in the BMP, represented as a bytevector with 2 bytes per character (not UTF-16!)
    (utf08 byte ...) - A rope composed of arbitrary Unicode codepoints, represented as a bytevector using the UTF-8 variable-length encoding

Now consider the use of a text composed entirely of ASCII characters, which we expect to be a very common case.  That would then be very well represented by the (ascii byte ...) constructor.

Consider another text composed of some number of non-ASCII characters in the BMP, which are much rarer than ASCII strings but still "reasonably common".  This would be well represented by the (bmp two-bytes ...) constructor.

Consider another text composed entirely of the single character "🐱", U+1F431 [Footnote 1].  This is represented with a 4-byte (utf8 byte ...) constructor.

Now Clinger's technique can also include such optimized representations, so that strings composed entirely of ASCII characters or at most BMP characters won't need to be represented by a UTF-8 bytevector with an attached acceleration vector, and have "true" O(1) access.

But consider the concatenation of those 3 texts.  How would we represent it?

With ropes, we can represent this with (cat (ascii byte ...) (cat (bmp two-bytes ...) (utf8 byte ...)))

In Clinger's technique, the concatenation would require all of them to use the most generic UTF-8 encoding.  If the total concatenated string is less than 80 (or whatever length is used for subparts) in length then Clinger's representation has O(n) text-ref.

On the other hand, with ropes, if we index inside the ASCII part, we can get true O(1) text-ref.  We need to traverse the (cat r1 r2) nodes a little, but effectively we can get to the characters in 2 operations for this case.

Converting from standard Scheme strings to these ropes would be O(n), mostly because we'd be scanning to check if the Scheme string can be implemented using the ASCII representation, or the BMP representation, a combination, and so on.  Ropes also allow more sharing, not just between strings and substrings, but also strings and their concatenations.

This is all really just an argument: is O(1) text-ref truly practical?

I think that it's better to leave text-ref at O(log n) or even O(n), and instead require *traversal* to be O(n).

For example, most string operations I know of require traversal in one direction only.  So instead, I propose these operations:

    (make-text-cursor text) -> text-cursor  O(log n)

Create a text cursor for traversing a text, pointing to the first character.

    (make-text-cursor-at-end text) -> text cursor  O(log n)

Create a text cursor for traversing a text, pointing to one past the last character.

    (text-cursor? datum) -> bool  O(1)

Return true if the datum is a text cusor.

    (text-cursor-copy text-cursor) -> text-cursor  O(1)

Copy the text cursor's state.  The original and the copy may traverse the same text independently of each other.

    (text-cursor-at-first? text-cursor) -> bool  O(1)

Return true if the cursor is at the first character of the text.

    (text-cursor-at-end? text-cursor) -> bool  O(1)

Return true if the cursor is one past the end of the text.

    (text-cursor-ref text-cursor) -> character  O(1)

Return the character at the cursor.  It is an error to call this procedure if text-cursor-at-end? would return true for the given cursor.

    (text-cursor-next! text-cursor) -> unspecified  O(1)
    (text-cursor-next! text-cursor exact-number) -> unspecified  O(n)

Move the text-cursor forward in the text.  If exact-number is given, advance it by that number of characters.  It is an error for exact-number to be 0, negative, infinite, or not a number.  O(n) for this case is O(n) on the exact-number given.

    (text-cursor-prev! text-cursor) -> unspecified  O(1)
    (text-cursor-prev! text-cursor exact-number) -> unspecified  O(n)

Move the text-cursor backward in the text, similar to text-cursor-next!

    (text-cursor=? text-cursor1 text-cursor2) -> bool  O(1)

Return true if the text cursors are pointing to the same character position in the text.  It is an error (which may or may not be signaled) if the text-cursors are traversing non-identical strings.

This API can be implemented on top of either ropes or Clinger's technique, while achieving the specified complexity.  I suspect it is practical to build most of the existing text API on top of this one.


Sincerely,
AmkG


[Footnote 1] Because I'm a cat, that's why.  Meow.