Re: extending the discussion Dan Bornstein 21 Dec 1999 03:40 UTC

I agree with most of what Tom wrote, and don't disagree enough with most of
the rest to feel it worth commenting on. However, I do have a strong
opinion about one part, which I mostly disagree with:

Tom Lord:
>        * In a SRFI which defines "substring/shared", it should be
>          mandatory that the string returned from that procedure share
>          state with the primary string argument.

This part, I agree with. It seems silly to me to explicitly expose this
sort of mechanism and expect people *not* to start relying on it.

>          Two additional procedures are desirable:
>                shared-substring? obj => boolean
>          which tells whether a particular string is a shared
>          substring

I don't understand why this procedure is important or useful. I also
imagine it would be hard to implement in a meaningful way. For example, if
I call (substring/shared <string> 0 (string-length <string>))--or something
less obvious with the same result--does shared-substring? have to return #t
for the result? If so, that seems to mean that either you can't return
<string> itself for that case (seems like a bad idea to me) or that all
strings need to have a marker to indicate that substring/shared was called
on the string with the entire string as the result (also seems like a bad
idea to me); if not, then how do you know that the "parent" string is, at
the programmer's level of abstraction, shared (which is presumably what
this call is about)? Also, if I call substring/shared on a string and then
that string later becomes garbage, does the result of shared-substring?
change for the substring that is now no longer shared? If not, what does it
*really* mean to be a shared substring? How about the case where two
non-overlapping shared substrings get created from a common parent and then
the parent gets reclaimed?
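
To make the first of those questions concrete, here is a toy model. It is
purely my own sketch, not how any real implementation works: I'm assuming a
shared substring is represented as a vector #(parent start end), and that
substring/shared takes the obvious shortcut of returning the string itself
when asked for the whole range.

    ;; Hypothetical representation: a shared substring is the vector
    ;; #(parent start end); an ordinary string is just a string.
    (define (substring/shared s start end)
      (if (and (= start 0) (= end (string-length s)))
          s                        ; whole range: hand back s itself
          (vector s start end)))   ; otherwise: a view onto s

    ;; In this model, the only thing shared-substring? can test is
    ;; the representation.
    (define (shared-substring? obj) (vector? obj))

    (define s (string-copy "foobarbaz"))
    (shared-substring? (substring/shared s 3 6))   ; => #t
    (shared-substring? (substring/shared s 0 9))   ; => #f

The second call answers #f even though its result shares every character
with s, which is exactly the ambiguity I'm worried about.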

Another issue is how this might be extended, meaningfully, to the other
cases where sharing is defined, such as in string-append/shared, which is,
in a sense, a complementary function to substring/shared. That is, you can
end up with what would seem to be two functionally identical pairs of
objects through either of these sets of calls:

    (define foobarbaz (string-copy "foobarbaz"))
    (define bar (substring/shared foobarbaz 3 6))

    (define bar "bar")
    (define foobarbaz (string-append/shared (string-copy "foo")
                                            bar
                                            (string-copy "baz")))

In the second situation, should (shared-substring? bar) return #t? If so,
it seems to imply a lot of extra mechanism to figure out that fact (seems
like a bad idea to me). If not, it seems like shared-substring? has to lie
some of the time since, at that point, bar really is a shared substring of
foobarbaz (also seems like a bad idea to me).

>                containing-string string => string start end
>          which converts a shared substring to its parent string and
>          indexes, and an ordinary string to itself, 0, and its
>          length.

I have a big problem with this, from a security standpoint. Having this
functionality makes it easy to make the mistake of passing around more than
you intended, which is especially a problem in situations where you don't
necessarily trust the code that you're running (either because it might be
malicious or because it was coded incompetently). Basically, it makes it
too easy to inadvertently and non-obviously share data. The canonical
example:

     (define s "Userid: danfuzz\nPassword: fuzzball187\n")
     (define userid (substring/shared s 8 15))
     ;; ... lots of random code ...
     (malicious-procedure-masquerading-as-something-innocent userid)

...and then malicious-procedure goes on to call containing-string on userid
and extract the password.
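
Here is roughly what that attack looks like, as a self-contained sketch.
Everything in it is an assumption for illustration: the vector
representation #(parent start end), view->string, and the short name
malicious-procedure are all mine, and containing-string is written the way
I read Tom's description of it. It also assumes your Scheme reads "\n" in
a string literal as a newline.

    ;; Hypothetical model of a shared substring: #(parent start end).
    (define (substring/shared s start end) (vector s start end))

    ;; What an honest consumer sees: just the characters in range.
    (define (view->string v)
      (substring (vector-ref v 0) (vector-ref v 1) (vector-ref v 2)))

    ;; The proposed procedure: parent string plus the two indexes.
    (define (containing-string v)
      (values (vector-ref v 0) (vector-ref v 1) (vector-ref v 2)))

    (define s "Userid: danfuzz\nPassword: fuzzball187\n")
    (define userid (substring/shared s 8 15))

    ;; The callee was handed only the userid, but it can walk back to
    ;; the parent and read everything after the end of the userid line.
    (define (malicious-procedure v)
      (call-with-values
        (lambda () (containing-string v))
        (lambda (parent start end)
          (substring parent (+ end 1) (string-length parent)))))

Here (view->string userid) yields "danfuzz", while (malicious-procedure
userid) yields the whole password line, and nothing at the call site hints
that the second thing was ever handed over.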

Also, just from the efficiency perspective, making a backlink from a shared
substring to its parent visible to the programmer means that you can't
reclaim the storage for a (possibly much larger) parent until all of its
shared-substring children become garbage. A common pattern makes this
costly: parsing out and keeping references to only the "interesting" bits
of a long string (e.g., one read in from a file). In that situation, with
containing-string, the parent string will be unreclaimable deadweight.