Round 2 discussion - Simplelists

Show/hide message thread
Round 2 discussion shivers@xxxxxx (04 Nov 1999 13:55 UTC)
Re: Round 2 discussion Per Bothner (04 Nov 1999 22:31 UTC)
Round 2 discussion shivers@xxxxxx 04 Nov 1999 13:55 UTC
Here is a general discussion of the issues raised by Kleberg, Bornstein,
Egorov, Hartrumpf, Welsh & Arvestad. I will summarise the major issues
in a following message. After you read this discussion, please read the
issues message, and send me your comments and votes!

Feel free to flame about things I didn't move over to the "issues" message
if you still want to push them.

    From: Bengt Kleberg <xxxxxx@softwell.se>
    1 Would it be possible to allow all procedures that have a _single_ [start
    end] pair to accept both, ie [start [end]]? Like substring does (only that
    substring must have a start, ofcourse).

What you want is what they are.

    2 Must string-index-right (and string-skip-right) take their [end start]
    pair in that order?  I assume that this is influenced by the list library,
    but if not, please consider using [start end] instead.

The issue here is that you *almost never* specify the left bound; you
*frequently* specify the right bound. You actually *start* with END in the
-right procedures. That is why they are flipped -- you can say
	    (let ((i (string-skip-right s char-set:white i)))
              ...)
In other words, with both the from-left and from-right variants, the
first optional argument says where to begin the search, and the second
optional argument says where to end the search.

Note that if you get it wrong, you will definitely trigger a run-time
error, since you aren't allowed to have a start parameter that's larger
than the end parameter.

    From: Dan Bornstein <xxxxxx@milk.com>
    The new caveat on substring, that it be allowed to share storage, seems
    strange to me, since it could noticeably and adversely affect pre-existing
    programs that rely on the old behavior. Plus, the SRFI specifies "/shared"
    variants of other calls to explicitly call out the possibility of sharing.
    So, why not leave substring alone, and add a substring/shared which is
    allowed (but not required) to share storage with its argument?

    To support pure functional programming, it seems to me that there ought to
    be two new procedures:

	(string-set string i char) -> string
	(string-fill s char [start end]) -> string

    which would be like their "!" brethren, except they return new strings
    instead of altering their arguments.

C'mon. Do you really think that people would use STRING-SET ?
STRING-FILL is an easier case to make. Let's see, that would be

    (define (string-fill s c . maybe-start+end)
      (let ((ans (string-copy s)))
	(apply string-fill! ans c maybe-start+end)
	ans))

Should I add this? I cannot think of any cases where it would be all that
handy.

    I fear the new string-copy! for one main reason: In all other cases in
    Scheme, the only difference between a "!" and unmarked procedure with the
    same name has been the fact that the "!" version modifies (or is at least
    allowed to modify) its first argument and the unmarked version returns a
    new object. In this case, the argument specs are different, and I suspect
    this will be a source of confusion.

    I suggest fixing this by [1] renaming the current string-copy! Off-the-cuff
    suggestion, string-copy-chars! and [2] adding a non "!" version of the same
    procedure (e.g., string-copy-chars), to provide the symmetry that there is
    in the rest of the spec.

Yeah, you're right. However, your non-side-effecting STRING-COPY is subsumed
by the STRING-REPLACE Welsh proposes below. I think I'll leave things as-is.

    From: Donald Welsh <xxxxxx@melbpc.org.au>

    string-take
    string-drop
    string-take-right
    string-drop-right
    substring-move-left
    substring-move-right

    These (if needed) can all be defined as special cases of a more general
    function, string-replace.

    string-replace s1 start end s2 -> string
	replace the substring of s1 from start to end with s2

    A variant, string-replace! could also be defined, which is allowed to reuse
    the original storage of s1.

    With a regexp package, regexp matching could used to find start and end
    points, then string-replace could perform the replacement.

For regexp replacement, you want something a little more sophisticated.
See REGEXP-SUBSTITUTE & REGEXP-SUBSTITUTE/GLOBAL in scsh.

Further, you can do string-take/drop/move with STRING-REPLACE... but it's
awkward. I prefer to name these operations.

Nonetheless, STRING-REPLACE seems like a handy thing. I'll add it, spec'd
as follows, unless I get a lot of complaints:

string-replace s1 start1 end1 s2 [start2 end2] -> string
    Returns (string-append (substring s1 0 start1)
                           (substring s2 start2 end2)
                           (substring s1 end1 (string-length s1)))
    That is, the segment of characters in S1 from START1 to END1
    is replaced by the segment of characters in S2 from START2 to END2.
    If START1=END1, this simply splices the S2 characters into S1 at the
    specified index.

    Examples:
	(string-replace "The TCL programmer endured daily ridicule." 4 7
                        "miserable perl") =>
            "The miserable perl programmer endured daily ridicule."

	(string-replace "It's easy to code it up in Scheme." 5 5 "really ") =>
            "It's really easy to code it up in Scheme."

    ------------

    reverse-string-concatenate string-list [end] -> string
    reverse-string-concatenate/shared string-list [end] -> string

    Why are these included?  Without the optional parameter, the stated
    equivalents are clearer.  With END these functions are odd -- why are they
    needed?

They are included to eliminate the unnecessary consing of the reversed list.
They are also included because *these particular* operations are *exactly*
the utility needed by a wide variety of functions that accumulate character
data into a string result, where one does not know in advance how much
character data will be accumulated. So one accumulates the data into a
linked list of, say, 1k buffers, the assembles the buffers into a final
result at the end. The last buffer (which is first on the reversed list)
may only be partially full -- hence the optional final argument.

    string-filter s char/char-set/pred [start end] -> string
    string-delete s char/char-set/pred [start end] -> string

    The char/char-set/pred isn't intuitive.

Predicates and char-sets are *the* two basic ways to describe a set of
characters, so we handle them both. Handling the case of a single char is just
a common-case convenience hack; not fundamental or really all that important.

Note that there's an efficiency reason to favor char-set's over predicates.
The STRING-FILTER procedure knows that doing a char-set membership test
causes no side effects. So it can do it more than once. This means that
it can use a two-pass strategy: once scan to count up the number of surviving
characters, allocate a string of exactly the right size, then a second scan
to copy the chars into the result string. In contrast, the predicate scan
loop only applies the predicate once to each character, so it has to use
an adaptive strategy for allocating the result, which means it runs slower
and produces garbage (this is the kind of code where one can profitably
use REVERSE-STRING-CONCATENATE/SHARED, by the way).

    If char-sets are needed, perhaps
    there should be a set package SRFI (is there one already?), or the
    functionality could be provided by a regexp package.

I have submitted a char-set SRFI, but it has not yet been announced for
discussion. (This just in: it has just been announced. See
    http://srfi.schemers.org/srfi-14/
It's a much smaller SRFI, but was designed in concert with SRFI-13.)

    Suggest dropping
    string-delete and using an alternative version of string-filter which uses
    a string -> string function f instead of char/char-set/pred.  (String
    deletion can be done by f returning "".)  The function f should go first,
    as with map and string-map.

    string-filter f s [start end] -> string
	construct a string by applying f to each character of s
	function f should take a string of length one as input, and output a
        string (of any length)

This is a cute function. Consider that you could use it for backslash
encoding strings, for example -- just say

    (string-filter (lambda (c) (if (char=? c #\\) (string #\\ c) (string c)))
                   s)

But it's really inefficient; generates lots of garbage. You can
backslash-encode more efficiently by saying

   (let ((ans-len (string-fold (lambda (c sum)
                                 (+ sum (if (char=? c #\\) 2 1)))
                               0 s))
         (ans (make-string ans-len)))
     (string-fold (lambda (c i)
                    (let ((i (if (char=? c #\\)
		                 (begin (string-set! ans i #\\) (+ i 1))
			         i)))
                      (string-set! ans i c)
                      (+ i 1)))
                  0 s)
     ans)

which generates no intermediate garbage strings. Can anyone come up with
some compelling examples of good uses for Donald's STRING-FILTER ? And
I'd need another name for it -- STRING-FILTER and STRING-DELETE are fixed,
in order to maintain parallelism with their list-lib analogues.

    From: "Sergei Egorov" <xxxxxx@informaxinc.com>
    First, I only partially agree with the rationale.
    Yes, RNRS's set of string operations is poor, but
    there's nothing wrong with naming conventions (the
    choice between string= and string=? is mainly a matter
    of taste; the former is compatible with = (=? variants
    existed in R2RS to make everybody happy), but the
    latter is compatible with char=? and other predicates,
    and it is both familiar and standard).

Exactly -- why not, then, be consistent, and rename the numeric functions
	 <? <=? =? >=? >?
Urgh. Looks really ugly to me. It's redundant and ugly and wastes horizontal
margin, which matters when you try to code in an 80-column buffer. But the
worst of these offenses is that it's ugly. I am consistently using the
inequality symbols alone, without an additional ? suffix, in the libraries
I've been proposing -- so I'm being consistent with the numeric preds, not
the others.

    Shared substrings are practically unknown to the Scheme community and,
    even if they are implemented somewhere, I see no reason to make
    incompatible changes to the semantics of SUBSTRING (adding
    SUBSTRING/SHARED is definitely a better idea, but why put normal and
    /SHARED versions in one SRFI?).

Shared substrings are in guile, I believe. T had them. I'm trying to do this
library so that it will efficiently support programming both in
implementations that do and do not have them. It's worth putting in a little
work to support them, because they're a useful feature.

By the way, I should state my goal here: I'm *not* trying to do a complete
shared-text string system; I'm simply trying to do a general string library
that can be efficiently used in both kinds of systems. Note that even if your
Scheme system *doesn't* provide shared-text substrings, you can *still*
profitably use the /shared procedure variants (and I have, in my code -- this
SRFI has some of its roots in the utilities I've written for systems code I've
built for scsh). The /shared variants can exploit special cases, such as
having a substring op return the original string when the substring start/end
indices specify the entire string.

Here's the thing about R5RS SUBSTRING: once we generalise STRING-COPY to take
start/end parameters, it trivially subsumes R5RS SUBSTRING. We should just
throw it away.

You might suggest dropping STRING-COPY (which Dan doesn't like for reasons
of non-parallelism w/STRING-COPY!, anyway), and simply retaining the R5RS
copying SUBSTRING, extending it to make the start/end parameters optional.
I did consider this. But it seems kinda misleading to tell people: to copy
a string, write
  (substring s)
How's that again? That looks very weird; not clear at all -- more of an
oddball, corner-case "idiom" than a straightforward useage. Whereas, these are
all quite clear as to what they do:
      (STRING-COPY s)		; Copy the string. Right. Got it.
      (STRING-COPY s 3)		; Copy the string, starting at index 3. Right.
      (STRING-COPY s 0 end)	; Copy the string from 0 to END. No problem.
So, once again, I've worked my way around to trivially subsuming R5RS
SUBSTRING into STRING-COPY, just by virtue of the *general* library rule
that all procedures take optional start/end parameters where possible and
sensible. R5RS provides *both* STRING-COPY and SUBSTRING -- not too clever,
given how trivial it is to subsume one into the other. But we can exploit
this by assigning these two functions an honest-to-God *useful* distinction.
So that is what I did.

So, we should punt R5RS's SUBSTRING *for sure*. That leaves two avenues open:

- The not-entirely-backwards-compatible one:
  We use the name SUBSTRING for the shared-text variant. This seems like
  a good use of the name. It does have small backwards-compatibility issues
  if you are (1) using a Scheme with shared-text strings and (2) wrote
  code that relied on SUBSTRING returning fresh storage. I think this
  is not a big discontinuity; others may think I'm a danger to society.

- The conservative one:
  Just to *make sure* people don't accidentally use a shared-text
  substring op in old code where they wanted the fresh-storage substring,
  we simply *drop* SUBSTRING entirely, and use SUBSTRING/SHARED in this
  SRFI.

I think you guys are a bunch of wimps. But I'm interested in your opinions.
I believe I know where Egorov & Bornstein stand, yes? Y'all would
probably vote for the conservative path? Would everyone please check in
on this issue?

A last call for not being ruled by backwards compatibility: The code that's
*been written* is fixed in size. The code that *will be written* is unbounded.
Always take an opportunity to set things right. Too much adherence to
backwards compatibility (Zetalisp, in particular) is one of primary reasons
for most of Common Lisp's current uglinesses -- yet Zetalisp has vanished
into the mists. OK, that's my flame.

    I also oppose "dropping" of standard Scheme procedures. Does this mean
    that to claim support for SRFI-13 (or SRFI-1) my implementation should
    make them unavailable? If not, then why mention it at all? I believe the
    whole idea belongs to a separate "Scheme Request For Non-implementation"
    process (SRFN-0: drop TRANSCRIPT-ON and TRANSCRIPT-OFF).

They aren't dropped from Scheme. They just aren't in this SRFI. This SRFI
is a specific set of bindings; if you write code that uses this SRFI, you
use what it provides; no problem. If you want stuff that isn't in this SRFI,
then import that stuff, too. No problem.

    I believe that <> convention for "not equal" is not the best choice: it
    doesn't look right when applied to domains with no natural order
    (i.e. MY-RECORD<>).  Other possible choices are ~=, /=, ^=, and != (all of
    them are subpar, but if forced to make a choice, I would pick ~=).

I see your point... but I'm going to stick with <>. None of the other choices
seem all that much better to me (and I don't have any better suggestions,
myself).

    ;;; Mismatch index in string-comparison procedures:

    Why do we need this mismatch index in each and every comparison
    procedure? The fact that it can be returned doesn't necessarily
    mean that it should; if somebody's procedure ends in (string= s1 s2)
    and I have to rewrite it, say, using lists instead of strings,
    I don't want to scan all the code looking for call sites to
    check if the return value is just used as a boolean (imitating
    mismatch index in my new code may be problematic). I will
    have all this imaginary trouble because INTENTIONS of the
    original author were not clear. This, however, didn't stop
    the designers of MEM*, ASS* and dozens of other functions,
    but in many cases I can see some advantages in returning
    non-#t value. But, judging from my modest experience, I beleive
    that in regular lexicographical string comparison the
    mismatch index is NEVER needed. It is definitely less
    useful in practice than n-ary string comparison predicates;
    and the unfortunate fact is that STRING{>...} returning a
    mismatch index cannot be generalized to n-ary case!

It turns out to be a handy value to have around if you are comparing
strings. The string compare functions produce it, so we can essentially export
it to clients for free. "Don't hide functionality."

(By the way, you *can* generalise mismatch indices to n-ary comparisons, but
it seems to be less useful.)

    SUBSTRING-COMPARE{-CI}: is mismatch index (!?) relative or absolute?

You get an index into the string itself (absolute); it's relative to the
beginning of the string, not the beginning of the selected substring
segment. I've added a little text to the SRFI to make this clear.

    STRING-ITER: why not STRING-ITERATE?

Why not, indeed? I'm flexible (I'm ever so slightly in favor of -ITER, because
I'm a brevity freak, but can definitely see the merits of -ITERATE). How do
people feel about this one?

    {SUB}STRING-{PRE,SUF}FIX-COUNT{-CI}: why not -LENGTH- instead of -COUNT-?
    In both CommonLisp and SRFI-1, COUNT is associated with selective
    counting, not measuring the length of contiguous subsequences.

No strong reason; I believe these names came over from MIT Scheme. We
could change. May I have other opinions on this matter?

    SUBSTRING{-CI}? : why do these names end in question mark? They
    behave more like SUBSTRING-INDEX{-CI} and other MEMQ-like procedures.

1. You typically use them as boolean predicates.
2. The name SUBSTRING is already being used.

    ;;; proposed additions (actually these were defined in R^2RS):

    (substring-move-left! string1 start1 end1 string2 start2)
    (substring-move-right! string1 start1 end1 string2 start2)

Got you covered; it's already there. See STRING-COPY!.  Here's the code; you
can see that no right/left variants are needed -- it does the right thing in
all cases.

(define (string-copy! to tstart from . maybe-fstart+fend)
  (string-let-start+end (fstart fend) string-copy! from maybe-fstart+fend
	(let ((tend (+ tstart (- fend fstart))))
	  (check-substring-spec string-copy! to tstart tend)
	  (if (> fstart tstart)
	      (do ((i fstart (+ i 1))
		   (j tstart (+ j 1)))
		  ((>= i fend))
		(string-set! to j (string-ref from i)))

	      (do ((i (- fend 1) (- i 1))
		   (j (- tend 1) (- j 1)))
		  ((< i fstart))
		(string-set! to j (string-ref from i)))))))

    From: Lars Arvestad <xxxxxx@nada.kth.se>
    I noticed the procedure pair string-for-each and string-iter [*]. My
    suggestion is that string-iter is dropped and that string-for-each is
    made do iterate from START to END. This is consistent with the list
    version of for-each, which is guaranteed to apply PROC in
    order. If a procedure can be applied in any order, it can't be a
    restriction to apply it in order, thus an unordered version of
    string-for-each is unnecessary.

    Are there efficiency/implementation advantages to have an unordered
    version? In that case, another name than string-for-each seems
    appropriate.

STRING-FOR-EACH is allowed to go right-to-left. Going right-to-left means the
index end-test is for zero, not the length of the list. Testing against zero
is faster and doesn't require you to hold the string-length in a register
across the iteration. Call me an efficiency freak. (And, yes, I know:
induction variable elimination nukes the zeroness of the r-to-l scan.
But no one does it in Scheme, so it doesn't really buy much these days...)

As for STRING-REDUCE & STRING-REDUCE-RIGHT: I defined them, but the
definitions made it clear that the functions were essentially useless.
So I have dropped them. Others may redefine & repropose them, if they
wish. Note that for reducing (in the sense of SRFI-1 list-lib) over strings,
the function F must be a char x char -> char function, and the right identity
value must be a character. I cannot come up with a single thing one would ever
want to do with such a function.

I have made two changes to the library:
  - I've added a small efficiency hack to the definition of
    REVERSE-STRING-CONCATENATE
  - I've added two support procedures for general Knuth-Morris-Pratt
    string search. These procedures make it easy to do KMP search
    through arbitrary character sources such as input ports.

    PS: Olin: Thanks once again for what looks like an excellent draft.

You bet! Thanks for the excellent feedback, y'all; I appreciate it.
    -Olin