Round 3 discussion & open issues shivers@xxxxxx 22 Nov 1999 02:10 UTC
Re: Round 3 discussion & open issues d96-mst@xxxxxx 23 Nov 1999 21:18 UTC

Round 3 discussion & open issues shivers@xxxxxx 22 Nov 1999 02:10 UTC

Here is a list of current issues & changes & requests for comments for the
SRFI as it now stands. I've quoted relevant email from others that has raised
particular issues. All changes I've made since I put out the "round 2"
messages are described herein, but you can refer to the latest SRFI draft
for the specifics of any given proc.

I am *very* appreciative of all the valuable reviews people have provided.
Thanks for taking the time and thought; I think the current draft has been
significantly improved from my initial proposal. I welcome further comments
on the current draft as it now stands.

Here's a very brief summary of recent changes; read the body of this msg
for details:


    - CAPITALIZE-WORDS dumped.

    - Twenty-two SUBSTRING-... procs killed; companion STRING-... procs
      generalised to handle double start/end optional indices.
      This is the single biggest change, I'd say. Please review.

    - SUBSTRING? replaced by STRING-CONTAINS for complex reasons pertaining
      to the previous change.

    - right-to-left search procs (STRING-{INDEX,SKIP}-RIGHT) now take
      their start/end indices just like everyone else.

      SRFI-13 now has *no* backwards incompatibilities with R5RS bindings.

    - STRING-FOR-EACH guaranteed left-to-right.
      STRING-ITERATE gone; STRING-DO-EACH is order-unspecified.

    - STRING-JOIN now accepts 'prefix grammar.

    - I'm looking for someone to send me Boyer-Moore search code. C'mon; get
      your name into the acknowledgments section...

    - I might add a too-complex STRING-SPLIT proc if people push for it.

    - CHECK-SUBSTRING-SPEC (which raises an error) now accompanied
      by predicate SUBSTRING-SPEC-OK?.

    - STRING-PARSE-START+END proc s args -> [rest start end]
      Now returns the rest value as the first return value. No one
      cares. Don't worry about it.

Long-suffering Mike "Judean People's Front" Sperber will very quickly move my
latest draft from
to the official SRFI-13 site at no one should go looking at the former URL.

Or was that the People's Judean Front?

    From: erik hilsdale <>

    The procedure-names in the library obey a very nice naming convention:
    [sub]string-<verb/noun>[-ci][!] for the most part.

Yep, that is by design.

    This is why I was surprised to see some breaking of these rules.  For
    consistency, I would strongly support renaming the following procedure:

	join-strings       ==> string-join

Hmm... the stick-with-the-convention-dammit rule is a powerful one.
I would counter by saying that the lexeme isn't "string", it's "strings",
hence the convention doesn't really apply; you mutated the name to get it
into the ballpark.

I'm a little uncomfortable with the shift, but I guess it's the right thing.
I'm changing over to Eric's STRING-JOIN. If people hate this, please lobby
me and get me to change it back.

    From: erik hilsdale <>
    I would also support renaming these two

	    capitalize-string  ==> string-capitalize
	    capitalize-string! ==> string-capitalize

    except that would cause a strange discontinuity with
    capitalize-words[!].  So I probably wouldn't rename these two unless I
    could convince everybody to drop capitalize-words[!] entirely, which
    I'll try to do below when I get to individual procedures.

    ---- capitalize-string[!], capitalize-words[!]

    I just don't see a reason for having these procedures in this library,
    _especially_ capitalize-words.  They seem much more suited for a
    'natural language' sublibrary or the like.  I'm not sure I can argue
    cogently why they don't belong, but they stick out a bit to me, where
    string-upcase and string-downcase don't.  Capitalize-string[!] I can
    see as somewhat useful, but capitalize-words[!], for me, opens the
    same kind of bag of worms that string-split does.

On reflection, I can't think of cases where people are really going to
be dying to use CAPITALIZE-STRING[!]. Should I bag it? Eric says yes.
I am leaning that way. Further opinions?

Unless I'm subsequently persuaded otherwise, I'm punting CAPITALIZE-WORDS,
which means I can go ahead and rename CAPITALIZE-STRING to STRING-CAPITALIZE.

I'm going to leave the code in the reference implementation, commented out.
I mean, I wrote it & debugged it... someday someone will want it.
    From: erik hilsdale <>
    The rule that is almost always followed is that if there is a _single_
    string argument to a function, there will be optional start/end
    arguments for that string.  [...]

    However, although adding the optionals to string-<take/drop>[-right]
    seems goofy, I'll ask for it anyway.

    Because my Scheme doesn't have shared substrings, when I start passing
    around start/end pairs to my functions, I do so for efficiency, and I
    really do treat the three arguments 's start end' as one:

      (define foo
	(lambda (s start end) ;; one conceptual argument

Right! That's one of the stated design goals -- to allow you to get the
effect of shared-text string hacking with consistent start/end indices.

In my opinion, your design heuristic -- "I don't want to think" -- is an
extremely good guideline.

On the other hand, if you *are* going to be coding with start/end indices,
then forget TAKE & DROP. Just use SUBSTRING. It's really just as convenient.
Why write
   (string-take s n start end)
   (substring s start (+ start n))
is equivalent, modulo the fact that the bounded-take operation would
raise an error in some cases where the explicit SUBSTRING would not.

The TAKE and DROP ops are convenience functions, for certain patterns of
SUBSTRING calls. As soon as you go to substring indices, their convenience
worth goes away.

    From: erik hilsdale <>
    ---- Mandatory 'start2 end2' arguments for multi-string procedures

    All procedures that take more than one string, if they accept
    start/end pairs for those strings, _require_ all start/end pairs.
    That is, none of them is optional.

    Except for string-replace, where [start2 end2] is optional. .  This is
    a bit strange.  Certainly, substring-compare[-ci] can't have optional
    [start2 end2] arguments, since the continuation arguments come last.
    But what would be wrong with

    (substring= mystr mystr-start mystr-end "foo")

    So it seems to me that the domains of
    should be changed from
       s1 start1 end1 s2 start2 end2
       s1 start1 end1 s2 [start2 end2]

I have taken a more radical step. I've shifted both pairs of start/end indices
for all two-string procs to the end of the list. This has a *big* effect:
no more substring-variants! Kills many, many procedures from the library.
For example, instead of both
    string= s1 s2
    substring= s1 start1 end2 s2 start2 end2
we now have only
    string= s1 s2 [start1 end1 start2 end2]

Here are the procedures that get killed
    substring-compare	substring-compare-ci

    substring= 		substring-ci=
    substring<>		substring-ci<>
    substring< 		substring-ci<
    substring> 		substring-ci>
    substring<=		substring-ci<=
    substring>=		substring-ci>=

    substring-prefix-length	substring-prefix-length-ci
    substring-suffix-length	substring-suffix-length-ci

    substring-prefix?	substring-prefix-ci?
    substring-suffix?   substring-suffix-ci?

Here are the procedures that get correspondingly extended:
    string-compare    s1 s2 proc< proc= proc> [start1 end1 start2 end2] -> values
    string-compare-ci s1 s2 proc< proc= proc> [start1 end1 start2 end2] -> values

    string=  s1 s2 [start1 end1 start2 end2] -> boolean
    string<> s1 s2 [start1 end1 start2 end2] -> boolean
    string<  s1 s2 [start1 end1 start2 end2] -> boolean
    string>  s1 s2 [start1 end1 start2 end2] -> boolean
    string<= s1 s2 [start1 end1 start2 end2] -> boolean
    string>= s1 s2 [start1 end1 start2 end2] -> boolean

    string-prefix-length    s1 s2 [start1 end1 start2 end2] -> integer
    string-suffix-length    s1 s2 [start1 end1 start2 end2] -> integer
    string-prefix-length-ci s1 s2 [start1 end1 start2 end2] -> integer
    string-suffix-length-ci s1 s2 [start1 end1 start2 end2] -> integer

    string-prefix?    s1 s2 [start1 end1 start2 end2] -> boolean
    string-suffix?    s1 s2 [start1 end1 start2 end2] -> boolean
    string-prefix-ci? s1 s2 [start1 end1 start2 end2] -> boolean
    string-suffix-ci? s1 s2 [start1 end1 start2 end2] -> boolean

The argument order in
    string-replace s1 s2 [start1 end1 start2 end2] -> string
also has to be altered to the new way of things.

Something is lost -- there is a clarity benefit in putting the indices
right next to the string to which they refer -- but some things are
    - We eliminate 22 procedures from the library! Fantastic.
    - Generality in calling conventions.
The clarity issue is something to consider, because if you write
    (foo s1 s2 start end)
in the new world, the START/END parameters apply *not* to S2 (where they
appear), but to S1.

Another negative of this shift is that SUBSTRING? has to have its
arguments altered to take both s1 *and* s2 start/end indices -- the old
version only took s2 (target) indices. Here is the new version:
    substring?    s1 s2 [start1 end1 start2 end2] -> integer or false
    substring-ci? s1 s2 [start1 end1 start2 end2] -> integer or false
Unfortunately, I think it's much more common to search for the occurrence of
a complete string within some substring than vice-versa, so you have to
write the 0/(string-length s1) args down to be able to specify the selected
search span in S2:
     (substring? "foo" s2
                 0 3		; Redundant,
		 start2 end2)	; so we can specify these.
So I am going to punt this procedure in favor of
    (string-contains s1 s2 [start1 end1 start2 end2]) -> integer or boolean
This flips the parameters -- searches for an occurrence of S2 in S1. So the
common case becomes
    (string-contains s1 "foo" start1 end1)
Finally, there's a not-so-obvious advantage to this shift: SUBSTRING? is
a misnomer, because it returns an integer for its true value. The usual
convention is that these sorts of procedures don't end in "?". But we
*can't* observe this convention in this particular case -- SUBSTRING is
already in use. We *can* implement this convention with STRING-CONTAINS.

    From: Jonathan Sobel <>

    * [end start] versus [start end]: I definitely vote for keeping the
      same order as everywhere else.  I know there are reasons to reverse
      it, but I won't remember them while I'm programming.

    From: erik hilsdale <>
    ---- Optional [end start] arguments for index and skip

    Because the optional [start end] convention is so firmly established
    in the library, it hurts me dearly to see [end start] in the
    index-right and skip-right procedures.  Olin's defense that the start
    argument is almost never specified for these doesn't sway me, and the
    fact that I'll get a runtime error when I try it just irritates me.

    When I'm using the 's start end' convention to represent my strings,
    I'm not going to remember that these two arguments need to be reversed
    for these two procedures, and if I do I'll curse the fact that I've
    devoted brain cells to the exception.

    When I'm not using this convention and just using the optional
    arguments 'casually', I at least would be happy to always write a zero
    for my start point whenever I want to specify my end point.

OK, OK. Done. See the SRFI.

At least the redundant thing is a short constant (0), instead of
a (string-length s) argument.

    From: erik hilsdale <>
    ---- Overflows

    Questions about some overflow cases:

    (string-take "foo" 5) ==> "foo" or error?
    (string-drop "foo" 5) ==> "" or error?

    (string-copy! "xx" 0 "yyy") ==> I assume an error, but am worried by
				    the language
				    'the copy is guaranteed to work'.

In these cases, we freeze your assets, delete your home directory & shoot your
dog. All errors. I have added text to this effect to the SRFI.

    From: erik hilsdale <>
    ---- Sharing

    I'll cast my vote for the liberal camp.  However, I feel obligated to
    get a little wacky about it and ask why we're providing


    If we're breaking R5RS's substring, I see no reason not to also break
    R5RS's string-append.  That is, by voting 'liberal' I believe that

	    (eq? foo (string-append "" foo))
	    (eq? foo (string-concatenate (list "" foo)))
	    (eq? foo (reverse-string-concatenate (list "" foo)))

    should all be allowed to be true.  The problem with my firm stance is
    that I have no idea what the 'non-shared' versions should be named.
    string-append/copy?  string-append/fresh?

    From: Jonathan Sobel <>

    * I vote liberal on the copying issue.  In fact, I go along with Erik
      and say we ought to be able to assume sharing unless a specific copy
      is requested.  I'd be willing to drop the multiple versions of the
      procedures with "/shared" or "/fresh" or whatever and do an explicit
      "string-copy" if I want something new.  If my compiler has already
      made a copy, it can optimize my request away.

OK, I wimped out & went conservative. This library does *not* contain any
appropriate.  SRFI-13 now has *no* backwards incompatibilities with R5RS *at

For summary, here is a list of the procedures that are allowed to return
values that share storage with their parameters. As the comments indicate,
it should be straightforward to program in either a style that permits
sharing or requires fresh copies:

    string-take string-take-right
    string-drop string-drop-right
        (For fresh-copy semantics, use STRING-COPY instead of these procs.)

    string-append/shared string-concatenate/shared
	(Fresh-copy variants all exist.)

    string-pad   string-pad-right
    string-trim  string-pad-right string-pad-both
    string-filter string-delete
        (No equivalent guaranteeing fresh copy; not deemed worth it.
	 For fresh copies, wrap a call to STRING-COPY around the application.)

    From: erik hilsdale <>
    ---- string-for-each, string-iterate

    I'm definitely with Lars Arvestad on this one.  String-for-each should
    require a left-to-right ordering.  I mainly argue this because I've
    now sent approximately five years-worth of undergraduates out into the
    world with the factoid that they should use 'for-each' when they care
    about evaluation order.  I'm willing to pay the price of an extra
    register-register comparison for that clarity.

    Well, I think I'm willing to do that, anyway.  I'd be happier if there
    were a good, non-'for-each' identifier that would connote
    unorderdness.  Sigh.

    From: Jonathan Sobel <>
    * "for-each" means left-to-right.  What about "string-apply" for the
      thing that just applies the given procedure to all the characters,
      in no particular order and with no particular result?

    From: (Mikael St��ldal)
    * I think that string-for-each should work form left to right and that
    no other string-iter[are] procedure should be included.

OK, backwards compatibility takes this round. STRING-FOR-EACH is
left-to-right.  How about STRING-DO-EACH for the variant that doesn't specify
order?  You're going to see this lexeme again in vector, set, etc. SRFIs, so
let's figure one out with which we can live, eh?

Got a better name for unordered application than -do-each? Send it in.
The T guys tended to use "-walk", as in
But their -walk procedures frequently take key *and* value parameters (e.g.,
the hash-table one), so perhaps we should save this lexeme for that purpose.

    From: erik hilsdale <>
    ---- string-null?

    I worry (probably needlessly) about people confusing the 'null' lexeme
    with the character with ascii value 0 that C programmers use to
    terminate their strings.

    Or even worse, someone confusing it with the terminal value of a
    recursive datatype (grin).

    I was going to argue for 'empty-string?', but that would conflict with
    the 'string-first' naming convention.  How about 'string-empty?'?

    From: Jonathan Sobel <>
    * I vote for "string-empty?" instead of "string-null?".

I could go with STRING-EMPTY? or STRING-NULL?. I'll stay with STRING-NULL?
for the tiny tie-breaker reason that it's one column shorter. If I get more
support for STRING-EMPTY?, I'll make the change. For now, I'll leave things

    From: erik hilsdale <>
    ---- join-strings

    Why no 'prefix grammar?

I've never seen anyone use one. But what the heck; why not. It's a *trivial*
addition to the code, and it gives generality without adding to the size of
the API. Of marginal utility, but free. And someone, sometime will want it.

    From: (Mikael St��ldal)
    * I don't see the point in [sub]string-compare[-ci]. Can anybody give a
    sensible example of their use?

Sure, here's the first one that came to mind: there's an optimisation of
quicksort wherein you partition the vector not into *two* regions (the <=pivot
left region, and the >=pivot right region), but *three*: a <pivot left region,
an =pivot middle, and a >pivot right region. Then you only recurse on the left
& right regions. You need a non-boolean comparison function for this variant
of quicksort -- it has to return <,=,> on a given comparison.

(I leave the coding of this three-way partition as an exercise; it's kind of

    From: (Mikael St��ldal)
    * I think that the string-trim and string-pad procedures should be
    named string-trim-left and string-pad-left. I don't think that left is
    an obvious default for trimming and padding (as it is for string-index,
    string-skip etc).

I agree -- left is *not* the obvious default. But it doesn't matter.
Got to stick to the convention we established with much anguish and
sweat in SRFI-1: the default direction is left-to-right, and gets the base name
with no further directional lexeme; right-to-left directional variants are
given names with a -RIGHT suffix. Simple rules so Eric won't have to think.

    From: (Mikael St��ldal)
    * I think that the let-string-start+end macro should be left out of the
    standard. It's easily enough to do it anyway with the
    string-parse-start+end procedure. And it might create problems to
    include a macro.

Lots of SRFIs define macros. No sweat.

    From: (Mikael St��ldal)
    I think that the KMP searching procedures should be generalized to
    allow for other algorithms than KMP. But only to algorithms that have
    the same properties as KMP, i.e. scanning a stream without
    backtracking. This excludes Boyer-Moore, but allows for some recent
    algorithms that are said to be faster than KMP[1].

Here's the price of admission. Someone has to
       - Send me working B-M Scheme code, under public domain, GPL,
         Berkeley-style license, something along these lines.
       - Explain in the comments or the email how B-M works, so I
         can understand the code.
Then I'll do it. I'll humbly admit: I originally considered what you propose,
but really don't know beans about string-search algorithms, besides KMP. So I
didn't do it.

    The interface could be something like this:

    make-search-object c= s [start end] -> opaque "search object"
	Returns an opaque "search object" that can be used to search
	for S in a stream/string. The search object includes S itself
	if nessesary (nessesary for KMP, but not for [1]).

    search-step search-object c= c search-state -> bool or opaque "search state"
	Performs a step in the search, taking an opaque "search state"
	object and returns a new "search state" object. Use '() as search state
	for the first time.

    Or why not like this:

    make-search-proc c= s [start end] -> search procedure
	Returns a search procedure that can be used to search
	for S in a stream/string.

    <search-proc> c -> bool
	Performs a step in the search, updating an internal search state.

    <search-proc> '() -> '()
	    Reset the internal state of the search procedure.

    [1] Sun Wu, Udi Manber: "Fast text searching allowing errors",
    Communications of the ACM, October 1992.

There are some problems with this spec. The search-step function is an
inner-loop kind of utility. That's why I made people pass in the string
explicitly with KMP-STEP -- you don't have to unpack it from some record
structure, and you can win better if you integrate/beta-substitute KMP-STEP.
Also, KMP-STEP doesn't have to allocate heap storage on every call, to
make a new search state. That's just not acceptable for quickly searching
a string.

The MAKE-SEARCH-PROC spec has other problems.  For one, it's fundamentally
side-effecting. This way does not lie efficiency. Nor thread safety, which
is a really big flaw.

In spite of these tactical issues, your larger point is an excellent one.  So
I'll tackle this generalisation, if I can -- but not until I see Boyer-Moore.


    Suppose we have a string 'str' consisting of tokens separated
    by a #\: character. We can extract the tokens using either

	(string-tokenize str
		(char-set-difference char-set:full (char-set #\:)))
	(string-split str (char-set #\:))

    the two procedure calls above are indeed roughly equivalent;
    therefore, a String library should define only one of them. Let's
    assume however that the string 'str' in question is in Unicode. It's
    not that far-fetched; Gambit and Kawa for example support Unicode. It
    appears that the two above procedures would not be equivalent as far
    as computing resources and efficiency are concerned. Gambit, for
    example, implements char-sets internally to parse input. Low-ascii
    characters are marked in a vector, while other characters are in a
    list. In this implementation, char-set:full, or char-set:full without
    one character are quite large (given the full Unicode). It may take
    quite a while to do a membership test in such a set. In contrast,
    (char-set #\:) is far smaller and more efficient. It indeed appears
    that some problems lend themselves to delimiter-based parsing while
    the others do to the inclusion semantics.

Unicode is important! But... one possible reply is that Gambit's char-set
implementation needs to be improved. For example, if the char-set
implementation was a bit-vector for Latin-1 and a run-length-encoded vector
(or tree) of spans for the high chars, we would crush this particular
split-at-colons task, right? You'd have the high chars all included in one

You're *still* right in your larger point that I could split at colons more
easily with a string-split. But even after reading your comments, when I sit
down to try and design a procedure or two that does the basics, I still go
helplessly sliding down a slippery feature slope. You *have* to allow control
of the delimiter grammar -- separator, terminator, prefix. START/END indices?
If we are going to quit early (via MAXSPLIT), we need a way to tell the client
how far into the string we got. I'm uncomfortable with functions that return
one value sometimes, and two values other times (though I've done it on
occasion).  On the other hand, I'm not happy with returning the rest of the
string as a final element of the return list. One of these things is not like
the other... Not to mention that it requires copying the data.

The particular functionality encapsulated in the STRING-SPLIT &
STRING-SPLIT-WS seems arbitrary. But I'm not smart enough to do

Note that white-space splitting is already easily handled:
     (string-tokenize s char-set:graphic)
This doesn't handle your MAXSPLIT feature, but otherwise gives

MAXSPLIT is the most useful thing about your proposal; there's no analog
in STRING-TOKENIZE, and it's a handy bit of functionality. But I'm going
to let this drop because I just can figure out how to design this properly.

Well, OK, here's my best shot (which is not too good):
    (string-split s [cset/char/predicate grammar elide-delims
                     max-tokens start end]) -> [string-list i]
- CSET/CHAR/PREDICATE defaults to char-set:white-space
- GRAMMAR is 'infix, 'suffix, 'prefix, or 'strict-infix
  Defaults to 'infix.
- ELIDE-DELIMS is boolean, meaning runs of delimiters count as
  single delimiter. Defaults to #t.
- MAX-TOKENS is an integer saying "quit after this many tokens"; #F means
  infinity. Defaults to #f.
- I is where to resume parsing if we quit early. the default parse is tokens separated by runs of white-space.

This is powerful. It's good to have an inverse for STRING-JOIN. It's
a heck of a lot of parameters. Does anyone besides Oleg want to push for it?
I'll put it in if there's support; the inverse-pairing-with-STRING-JOIN
argument is a good one.

	    BTW, the same approach -- define the most generic and a few
    most common procedures -- applies equally to string->integer.

    > This is a can of worms. string->integer is undoubtedly useful. But so
    > is string->floating-point. What about base? Return #f or raise an
    > error on bad syntax?

    Well, if someone wants to get a floating-point number, or deal with
    base, bitstrings etc., he should use string->number. string->integer
    is only to parse a string of digits. Period. It's a one-trick
    pony. string->integer never raises any error: if it does not like
    something it returns #f.

    Speaking of string->number, do you want to make it accept (start, end)
    arguments? This could be quite convenient...

Right you are! It drives me crazy when I'm parsing numbers out of strings
to have to copy storage just to do a numeric parse. But I'm going to leave
string->number alone; see below.

	    Why is string->integer singled out, of all the possible
	    - Because, IMHO, dealing with a sequence of digits
    is quite common;
	    - Because this conversion can be made rather efficient
    (base 10 can be hardwired; multiplication by 10 can be done with two
    shifts and an addition).
	    - Because it keeps one from surprises. For example, I had to deal
    with forms made of groups of digits. Occasionally some other characters
    may crop up; in which case I had to report an error intelligently.
    I wrote
	    (let ((val (string->number token)))
	      (and (integer? val) (do-deal-with val)))
    Imagine my surprise when the token happened to be "1S2" yet the val passed
    the integer? test. Eventually the faulty token triggered another check
    and the error got reported, but recovery became more complex. Also,
    imagine what happens given the token "1/0". You're either at the mercy
    of your implementation's exception system, or you have to prescan the
    token to make sure it is indeed made of digits.

Wrong tool for the job. In this situation, I'd suggest doing a syntax
check *before* attempting conversion. I.e.
      (if (string-every char-set:digit token)
          (string->number token)
          (deal-with-bogus-token token))
...or using a regexp matcher to check out the token, if you want to admit a
more complex grammar.

Hmm. I *do* think an atoi() analog is a handy thing. But perhaps this
should be left to a whole little parser package, or to a numeric package --
datatype operation sets should include parser/unparsers, right? I am not
satisfied with R5RS's numerics, either (I'd like to see ++, --, <>,
two-value divmod, and two-argument floor, ceiling & truncate -- so perhaps
a numeric SRFI with numeric parser/unparsers is in order). Can we table
atoi, itoa, atof, ftoa to that day? Then we could really tackle the issues
of base, start/end indices, zero/blank padding/field-width, and error

Richard Kelsey has proposed the lexeme "replication" for the xsubstring
and string-xcopy! procs. So here are some choices:

    xsubstring		replication-substring		substring/replication
    string-xcopy!	replication-string-copy!	string-copy!/replication

I dunno. These are complex procs; there's not going to be a clear, simple name
like "append" or "reverse" that covers them, so that issue is not really on
the table. I'm going to leave them as is.