Re: SRFI-13 string library round 3 -- request for comments & reviews David Rush 03 Dec 1999 10:16 UTC

I know that I'm late coming into this discussion, but I just had to
get a few comments in on the status of SRFI-13. First and foremost, I
pretty much like it. However...

Reading over it, I get a sense that there is a case of rampant
featuritis happening here. String utilities are *so* pervasive
that everyone has opinions and preferred interfaces. IMNSHO,
the SRFI could lose:

	- (Don't shoot me for this) all of the sub-stringiness. See
	  below for a justification.

	- char-set params; wrap 'em up in your own lambda, please. Is
	  it really more efficient to do this in the library, when
	  R5RS doesn't support the data type anyway?

	- functions of marginal utility. I know that this is a matter
	  of diverse opinion, but I suspect that there are more than a
	  few functions in there that only *rarely* get used by
	  anyone. My candidates are included below.
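
To make the "wrap 'em up in your own lambda" point concrete, here is a
minimal sketch. Both string-index-pred and one-of are hypothetical names
invented for this illustration; neither is part of SRFI-13 or R5RS. The
library procedure takes only a predicate, and the caller builds
whatever notion of "character set" it needs out of plain R5RS pieces.

```scheme
;; Hypothetical helper (not in SRFI-13 or R5RS): index of the first
;; character in S that satisfies PRED, or #f if there is none.
(define (string-index-pred s pred)
  (let ((len (string-length s)))
    (let loop ((i 0))
      (cond ((>= i len) #f)
            ((pred (string-ref s i)) i)
            (else (loop (+ i 1)))))))

;; The caller's own "char-set": a list of characters wrapped in a lambda.
;; MEMV returns a true value when C is one of CHARS, and #f otherwise.
(define (one-of . chars)
  (lambda (c) (memv c chars)))

(string-index-pred "key: value" (one-of #\: #\=))    ; => 3
(string-index-pred "hello world" char-whitespace?)   ; => 5
(string-index-pred "abc" (one-of #\:))               ; => #f
```

A linear MEMV scan is obviously slower than a real char-set for large
sets; whether that matters in practice is exactly the question above.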

Re: losing substringiness

I agree that this is desirable functionality, I just really think it
needs to be spec'ed elsewhere (SRFI-N). Shared-substrings are a
separate data type in other languages for good reasons (I can't speak
to Guile's issues; I use Guile, but I didn't even know they were
there, and the documentation *is* being rewritten), the simplest being
that they have different lifecycle issues from 'full'
strings. Besides, there are bunches of interfaces in SRFI-13 which
would be a lot cleaner without [start end] and the indirect binding of
substring to params e.g. (f s1 s2 start1 end1 start2 end2).
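
As a rough illustration of the cleaner interface, here is a sketch of a
two-argument procedure with no index parameters. The name
common-prefix-length is made up for this example and is not the
SRFI-13 procedure. A caller who wants sub-ranges slices with plain R5RS
SUBSTRING, which copies; avoiding that copy is what a separate
shared-substring type would be for.

```scheme
;; Hypothetical two-argument procedure: length of the common prefix
;; of strings A and B. No START/END parameters anywhere.
(define (common-prefix-length a b)
  (let ((n (min (string-length a) (string-length b))))
    (let loop ((i 0))
      (if (and (< i n)
               (char=? (string-ref a i) (string-ref b i)))
          (loop (+ i 1))
          i))))

;; Whole strings: nothing extra to pass.
(common-prefix-length "string-map" "string-fold")       ; => 7

;; Sub-ranges: the caller slices explicitly, paying for a copy.
(common-prefix-length (substring "a-string-map" 2 12)
                      (substring "string-fold" 0 11))   ; => 7
```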

Of course, because Olin is very clever, this will be mostly invisible
to the casual user of the library. I just feel very strongly that this
can be better treated in a shared-substring SRFI, which will probably
never get written if most of the functionality is already available in
this one.

Re: functions of marginal utility

In thinking about this list, I would have to say that it is prejudiced
by virtue of the fact that I use strings as grey boxes for
human-directed information. If you're using strings as a poor man's
byte-vector then you may indeed want some of these rather more than I
do. I would again ask the question: Does this functionality fit better
as part of a different package?

string-map - In 15 years of programming, I can't think of *once* where
	I have used something this general. string-[up/down]case, yes,
	generalized map, no. Not a big deal, but exemplary.

string-fold/unfold - these are actually cool functions. One constructs
	a data structure from a string, and the other builds a string
	from a data structure. I just can't get over the nagging
	feeling that these are more about string parsing than about
	string manipulation. And Olin suggested a parsing SRFI...

string-tabulate - ??? I grok it, just can't see the utility

string-for/do-each - No great complaint, I just think that since
	you're almost certainly doing stateful processing inside the
	loop that your code will probably be more readable with a
	coded loop than with a function call. OTOH, this does present
	nice possibilities when coupled with call/cc.

string-compare - how much does this add to string-{pre/suf}fix? I find
	it to be fairly cool (handling ordering relationships is one
	of my pet programming peeves), I just don't think I'd ever end
	up using it.

string-capitalize - I agree with Olin. Ditch it. It has too much to
	do with natural language rules, and not enough to do with
	string manipulation.

string-{filter/delete} - I feel much the same way about this as I do
	about string-map.

Re: string-tokenize/string-split

>    From: xxxxxx@pobox.com
>
>    Suppose we have a string 'str' consisting of tokens separated
>    by a #\: character. We can extract the tokens using either
>
>        (string-tokenize str
>                (char-set-difference char-set:full (char-set #\:)))
>    or
>        (string-split str (char-set #\:))
>
>    the two procedure calls above are indeed roughly equivalent;
>    therefore, a String library should define only one of them.

Well, that does not appear to be a guiding principle from my read of
the SRFI document.

>   It indeed appears that some problems lend themselves to
>   delimiter-based parsing while the others do to the inclusion
>   semantics.

This is the telling argument. As you may have guessed I am in favor of
this proposal. In fact, I would go so far as to say that the delimited
approach should be retained and the inclusion approach abandoned;
inclusion is both better & more readily addressed via regexp's. The
inclusion case frequently involves far more specific subsets than the
standard available char-set:*s, and using them will require complex
composition of char-sets involving more overhead.

OTOH, regexp's are generally less 'dynamic' as in you generally don't
recompute them on the fly, which you could more easily do with
string-tokenize's char-sets.

> Unicode is important! But... one possible reply is that Gambit's
> char-set implementation needs to be improved.

This seems to be the tail wagging the dog.

> You're *still* right in your larger point that I could split at colons more
> easily with a string-split. But even after reading your comments, when I sit
> down to try and design a procedure or two that does the basics, I still go
> helplessly sliding down a slippery feature slope.

I believe it. I think that this might simplify by eliminating the
sub-string features of this SRFI (which I advocate for other reasons,
anyway).

> You *have* to allow control of the delimiter grammar -- separator,
> terminator, prefix

No, you don't. This is not necessarily an inverse of string-join,
although it would be nice. I'd estimate that 99% of the cases involve
infix splitting. Terminator splitting is a trivial special case of
infix. I'm not sure that prefix splitting would account for even 0.1%
of all the cases.

> START/END indices?
Ditch 'em.

> If we are going to quit early (via MAXSPLIT), we need a way to tell
> the client how far into the string we got.
No. It's already there in Oleg's proposal because the last returned
string in a MAXSPLIT case is all the remaining text.

> On the other hand, I'm not happy with returning the rest of the
> string as a final element of the return list.
Why?

> One of these things is not like the other...
What thing is not like which other?

> Not to mention that it requires copying the data.

Only if you already had to copy it in the first place.

Well, OK, here's my best shot (which is not too good) (plagiarized
from Olin and twisted to my other prejudices as outlined above):

(string-split s char/predicate [grammar max-tokens]) -> string-list

- GRAMMAR is 'infix, 'suffix, 'prefix, or 'strict-infix
  Defaults to 'infix.
- MAX-TOKENS is an integer saying "quit after this many tokens"; #F means
  infinity. Defaults to #f. Oleg's convention (last element of list
  has remaining text) applies.
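
A minimal sketch of the common case, to show how little code it takes.
Assumptions, all mine: only the 'infix grammar is handled, the GRAMMAR
parameter is left out entirely, MAX-TOKENS is read as a cap on the
length of the result (a positive integer or #f), and the procedure is
called my-string-split so that it does not collide with any
string-split an implementation already provides. This is not Oleg's or
Olin's code.

```scheme
;; Sketch only: infix splitting, no START/END, no ELIDE-DELIMS.
;; DELIM is a character or a one-argument predicate on characters.
;; MAX-TOKENS (optional; positive integer or #f) caps the length of the
;; result; when the cap is hit, the last element is the unsplit
;; remainder. The empty string is not treated specially here.
(define (my-string-split s delim . rest)
  (let ((delim? (if (char? delim)
                    (lambda (c) (char=? c delim))
                    delim))
        (max-tokens (if (pair? rest) (car rest) #f))
        (len (string-length s)))
    ;; START marks the beginning of the current token; COUNT is the
    ;; number of tokens the result will have if we stop right now.
    (let loop ((start 0) (i 0) (count 1) (acc '()))
      (cond ((or (>= i len)
                 (and max-tokens (>= count max-tokens)))
             (reverse (cons (substring s start len) acc)))
            ((delim? (string-ref s i))
             (loop (+ i 1) (+ i 1) (+ count 1)
                   (cons (substring s start i) acc)))
            (else
             (loop start (+ i 1) count acc))))))

(my-string-split "a:b:c" #\:)                      ; => ("a" "b" "c")
(my-string-split "a::b" #\:)                       ; => ("a" "" "b")
(my-string-split "a:b:c" #\: 2)                    ; => ("a" "b:c")
(my-string-split "one two three" char-whitespace?) ; => ("one" "two" "three")
```

Note that the MAX-TOKENS case needs no extra return value: the caller
can see how far the split got from the last element of the list.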

> - ELIDE-DELIMS is boolean, meaning runs of delimiters count as
>   single delimiter. Defaults to #t.

Overkill, and it violates inverse-ness for string-join. The only case
we want it for (in general) is handling white space. I could take it or
leave it, but it should be the last parameter (most specific
case). Now that I'm thinking, I'd probably take it in favor of
string-split-ws...

> This is powerful. It's good to have an inverse for STRING-JOIN.
Yep.

> It's a heck of a lot of parameters. Does anyone besides Oleg want to
> push for it?
Count me in.

david rush
--
A camel is a horse designed by committee...