I have altered the SRFI to make the following changes:
- "prefix-count"/"suffix-count" lexeme changed to
"prefix-length"/"suffix-length"

    These                         are now these
    --------------------------    ---------------------------
    string-prefix-count           string-prefix-length
    string-suffix-count           string-suffix-length
    string-prefix-count-ci        string-prefix-length-ci
    string-suffix-count-ci        string-suffix-length-ci
    substring-prefix-count        substring-prefix-length
    substring-suffix-count        substring-suffix-length
    substring-prefix-count-ci     substring-prefix-length-ci
    substring-suffix-count-ci     substring-suffix-length-ci

- string-comparison functions now return a simple boolean

    string<>      string=       string<       string>       string<=       string>=
    string-ci<>   string-ci=    string-ci<    string-ci>    string-ci<=    string-ci>=
    substring=    substring<>   substring-ci=    substring-ci<>
    substring<    substring>    substring-ci<    substring-ci>
    substring<=   substring>=   substring-ci<=   substring-ci>=

  Note that these comparison functions still return a mismatch index:

    string-compare      substring-compare
    string-compare-ci   substring-compare-ci

What follows is general discussion and replies to msgs from Oleg & Dan.
-Olin
-------------------------------------------------------------------------------
From: xxxxxx@pobox.com
If I may I'd like to propose two more functions,
string->integer and string-split.
A general STRING-SPLIT is just too complicated for me. Here are the variants
we want to support:
- Variant grammars, e.g. tolerant-infix, strict-infix, and suffix
- Optional substring indices
- Number of fields to parse. You might want
- as many as exist,
- exactly N, or error,
- at least N, or error.
- Do contiguous runs of delimiter chars make a single delimiter, or
do they designate empty-string tokens?
Scheme makes it difficult to have different, independent sets of optional
args, since you have to order them. The "field parser" utilities I wrote
for scsh's awk utility (see the scsh manual) handle all this complexity,
and more -- you can specify tokens or delimiters with general regexps, not
just char sets. But this is much hairier machinery than I feel is appropriate
for a basic string library.
I dodged at least the grammar issue with STRING-TOKENIZE by having you
specify not the separator chars but the token chars -- contiguous runs
of token chars make a token. End of story. And I just punted the
number-of-fields issue, leaving only the substring indices as possible
optional args, so things worked out -- given my low ambitions.
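
The "runs of token chars make a token" rule can be sketched as follows. This is an illustration only, not the SRFI's STRING-TOKENIZE: the name TOKENIZE-BY, its argument order, and the use of a predicate in place of a char set are all assumptions made for the example.

```scheme
;; Hypothetical helper, not part of the SRFI. TOKEN-CHAR? is a
;; predicate standing in for a char set of token characters.
(define (tokenize-by token-char? s)
  (let ((len (string-length s)))
    (let loop ((i 0) (start #f) (acc '()))
      (cond ((= i len)
             ;; End of string: flush a token that is still open.
             (reverse (if start (cons (substring s start len) acc) acc)))
            ((token-char? (string-ref s i))
             ;; Inside a token: remember where it began.
             (loop (+ i 1) (or start i) acc))
            (start
             ;; A non-token char closes the open token.
             (loop (+ i 1) #f (cons (substring s start i) acc)))
            (else
             ;; Between tokens: nothing to record.
             (loop (+ i 1) #f acc))))))

;; Picking out non-whitespace tokens:
(tokenize-by (lambda (c) (not (char-whitespace? c))) "  foo  bar baz ")
;; => ("foo" "bar" "baz")
```

Because the caller names the token chars rather than the separators, there is no question about what a run of separators means: it never produces an empty-string token.
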
Some problems
are more elegantly and efficiently expressed in terms of inclusion,
others in terms of delimiting. I found, for example, that in
Perl and Python split() is a rather often-used function.
Yeah, perl hackers use split() a lot, for sure. But the char-set SRFI provides
a CHAR-SET:GRAPHIC set, which makes it as easy to use STRING-TOKENIZE to pick
out non-whitespace tokens as it is for perl hackers to use split() to break
tokens at whitespace. So I really think STRING-TOKENIZE is going to take care
of you for the simple cases, and if you've got fancier requirements... then
you probably oughta code up a little custom parser for your app, anyway.
R5RS procedure string->number is far more generic than the
proposed string->integer -- and this may be a problem IMHO. For
example, string->number will try to read strings like "1/2" "1S2"
"1.34" and even "1/0" (the latter causing a zero-divide error). Note
that to Gambit's string->number, "1S2" is a valid representation of an
_inexact_ integer (100 to be precise). Oftentimes we want to be more
restrictive about what we consider a number; we want merely to read an
integral label.
-- procedure+: string->integer STR START END
Makes sure a substring of the STR from START (inclusive) till END
(exclusive) is a representation of a non-negative integer in decimal
notation. If so, this integer is returned. Otherwise -- when the
substring contains non-decimal characters, or when the range from
START till END is not within STR, the result is #f.
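
One possible reading of the specification above, as a sketch rather than a reference implementation. Two points are assumptions, because the text does not settle them: an empty range (START = END) is treated as "not a number" and yields #f, and only the ASCII digits 0 through 9 count as decimal characters.

```scheme
;; Sketch of the proposed procedure; see the assumptions stated above.
(define (string->integer str start end)
  (and (<= 0 start)
       (< start end)                    ; assumption: empty range => #f
       (<= end (string-length str))
       (let loop ((i start) (n 0))
         (if (= i end)
             n
             (let ((c (string-ref str i)))
               ;; Any non-digit makes the whole result #f.
               (and (char<=? #\0 c)
                    (char<=? c #\9)
                    (loop (+ i 1)
                          (+ (* n 10)
                             (- (char->integer c)
                                (char->integer #\0))))))))))

(string->integer "abc123" 3 6)   ; => 123
(string->integer "1/2" 0 3)      ; => #f
(string->integer "12" 0 5)       ; => #f  (range not within STR)
```
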
This is a can of worms. string->integer is undoubtedly useful. But so is
string->floating-point. What about base? Return #f or raise an error on
bad syntax?
Bornstein had a nice summary of the complexities involved:
I don't like this particularly. I can think of a kabillion variants on
parsing strings into numbers that I might find useful. The one that's
built-in is the right one since it's about Scheme read form (which you
gotta implement anyway). The moment you step into the territory of other
number formats, you should be ready to define a full suite of procedures to
deal with the plethora of possibilities.
> [SRFI-13]
> string-concatenate string-list -> string
> Append the elements of STRING-LIST together into a single _list_.
> Guaranteed to return a freshly allocated _list_.
Did you mean to say a 'string' (instead of a _list_)?
Yes, you are quite correct. Thanks; I've fixed the text.
SRFI-13 mentions that string-unfold is also called "anamorphism".
Do you want to point out that a foldr combinator (e.g.,
string-fold-right) is also called a "catamorphism"?
Excellent! Done.
From: Dan Bornstein <xxxxxx@milk.com>
Olin writes:
>C'mon. Do you really think that people would use STRING-SET ?
>STRING-FILL is an easier case to make. Let's see, that would be
Actually, my suggestions come from actual use. The Scheme variant that I'm
working on for work started out life as a functional-only system (that is,
no mutable data *at all*), and I ended up implementing string-set and using
it quite a bit. Do I have to rehash the issues of why working with
immutable data can be a big win?
Anyway, the straightforward implementation is simple:
    (define (string-set str k ch)
      (set! str (string-copy str)) ; or substring or whatever
      (string-set! str k ch)
      str)
Careful with that axe, Eugene! Never use SET! unless you really need a
true side-effect. Use LET:
    (define (string-set str k ch)
      (let ((str (string-copy str)))
        (string-set! str k ch)
        str))
and it (I know I harp on this) maintains the overall consistency of the
library. More consistency means easier to learn and easier to understand.
Big win.
I'm still maintaining that you are a freak with strange programming needs,
and that STRING-SET is really an uncommon op. Does anybody besides Dan
want to stand up for a pure-functional STRING-SET ?
I'd actually just as soon drop string-fill! as add string-fill (I don't
think I've ever had a compelling reason to use either), but I'm more in
favor of doing one or the other than leaving the asymmetry. For the
record:
    (define (string-fill str ch)
      (make-string (string-length str) ch))
Uhh... I'll add STRING-FILL if I get more support for it.
>>[issue with string-copy and string-copy! not taking parallel args]
>Yeah, you're right. However, your non-side-effecting STRING-COPY is subsumed
>by the STRING-REPLACE Welsh proposes below. I think I'll leave things as-is.
If by "as-is" you mean dropping the proposal for string-copy! then I'm for
that. If you mean simply leaving your original proposal where the two
procedures take different sets of args, then I'm against that. Again, I'm
not against the particular functionality (which seems useful to me), just
against calling two essentially different procedures by essentially similar
names.
I mean (1) adding STRING-REPLACE, and (2) keeping both my STRING-COPY and
my STRING-COPY!. I recognise the non-parallelism, but do not think it's
a big deal. However,
- I'm open to a better name for STRING-COPY or STRING-COPY!, to break
the bogus parallelism. I've considered STRING-BLT and STRING-MOVE;
don't think they're too good.
- I'm open to being beaten on more by others who want to back Dan up.
>[mismatch index with the (in)equality procedures] It turns out to be a
>handy value to have around if you are comparing strings.
However, requiring it means that implementations are precluded from using
certain short-cut optimizations, in particular, = and <> can't return
quickly based on the length of the arguments. I'm against returning
mismatch indices in the standard (in)equality functions, but do see their
benefit and would be in favor of specifying explicit
mismatch-index-returning procedures, not just because of the above
efficiency tweak but also because they would signal programmer intent. I
don't have a strong opinion about what these functions would be named,
"stringOP-mismatch-index" is an off-the-top-of-my-head suggestion.
string=-mismatch-index
string<-mismatch-index
etc.
Oops. Precluding short-cut optimisations is a bad thing. Hmm.
OK, here's my proposal: the string-length shortcuts are only available
for string= and string<>. So we will back out the mismatch-index
functionality from all the STRING=, STRING<, etc. functions -- they
now only return boolean values.
However, I know of no shortcuts for the STRING-COMPARE procedures that
are precluded by returning mismatch indices. So we'll leave that functionality
in place.
Now programmers can choose what they want.
I *could* have restricted only = and <>, and left <, >, <=, >= alone --
but that seems a little ugly to me.
I have modified the SRFI to reflect this change.
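
To make the trade-off concrete, here is a small sketch of the two kinds of procedure. The names MISMATCH-INDEX and BOOLEAN-STRING= are hypothetical and chosen only for this example; neither is the SRFI's code. The boolean test can answer from the lengths alone, while a procedure that must report a mismatch index always has to scan.

```scheme
;; Hypothetical names; a sketch, not the SRFI reference implementation.

(define (mismatch-index s1 s2)
  ;; Index of the first position where S1 and S2 differ, or the
  ;; length of the shorter string if one is a prefix of the other.
  (let ((end (min (string-length s1) (string-length s2))))
    (let loop ((i 0))
      (if (and (< i end)
               (char=? (string-ref s1 i) (string-ref s2 i)))
          (loop (+ i 1))
          i))))

(define (boolean-string= s1 s2)
  ;; Length check first: strings of unequal length answer #f
  ;; without a single character being examined.
  (and (= (string-length s1) (string-length s2))
       (= (mismatch-index s1 s2) (string-length s1))))

(mismatch-index "hello" "help")   ; => 3
(boolean-string= "abc" "abcd")    ; => #f, decided by length alone
(boolean-string= "abc" "abc")     ; => #t
```
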