I have altered the SRFI to make the following changes:

- "prefix-count"/"suffix-count" lexeme changed to "prefix-length"/"suffix-length"

    These                        are now these
    -------------------------    --------------------------
    string-prefix-count          string-prefix-length
    string-suffix-count          string-suffix-length
    string-prefix-count-ci       string-prefix-length-ci
    string-suffix-count-ci       string-suffix-length-ci

    substring-prefix-count       substring-prefix-length
    substring-suffix-count       substring-suffix-length
    substring-prefix-count-ci    substring-prefix-length-ci
    substring-suffix-count-ci    substring-suffix-length-ci

- string-comparison functions now return a simple boolean

    string<>     string=     string<     string>     string<=     string>=
    string-ci<>  string-ci=  string-ci<  string-ci>  string-ci<=  string-ci>=

    substring=   substring<>   substring-ci=   substring-ci<>
    substring<   substring>    substring-ci<   substring-ci>
    substring<=  substring>=   substring-ci<=  substring-ci>=

  Note that these comparison functions still return a mismatch index:

    string-compare       substring-compare
    string-compare-ci    substring-compare-ci

What follows is general discussion and replies to msgs from Oleg & Dan.
    -Olin
-------------------------------------------------------------------------------
From: xxxxxx@pobox.com

    If I may I'd like to propose two more functions, string->integer and
    string-split.

A general STRING-SPLIT is just too complicated for me. Here are the variants
we want to support:
  - Variant grammars, e.g. tolerant-infix, strict-infix, and suffix
  - Optional substring indices
  - Number of fields to parse. You might want
    - as many as exist,
    - exactly N, or error,
    - at least N, or error.
  - Do contiguous runs of delimiter chars make a single delimiter, or do
    they designate empty-string tokens?
Scheme makes it difficult to have different, independent sets of optional
args, since you have to order them.
The "field parser" utilities I wrote for scsh's awk utility (see the scsh
manual) handle all this complexity, and more -- you can specify tokens or
delimiters with general regexps, not just char sets. But this is much hairier
machinery than I feel is appropriate for a basic string library.

I dodged at least the grammar issue with STRING-TOKENIZE by having you specify
not the separator chars but the token chars -- contiguous runs of token chars
make a token. End of story. And I just punted the number-of-fields issue,
leaving only the substring indices as possible optional args, so things worked
out -- given my low ambitions.

    Some problems are more elegantly and efficiently expressed in terms of
    inclusion, some others in terms of delimiting. I found for example that
    in Perl and Python split() is a rather often-used function.

Yeah, perl hackers use split() a lot, for sure. But the char-set SRFI provides
a CHAR-SET:GRAPHIC set, which makes it as easy to use STRING-TOKENIZE to pick
out non-whitespace tokens as it is for perl hackers to use split() to break
tokens at whitespace. So I really think STRING-TOKENIZE is going to take care
of you for the simple cases, and if you've got fancier requirements... then
you probably oughta code up a little custom parser for your app, anyway.

    R5RS procedure string->number is far more generic than the proposed
    string->integer -- and this may be a problem IMHO. For example,
    string->number will try to read strings like "1/2" "1S2" "1.34" and even
    "1/0" (the latter causing a zero-divide error). Note that to Gambit's
    string->number, "1S2" is a valid representation of an _inexact_ integer
    (100 to be precise). Oftentimes we want to be more restrictive about what
    we consider a number; we want merely to read an integral label.

    -- procedure+: string->integer STR START END
        Makes sure a substring of the STR from START (inclusive) till END
        (exclusive) is a representation of a non-negative integer in decimal
        notation.
        If so, this integer is returned. Otherwise -- when the substring
        contains non-decimal characters, or when the range from START till
        END is not within STR, the result is #f.

This is a can of worms. string->integer is undoubtedly useful. But so is
string->floating-point. What about base? Return #f or raise an error on bad
syntax? Bornstein had a nice summary of the complexities involved:

    I don't like this particularly. I can think of a kabillion variants on
    parsing strings into numbers that I might find useful. The one that's
    built-in is the right one since it's about Scheme read form (which you
    gotta implement anyway). The moment you step into the territory of other
    number formats, you should be ready to define a full suite of procedures
    to deal with the plethora of possibilities.

    > [SRFI-13]
    > string-concatenate string-list -> string
    > Append the elements of STRING-LIST together into a single _list_.
    > Guaranteed to return a freshly allocated _list_.

    Did you mean to say a 'string' (instead of a _list_)?

Yes, you are quite correct. Thanks; I've fixed the text.

    SRFI-13 mentions that string-unfold is also called "anamorphism". Do you
    want to point out that a foldr combinator (e.g., string-fold-right) is
    also called a "catamorphism"?

Excellent! Done.

From: Dan Bornstein <xxxxxx@milk.com>

    Olin writes:
    >C'mon. Do you really think that people would use STRING-SET ?
    >STRING-FILL is an easier case to make. Let's see, that would be

    Actually, my suggestions come from actual use. The Scheme variant that
    I'm working on for work started out life as a functional-only system
    (that is, no mutable data *at all*), and I ended up implementing
    string-set and using it quite a bit. Do I have to rehash the issues of
    why working with immutable data can be a big win?

    Anyway, the straightforward implementation is simple:

        (define (string-set str k ch)
          (set! str (string-copy str)) ; or substring or whatever
          (string-set! str k ch)
          str)

Careful with that axe, Eugene! Never use SET!
unless you really need a true side-effect. Use LET:

  (define (string-set str k ch)
    (let ((str (string-copy str)))
      (string-set! str k ch)
      str))

    and it (I know I harp on this) maintains the overall consistency of the
    library. More consistency means easier to learn and easier to
    understand. Big win.

I'm still maintaining that you are a freak with strange programming needs,
and that STRING-SET is really an uncommon op. Does anybody besides Dan want
to stand up for a pure-functional STRING-SET ?

    I'd actually just as soon drop string-fill! as add string-fill (I don't
    think I've ever had a compelling reason to use either), but I'm more in
    favor of doing one or the other than leaving the asymmetry. For the
    record:

        (define (string-fill str ch)
          (make-string (string-length str) ch))

Uhh... I'll add STRING-FILL if I get more support for it.

    >>[issue with string-copy and string-copy! not taking parallel args]
    >Yeah, you're right. However, your non-side-effecting STRING-COPY is
    >subsumed by the STRING-REPLACE Welsh proposes below. I think I'll leave
    >things as-is.

    If by "as-is" you mean dropping the proposal for string-copy! then I'm
    for that. If you mean simply leaving your original proposal where the
    two procedures take different sets of args, then I'm against that.
    Again, I'm not against the particular functionality (which seems useful
    to me), just against calling two essentially different procedures by
    essentially similar names.

I mean (1) adding STRING-REPLACE, and (2) keeping both my STRING-COPY and my
STRING-COPY!. I recognise the non-parallelism, but do not think it's a big
deal. However,
  - I'm open to a better name for STRING-COPY or STRING-COPY!, to break the
    bogus parallelism. I've considered STRING-BLT and STRING-MOVE; don't
    think they're too good.
  - I'm open to being beaten on more by others who want to back Dan up.

    >[mismatch index with the (in)equality procedures] It turns out to be a
    >handy value to have around if you are comparing strings.
    However, requiring it means that implementations are precluded from
    using certain short-cut optimizations, in particular, = and <> can't
    return quickly based on the length of the arguments. I'm against
    returning mismatch indices in the standard (in)equality functions, but
    do see their benefit and would be in favor of specifying explicit
    mismatch-index-returning procedures, not just because of the above
    efficiency tweak but also because they would signal programmer intent.

    I don't have a strong opinion about what these functions would be named,
    "stringOP-mismatch-index" is an off-the-top-of-my-head suggestion.

        string=-mismatch-index
        string<-mismatch-index
        etc.

Oops. Precluding short-cut optimisations is a bad thing. Hmm.

OK, here's my proposal: the string-length shortcuts are only available for
string= and string<>. So we will back out the mismatch-index functionality
from all the STRING=, STRING<, etc. functions -- they now only return boolean
values. However, I know of no shortcuts for the STRING-COMPARE procedures
that are precluded by returning mismatch indices. So we'll leave that
functionality in place. Now programmers can choose what they want.

I *could* have restricted only = and <>, and left <, >, <=, >= alone -- but
that seems a little ugly to me.

I have modified the SRFI to reflect this change.
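
To make the trade-off concrete, here is a rough sketch of the two kinds of
procedure. The names MY-STRING=? and MY-STRING-MISMATCH are hypothetical,
made up for this illustration; they are not the SRFI's procedures and do not
follow its argument conventions. The boolean version can answer from the
lengths alone, while the index-returning version has to scan from the left
to find out where the strings part ways.

```scheme
;; Illustrative sketch only: MY-STRING=? and MY-STRING-MISMATCH are
;; hypothetical names, not part of the SRFI.

;; Boolean-only equality. If the lengths differ we answer #f at once,
;; without looking at a single character.
(define (my-string=? s1 s2)
  (let ((n (string-length s1)))
    (and (= n (string-length s2))          ; the length shortcut
         (let loop ((i 0))
           (or (= i n)
               (and (char=? (string-ref s1 i) (string-ref s2 i))
                    (loop (+ i 1))))))))

;; Index-returning variant. Returns the first index at which the two
;; strings differ, or #f if they are equal. Knowing that the lengths
;; differ does not tell us the index, so we must scan.
(define (my-string-mismatch s1 s2)
  (let ((n1 (string-length s1))
        (n2 (string-length s2)))
    (let loop ((i 0))
      (cond ((and (= i n1) (= i n2)) #f)   ; both exhausted: equal
            ((or (= i n1) (= i n2)) i)     ; one is a proper prefix
            ((char=? (string-ref s1 i) (string-ref s2 i))
             (loop (+ i 1)))
            (else i)))))
```

With these definitions, (my-string=? "abc" "ab") is decided by the length
test alone, whereas (my-string-mismatch "abc" "ab") walks both strings
before reporting index 2.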