An argument for string-split

An argument for string-split oleg@xxxxxx 18 Nov 1999 20:37 UTC
Olin,

	I'm sorry to keep on beating a dead horse. After this message
I won't say a word about string-split, I promise.

	Suppose we have a string 'str' consisting of tokens separated
by a #\: character. We can extract the tokens using either

	(string-tokenize str
		(char-set-difference char-set:full (char-set #\:)))
or
	(string-split str (char-set #\:))

the two procedure calls above are indeed roughly equivalent;
therefore, a String library should define only one of them. Let's
assume however that the string 'str' in question is in Unicode. It's
not that far-fetched; Gambit and Kawa for example support Unicode. It
appears that the two above procedures would not be equivalent as far
as computing resources and efficiency are concerned. Gambit, for
example, implements char-sets internally to parse input. Low-ascii
characters are marked in a vector, while other characters are in a
list. In this implementation, char-set:full, or char-set:full without
one character are quite large (given the full Unicode). It may take
quite a while to do a membership test in such a set. In contrast,
(char-set #\:) is far smaller and more efficient. It indeed appears
that some problems lend themselves to delimiter-based parsing while
the others do to the inclusion semantics.

	I agree with you that a general string-split is just too
complex. There are many things we may want to do, including splitting
by regular expressions. However, I'd like to define string-split
_only_ like:

	string-split STRING CHARSET [MAXSPLIT]
	string-split-ws STRING [MAXSPLIT]

The generic string-split does not disregard leading, trailing and
repeating delimiters. Unless MAXSPLIT is specified, the resulting
list always contains as many elements as there are delimiters in the
STRING, plus one. For example,
	(string-split ":+" (char-set #\: #\+)) ==> ("" "" "")
string-split-ws splits by whitespaces. It trims leading, trailing and
repeating whitespaces:
	(string-split-ws " abc\n\r\n\rd e f\t  ") ==>
		("abc" "d" "e" "f")

Note that the MAXSPLIT argument serves the same purpose as the END
range delimiter. For example, string-tokenize accepts 'start' and
'end' optional arguments; in which case it operates on a substring of
the original string. The MAXSPLIT argument of string-split likewise
constraints parsing only to a part of the original string. It
appears to me however that MAXSPLIT is slightly more convenient in the
context string-split is commonly used; you don't need to scan the
string to find the location where you want the parsing
stop. Furthermore, the MAXSPLIT argument can easily solve some of the
problems you posed:

>    - Number of fields to parse. You might want
>             - as many as exist,
	==> omit MAXSPLIT
>             - exactly N, or error,
>             - at least N, or error.
	==> set MAXSPLIT to N+1 and then count the elements in the
	resulting list. It's up to you to report an error.

BTW, string-split-ws can easily be overloaded with a generic
string-split. I agree "Scheme makes it difficult to have different,
independent sets of optional args, since you have to order
them". However, we can use types of arguments to disambiguate a
call. For example, MAXSPLIT must be an integer.

	If someone finds string-split useful for his particular
problem, I'm happy. If he finds that his requirements are more complex
-- then he probably ought to use string-tokenize, or a more general
parsing engine. Isn't that the approach of string-tokenize?

	BTW, the same approach -- define the most generic and a few
most common procedures -- applies equally to string->integer.

> This is a can of worms. string->integer is undoubtedly useful. But so
> is string->floating-point. What about base? Return #f or raise an
> error on bad syntax?

Well, if someone wants to get a floating-point number, or deal with
base, bitstrings etc., he should use string->number. string->integer
is only to parse a string of digits. Period. It's a one-trick
pony. string->integer never raises any error: if it does not like
something it returns #f.
Speaking of string->number, do you want to make it accept (start, end)
arguments? This could be quite convenient...

	Why is string->integer singled out, of all the possible
string->xxx-number?
	- Because, IMHO, dealing with a sequence of digits
is quite common;
	- Because this conversion can be made rather efficient
(base 10 can be hardwired; multiplication by 10 can be done with two
shifts and an addition).
	- Because it keeps one from surprises. For example, I had to deal
with forms made of groups of digits. Occasionally some other characters
may crop up; in which case I had to report an error intelligently.
I wrote
	(let ((val (string->number token)))
	  (and (integer? val) (do-deal-with val)))
Imagine my surprise when the token happened to be "1S2" yet the val passed
the integer? test. Eventually the faulty token triggered another check
and the error got reported, but recovery became more complex. Also,
imagine what happens given the token "1/0". You're either at the mercy
of your implementation's exception system, or you have to prescan the
token to make sure it is indeed made of digits.

	Sorry for ranting.