Re: shared-text substrings

Show/hide message thread
Re: shared-text substrings Olin Shivers (24 Mar 2000 00:27 UTC)
Re: shared-text substrings sperber@xxxxxx (24 Mar 2000 15:53 UTC)
Re: shared-text substrings Olin Shivers 24 Mar 2000 00:27 UTC
Howdy, everyone. The thing is: I invented (stumbled over?) this new sorting
algorithm for lists. So my recent weekends have been spent doing the proofs
and writing the paper instead of drawing up my replies to the recent posts. My
apologies. Here are my comments on shared-text. Let's move on from here; it's
time to get the Unicode issues resolved and that will be that.

I have also recently been wandering through Java's spec for Unicode strings,
but will put the results from that in a following msg. Once again, I apologise
for the delay in the SRFI-13 progress.

Here are my comments on shared-text string procs and library size.

   Well, sure, anyone can see what you're talking about.  You still don't
   have hard, experimental data that all of these things will make a
   difference in practice.

Yes, true enough. But I cannot sit down and do a careful study; that would be
an enormous effort. I can report from my experience that these things *have*
made a difference in my use -- I can't report on timings, but I certainly have
experience with heap problems.

   Olin> You don't *have* to use these procedures. You can always, in a GC'd,
   Olin> functional language, play it safe and use pure functions. These
   Olin> procedures simply give you a portable way to write efficient code. I
   Olin> like to write efficient code. I have been burned using inefficient
   Olin> code.

   But it's a fallacy that additional functionality in a SRFI hurts
   nobody.  The sheer size of SRFI 13 will make people who need only a
   handful of string ops shrink away from it, and the associated
   complexity will make it harder to handle in practice.

Languages should be small. Libraries should be large. It's about providing a
rich vocabulary for programming. The library writer gets it right once, and no
one ever has to re-invent the wheel. It is not hard to handle this library --
shoot, the *entire* reference implementation was written by one guy in his
spare time. It's only 700 lines of code -- even though it's bummed pretty
tightly for speed, e.g., I replicated loops to speed up special cases. 700
lines doesn't compile down to too many bytes. The only hard thing about it was
the KMP and Boyer-Moore code, which was very painful -- but only a handful of
lines of code. Subtle algorithms requiring careful reasoning. Which makes them
prime candidates for being packaged up once & for all.

As for the complexity of the interface: The library presents the procedures
organised by general function, so that humans can index into it & get the
operation they need. It's carefully designed to have a lot of regular
structure, so that in general, you don't really need to even precisely
remember the name and parameter-passing conventions for a particular operation
-- if you just "make it up" using the general rules, you'll get it right.

Let me tell how I got started writing libraries. I have a friend, Jin Choi.
Jin is an expert scsh programmer, actually has done stuff down in the
sources. One day, Jin told me about some little one-shot text-munging task
he'd done on the log files at his web startup. I asked him what language he'd
used. He replied: perl. I said, "whoa -- why'd you do that? Most commoners
would use perl, but you know a better way. Why didn't you use scsh?" He said,
"You don't understand. I just wanted to read in the lines, pick out a date
field with a regexp, uppercase the month, and write things back out in a
rearranged order."  I said: "Fine, no problem. Scsh has regexps. Scsh has
awk. Scsh has strings.  You need to uppercase a string? Scsh has loops,
CHAR-UPCASE, everything you need. It's a complete programming language." Jin
said -- and here is the point -- "You don't understand. I just wanted to say:
upcase the string.  I didn't want to write a loop." I was quiet for a
second. Then I said: "OK. You're right. I'm wrong." And I went home that
weekend and started writing a string library, so programmers can express
themselves more easily. Jin did not want to write a loop and implement
STRING-UPCASE by hand. What he wanted to do was rearrange his log files.

I don't care that you think a particular set of the procs in the library are
not useful. Someone else out there disagrees with you, and they additionally
think the procs that you like are not useful. Now, I'm only willing to put in
widely useful procedures that help people write efficient, portable
code... but I don't feel it's ultra-important to build a minimalist
library. Such a library would be MAKE-STRING, STRING-SET!, STRING-LENGTH,
STRING-INDEX, and STRING?.  That's not the point. People want to slice
strings. And scan them. And map over them. And search them. And compare
them. I use these operations in my code. I've done a lot of string
programming, because of my involvement with scsh. My code would be better --
clearer, less buggy, easier to read, write and maintain -- if I could rely on
having these operations around.

If I *don't* put these procs in the SRFI, then people will continually
re-invent & reimplement them. The bugs and aggravation are one cost.
Even more irritating is that no two re-inventions will be exactly alike.
They'll be roughly the same thing... but always slightly different: different
parameter-passing conventions, different boundary cases. What a mess. Get
it set down and be done with it. Let's move on to network protocols or
parser generators or URLs or something more interesting.

Guy Steele has a nice talk called "Growing a language."
    http://www.sun.com/research/java-topics/pubs/98-oopsla-growing.ps
I don't want to give away too much about the talk, since it ruins the
paper if you know some things about it in advance (it is quite fun to
read or hear), but it makes the case well that it is very limiting to have to
express things in a restricted "vocabulary."  One of Guy's points is that
languages should allow you to grow the vocabulary. SRFI-13 follows this
notion: we want an expressive, rich vocabulary for programmers, so they
don't spend all their time defining these little building blocks. Small,
extensible languages. Large, powerful libraries.

   I'm not saying shard-text substrings aren't useful.  But there's no
   good reason not to move it to a different SRFI.

SRFI-13 isn't providing shared-text substrings! It has a few calls that allow
users to indicate to the system that results may share storage with
parameters. Scheme implementations that *do* have shared-text substrings can
exploit this.  Scheme implementations that *don't* have shared-text substrings
can ignore it, or exploit a few simpler cases. In *either* case, the code is
portable. Programmers that *do* want this efficiency win can exploit the spec;
programmers that *don't* can use the careful, pure-functional, less-efficient
bindings.

   This SRFI tries to do too many things at once.

This SRFI tries to codify and standardise common, useful string operations,
in a portable and efficient manner. Period. It's a toolkit for string hacking.

   A good indicator is the fact that the discussion period is once again
   exceeding the limit.

The fault lies closer to home: this SRFI has gone over time because I was
unable to work on it during the winter and early spring.
    -Olin