An argument against cursors oleg@xxxxxx 02 May 2003 20:16 UTC

This is a non-theoretical argument against cursors and for
enumerators. We argue against cursors (aka iterators) as the primary
means of traversing and obtaining all values of a collection.
Enumerators such as left fold are superior to cursors:
      - in efficiency
      - in ease of programming
      - in more predictable resource utilization and
      avoidance of resource leaks, especially for collections backed
      by external resources such as file handles and database
      connections.

I'd like to remark first about the merits of a proposal to rename
iterators into cursors. I believe the name 'iterator' is confusing: in
C++ it means an accessor, which does not iterate over
collection. Rather, it is _being_ iterated. On the other hand,
languages like OCaml provide methods 'iter' that actively traverse a
collection, applying a user-defined function to elements.  It might
also make sense to use the database terminology to a large extent. A
relational database is the ultimate collection, which can represent
any single collection discussed in SRFI-44.

I would like to state that I admire the comprehensive proposal by
Scott Miller. Collections is a vast area, which I, for one, was afraid
to touch. Without Scott Miller's proposal, we would have nothing to
discuss.

If I may be permitted a philosophical remark: when it comes to
collections and other libraries, perhaps we could borrow more from
languages that resemble Scheme more (such as SML, OCaml, Ruby,
Haskell) than from languages that resemble Scheme less (C++, Java).

There are many theoretical and fundamental reasons why enumerators
such as fold are profoundly better than cursors. This message will not
discuss them. Instead, we concentrate on utterly practical
considerations: performance, easy of programming, and resource
utilization.

Writing a procedure that traverses, say, an AVL tree, in some order is
far easier than writing a cursor that remembers the current position
and can generate the next one. This is especially tricky if the tree
in question does not have parent pointers (which waste space and lead
to sharing and resource leakage problems). One look at the C++ STL
should show how wretched programming of iterators could be.

Cursors are simply less efficient. A cursor must maintain the state of
the traversal. That state may be invalid. That is, each access to the
cursor (to advance it or to read a value) must first verify that the
cursor is in the valid state. The cost of the checks become
exorbitant.

This issue of inefficiency of cursors is well known. For example, it
is highly advisable to use "native" operations to move large sections
of an array (ArrayCopy in Java) rather than copy
element-by-element. The cost of a range check on each access to the
array is one of the considerations.

Databases give another example. It is often said that the key to the
best performance is to do more on the server rather than on the
client. To find the maximum value of a column, it's far faster to
       select MAX(col) from collection
than to open a cursor on the collection and retrieve all values from
'col' searching for the maximum. Likewise, to increment all the values
in col, it's far better to
	update collection set col = col + 1
than to use an updateable cursor. Stored procedure languages (some of
which are quite sophisticated) were introduced to make the server-side
processing easier and powerful.  We should stress that
	select MAX(col) from collection
is precisely equivalent to
	foldl max lower_bound collection

Another benefit of enumerators is the ease of managing resources of
the collection -- especially precious resources.  An enumerator can be
programmed as

	   (define (enum proc collection)
	       (let ((database-connection #f))
		(dynamic-wind
		   (lambda ()
		      (set! database-connection
			    (open-connection collection)))
		   (lambda () (iterate proc))
		   (lambda ()
		      (set! database-connection
		            (close-connection database-connection))))))

If 'proc' does not capture its continuation -- as it is often the case
-- the database connection is opened (taken from the pool) at the
beginning of the iteration and is returned to the pool at the end.  If
we were to expose access to our collection in the form of a cursor, we
would have to place the variable 'database-connection' into that
cursor. We cannot close the database until a cursor is alive. But how
do we know when there are no alive cursors and therefore it is safe to
recycle the database connection? That alone requires a finalizer -- a
non-standard, unsupported and unreliable feature. If that was not
enough, there is another problem: a cursor may stick around (being
referenced from local variables) even if it would not be used
according to the logic of the program. For example:
      (let* ((cursor (open-collection))
	     (elem (cursor)))
	   (+ 1 (long-computation elem)))

We cannot close our database connection until long-computation
finishes.  We contend that SRFI-44's current mandate to use iterators
will cause great resource leakage, especially for collections with an
external (file, database, network connection) backing.

The most general enumerator for a collection is a left fold. It can
handle complex seeds, and the early termination (by invoking a
continuation or throwing an exception). Still, it might be beneficial
to generalize fold to the following

   coll-fold-left PROC COLL SEED SEED ... -> [SEED ... ]

It takes one or more seeds and returns just as many. PROC is a
procedure:
	PROC VAL SEED SEED ... -> [INDIC SEED SEED ...]

It takes n+1 arguments and returns n+1 values. The first argument is
the current value retrieved from the collection. The other n arguments
are the current seeds. The first return value is a boolean indicator:
#t means continue iterating, #f causes the premature termination.
Other return values of PROC are the new values of the seeds.

A left fold with a similar interface has in fact been implemented --
specifically to access databases in a functional way:
	http://pobox.com/~oleg/ftp/Scheme/lib/db-util1.scm
Here's a simple example of its usage:
	http://pobox.com/~oleg/ftp/Scheme/tests/vdbaccess.scm

I am using this database fold in the production environment. See, for
example,
	http://www.metnet.navy.mil/Metcast/Code/gief-serve.scm

It is a really complex example, that uses EXISTS and other subqueries,
table expressions, collection datatypes, geodetic datablade datatypes
and functions, and other frills to make one sick.

The example shows that it is possible to present a functional
interface even to such complex collection as object-relational
databases. Under the hood we can use cursors and ODBC -- and still
present a functional enumerator-based interface to the user.

Finally, if we really need cursors for our collection, we can get them
_for free_: A conversion from an enumerator to a cursor is automatic:
the corresponding procedure was posted on the SRFI-39 mailing list on
Apr 2, 2003. See also
       http://pobox.com/~oleg/ftp/Scheme/enumerators-callcc.html

Incidentally, it might make sense to require the SRFI-44 functions
'collection-values' to return a stream rather than a list. A stream
is, essentially, a realization of a cursor.

Again, it is far more efficient and easy for a programmer to implement
cursors via enumerators, than the other way around.