Re: Wording of the rationale Aubrey Jaffer 12 Nov 2006 23:40 UTC

 | Date: Tue, 07 Nov 2006 13:49:06 +0100
 | From: Jens Axel Søgaard <xxxxxx@soegaard.net>
 |
 | I am a little unhappy about the wording of the rationale:
 |
 |      RATIONALE
 |
 |      When SRFI 32 Sort Libraries was withdrawn, it had 28
 |      procedures for what should have been a clean, simple
 |      interface.
 |
 | In my view SRFI-32 *is* clean and simple.  Comparing the number 28
 | to the number of procedures defined in SRFI-95 is unfair, since
 | SRFI-95 covers the "What"-part but not the "How to"-part of
 | SRFI-32.
 |
 |      These 28 arose for several reasons:
 |
 |         * List and Vector versions were separated, even though
 |         Scheme has manifest types which eliminate ambiguity as to
 |         the caller's intentions.
 |
 | Separating list and vector functions is traditional.  In R5RS we
 | have, for example, list-ref, vector-ref, and string-ref instead of
 | a general ref.  One benefit of this choice is that compilers get a
 | fair chance to generate efficient code.

LIST-REF and VECTOR-REF are primitive operations which compilers can
inline.  It is conceivable that some really advanced compiler could
inline merge, but merge is already restricted to lists!

Sort is a more complicated operation than merge.  The cost of the
single type predicate conditional to detect array versus list is going
to be significant only when sort is called with a trivially short
argument.  It is reasonable to expect that library routines will be
slower than inline code in these degenerate cases.
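
To make the type dispatch concrete, here is a minimal, self-contained
sketch; it is not the SRFI-95 reference implementation, and
MERGE-LISTS, MERGE-SORT, and GENERIC-SORT are illustrative names
only.  The vector case simply round-trips through a list:

  ;; Stable two-way merge: ties are taken from the first argument.
  (define (merge-lists a b less?)
    (cond ((null? a) b)
          ((null? b) a)
          ((less? (car b) (car a))
           (cons (car b) (merge-lists a (cdr b) less?)))
          (else
           (cons (car a) (merge-lists (cdr a) b less?)))))

  ;; Stable O(n log(n)) list sort; splits with a slow/fast walk.
  (define (merge-sort lst less?)
    (if (or (null? lst) (null? (cdr lst)))
        lst
        (let loop ((slow lst) (fast (cdr lst)) (left '()))
          (if (or (null? fast) (null? (cdr fast)))
              (merge-lists
               (merge-sort (reverse (cons (car slow) left)) less?)
               (merge-sort (cdr slow) less?)
               less?)
              (loop (cdr slow) (cddr fast) (cons (car slow) left))))))

  ;; One type test picks the path; ERROR is SRFI-23 / R7RS.
  (define (generic-sort seq less?)
    (cond ((vector? seq)
           (list->vector (merge-sort (vector->list seq) less?)))
          ((or (null? seq) (pair? seq))
           (merge-sort seq less?))
          (else (error "sort: expected a list or a vector" seq))))

  ;; (generic-sort '(3 1 2) <)        => (1 2 3)
  ;; (generic-sort (vector 3 1 2) <)  => #(1 2 3)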

 |         * Four varieties of algorithms were provided (quick, heap,
 |         insert, merge) even though quick, heap, and insert sorts
 |         have no significant advantage over merge-sort.
 |
 | First: The choice between these algorithms was in the "How
 | to"-part of SRFI-32.  If a user doesn't care which algorithm is
 | used, he can use list-sort! or vector-sort!.
 |
 | Second: What does "no significant advantage" mean?

It means that their asymptotic complexity and functional properties
are no better than merge-sort's.

 | I was under the impression that the "hidden constants" of the
 | O(n log(n)) of vector-quick-sort were smaller than those of
 | vector-merge-sort.

Is that from reading SRFI-32?

SRFI-95 does not mandate merge-sort.  It mandates that the sort
algorithm be stable and that asymptotic time for SORT be bounded by
O(n log(n)).
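
Stability means that elements which compare equal keep their original
relative order.  For instance, assuming a SORT with SRFI-95's
(sort sequence less?) calling convention:

  ;; The two 1-keyed pairs and the two 2-keyed pairs must keep their
  ;; original relative order.
  (define records '((2 . x) (1 . y) (2 . z) (1 . w)))

  (sort records (lambda (a b) (< (car a) (car b))))
  ;; A stable sort must return ((1 . y) (1 . w) (2 . x) (2 . z));
  ;; an unstable sort is free to return the equal-keyed pairs in
  ;; either order.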

 | Third: The different algorithms have different memory requirements.
 | If very large vectors are sorted, it makes sense to choose a slower
 | algorithm that allocates very little extra memory.  Some versions
 | of vector-merge-sort! allocate a helper array the size of the
 | original.

http://en.wikipedia.org/wiki/Sorting_algorithm has a table of
properties for sort algorithms.  "In-place merge sort" is shown as
stable, O(n log(n)) time, and using no additional memory.

SRFI-95's requirement that the KEY argument be called at most once per
element necessitates allocation of O(n) storage for SORT when a KEY
argument is supplied.  The current reference implementation always
allocates that storage; I will clone its routine so that the no-KEY
case is handled efficiently.
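
The usual way to meet the call-KEY-once requirement is a
decorate-sort-undecorate pass: pair each element with its key, sort
the pairs, and strip the keys off again.  The list of pairs is the
O(n) storage in question.  A sketch only, not the reference code;
SORT-BY-KEY is an illustrative name, and MERGE-SORT is the stable
list sort outlined earlier:

  ;; KEY is applied exactly once per element, at the price of an
  ;; O(n) list of (key . element) pairs.
  (define (sort-by-key lst less? key)
    (map cdr
         (merge-sort (map (lambda (x) (cons (key x) x)) lst)
                     (lambda (a b) (less? (car a) (car b))))))

  ;; (sort-by-key '("ccc" "a" "bb") < string-length)
  ;;   => ("a" "bb" "ccc"), with STRING-LENGTH called 3 times.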

 | Also: Is merge-sort the fastest algorithm if the elements are
 | "almost sorted"?
 |
 |        * Sort and stable-sort varieties were included, even though
 |        an unstable sort has no performance advantage.
 |
 | As I see it, the reason they are in SRFI-32 is not efficiency but
 | completeness.  Nevertheless, if vector-quick-sort! (which is
 | unstable) is faster than vector-merge-sort! (which is stable) then

"Faster" _is_ about efficiency, isn't it?

 | it makes sense to let vector-sort! default to vector-quick-sort!
 | and let vector-stable-sort! default to vector-merge-sort! in the
 | name of efficiency.

General purpose software libraries are about simplicity and ease of
use, not theoretical perfection in algorithm design.

A library should be specified so that its routines will perform
moderately well for moderately sized inputs in the vast majority of
applications.  Having many variants has disadvantages:

 * When there are only a few paths through the code, the code gets
   thoroughly tested and its behavior well understood.  When there are
   many paths, most of the code is not well tested and not well
   understood.

 * Choosing algorithms optimally requires nearly as much
   understanding as writing them; most users don't have that
   understanding.

 * A module with too many functions and voluminous documentation
   scares off the typical user who just wants to sort a 50-element
   list; such a user then goes searching for any old sort algorithm
   and reinvents the wheel.

 * If some of the default sorts are unstable, then users will be
   surprised that sorting their data twice results in different
   orders; or that vector and list sorts return different orders, as
   the example below shows.
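
For instance, the common idiom of sorting twice, first by a secondary
key and then by a primary key, only works when each pass is stable.
Assuming a stable SORT with SRFI-95's (sort sequence less?) calling
convention:

  ;; The second pass must not disturb the name order established by
  ;; the first pass among entries with equal ages.
  (define people '((carol . 34) (dave . 25) (alice . 34) (bob . 25)))

  (define by-name
    (sort people (lambda (x y)
                   (string<? (symbol->string (car x))
                             (symbol->string (car y))))))

  (define by-age-then-name
    (sort by-name (lambda (x y) (< (cdr x) (cdr y)))))
  ;; Stable: ((bob . 25) (dave . 25) (alice . 34) (carol . 34)).
  ;; An unstable second pass could return bob/dave or alice/carol in
  ;; either order, which is exactly the surprise described above.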