Re: must SRFI-128 mustard be taken seriously?

Re: must SRFI-128 mustard be taken seriously? William D Clinger 11 May 2016 11:45 UTC
John Cowan wrote:

> > I'm basically saying the same argument holds for circular lists
> > as well as for vectors.  The argument would also hold for any
> > other structures that might be circular, such as procedures
> > whose representations involve circularity (as is often the case
> > because that's a standard way to implement recursion and scope).
>
> In that case, all the objects affected are out of scope for SRFI 128,
> and you can do whatever you like: the rules given in the SRFI don't apply.

The problem is that I can't do whatever I like on vectors and
lists until I have determined that the argument to default-hash
is circular, which is an expensive determination.  Making the
default-hash procedure even slower on most structured arguments
is not a realistic option.

By the way, I was able to get the sample implementation of SRFI 128
to run to completion on my benchmark by using Larceny's regional
collector, which is more resilient against near-worst-case storage
allocation.  Here are the numbers measured over all bytevectors,
non-circular vectors, and non-circular lists found within Larceny's
R7RS heap:

collisions (%)   rms/bucket     nanoseconds/hash

bytevectors
       62.0%       31.10            1209        R6RS equal-hash
       62.0%       31.10            1206        SRFI 126 equal-hash
       55.9%       31.06          251213        SRFI 128 default-hash
       55.9%       31.06          241068        SRFI 125 hash-by-identity

vectors
       36.9%        7.04             705        R6RS equal-hash
       36.9%        7.04             719        SRFI 126 equal-hash
       31.2%        7.02          430671        SRFI 128 default-hash
       31.2%        7.02          436762        SRFI 125 hash-by-identity

lists
       57.9%       20.79             552        R6RS equal-hash
       57.9%       20.77             700        SRFI 126 equal-hash
       48.7%       29.06          374208        SRFI 128 default-hash
       48.7%       29.06          385643        SRFI 125 hash-by-identity

The first column shows the percentage of hash collisions, but the
second column is more important.  That second column shows the
root mean square number of objects per bucket when the number of
buckets is equal to the number of objects.  (If those numbers are
higher than you'd expect, please remember that all empty vectors
and bytevectors must have the same hash code, and there are multiple
copies of many other short bytevectors, vectors, and lists lying
around in Larceny's R7RS heap.)

As can be seen from the third column, the SRFI 128 and SRFI 125
sample implementations are two to three decimal orders of magnitude
slower than SRFI 126 without delivering any meaningful improvement.
By not guaranteeing termination on all arguments, the hash functions
specified by SRFI 128 and SRFI 125 actually deliver *less* at a far
higher cost.

Some of the poor performance you're seeing in that table may be
due to the way SRFI 128's sample implementation is written, but
a lot of that poor performance is attributable to the SRFI 128
mustard.  I didn't realize the mustard was so bad until I tried
to fix the performance problem and realized the obvious fix was
blocked by SRFI 128 mustard.

> > If that's the intent, then the SRFI 128 mustard was intended to
> > preclude efficient implementation of an always-terminating version
> > of default-hash, because the standard way to achieve an efficient
> > always-terminating hash function looks at only a bounded portion
> > of its argument.
>
> Do I understand correctly that when given a non-circular vector, the
> SRFI 126 hash procedure examines only at most a fixed number of elements?

You are conflating specification with implementation.  SRFI 126
does not contain any mustard that requires its equal-hash procedure
to examine every part of its argument.  SRFI 126 allows equal-hash
to examine only part of its argument (but does not require that
behavior).  SRFI 126 does require equal-hash to terminate on all
arguments, however, and the standard way to accomplish that is
to hash on only a bounded portion of the argument.

As a matter of current practice, most (if not all) implementations
of (rnrs hashtables) export an implementation of equal-hash that
examines only a bounded portion of any argument that might be
circular.  The sample implementation of SRFI 126 defines equal-hash
to be the same as R6RS equal-hash whenever the (rnrs hashtables)
library is available.  (It ought just to re-export that binding,
but that's a different matter.)

That means there are some systems in which the sample implementation
of SRFI 126 defines an equal-hash that examines only a bounded
portion of arguments that might be circular, but there might also
be a few systems in which the sample implementation of SRFI 126
defines an equal-hash that uses the much slower technique of
examining every element of every nested vector, using hash tables
to avoid non-termination.

Of course, there might also be some systems that try to use
the sample implementation of SRFI 126 on top of the sample
implementation of SRFI 69, yielding an equal-hash that doesn't
always terminate, but that implementation technique is forbidden
by SRFI 126 mustard:  "Like equal?, the equal-hash procedure must
always terminate, even if its arguments contain cycles."

> That seems to me problematic.  The original version of Java's String#hash
> function examined at most 16 characters of the string being hashed,
> which tended to cluster results badly.  Consequently, Java 1.2 replaced
> it with a function that examined all the characters, essentially similar
> to the SRFI 128 sample implementation.  Undoubtedly the new algorithm
> was slower in many cases, but it delivered better results.

Strings can't be circular, so SRFI 126 equal-hash can't get into
trouble by examining every element of a string.  Examining only
16 characters of a string is undoubtedly problematic, but basing
the hash upon some bounded number of characters (say 128) should
work pretty well and would run faster on extremely large strings.
For proof, see the numbers for bytevectors in the table above:
Larceny's implementation of R6RS and SRFI 126 equal-hash examines
only a bounded number of a bytevector's elements (roughly 35 IIRC),
but the quality of its hashes is clearly on par with the hashes
computed by the sample implementations of SRFI 128 and 125.
That's beside the point, of course, because there is no danger of
non-termination when hashing strings or bytevectors.

If you're using Java specifications as precedent, you should note
that Java does not require its hashCode() methods to examine every
part of the object being hashed, and does not require the values
returned by x.hashCode() to be a mathematical function of the hash
values of x's components.  So far as I know, that mustard is unique
to SRFI 128.

I guess it's too late to repair SRFI 128, but I would urge Working
Group 2 to avoid these mistakes in R7RS-large.

We might still have time to repair SRFI 125.  Instead of strongly
encouraging use of SRFI 128 comparators, SRFI 125 could strongly
encourage use of SRFI 1xx comparators, where SRFI 1xx is a SRFI
not yet written that would be almost identical to SRFI 128 but
omit the problematic mustard.  Another alternative is for SRFI 125
to deprecate the problematic mustard of SRFI 128, but that might
be going too far in the direction of trying to make substantive
changes to a finalized SRFI.

Will