hash functions should not be required to accept two arguments William D Clinger 09 Nov 2015 17:25 UTC

SRFI 126 removes one prominent flaw in SRFI 69 (exposure of
hash functions compatible with the eq? and eqv? procedures)
but preserves an even more prominent flaw: the requirement
that all hash functions accept a second optional argument.

SRFI 126 also adds various features, some of them optional.
In my opinion, those added features are considerably less
important than the disadvantage imposed by requiring hash
functions to accept a second argument.

SRFI 126 states:

    A hash function is a procedure that maps keys to exact
    integer objects.

That is what a hash function should be.  That is not, however,
what the hash functions specified by SRFI 126 actually are.
Several paragraphs later, SRFI 126 tells the truth:

    A hash function is a procedure that must accept a key and a
    bound as an argument and should return a non-negative exact
    integer object.

That second argument "signifies that the function need not
return an integer greater than bound, although it may do so."

In plain English, that means the second argument to a hash
function is worse than useless.  That second argument has the
effect of
converting every hash function h into a family of one-argument
hash functions indexed by the second argument of h.  There is
no reason for clients or implementors of SRFI 126 to think
(h x 0) has anything to do with (h x 1), and so on.
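
To make that concrete, here is a sketch of a hash function that
is legal under the wording quoted above, yet whose behavior at
one bound has nothing to do with its behavior at another.  (The
procedures hash-a and hash-b are hypothetical one-argument hash
functions, not part of any SRFI.)

    ;; A legal SRFI 126 hash function (sketch).  Nothing in the
    ;; specification relates its result at one bound to its result
    ;; at another, or to its result when no bound is passed.
    (define (h key . maybe-bound)
      (if (and (pair? maybe-bound) (even? (car maybe-bound)))
          (hash-a key)    ; one hash function for even bounds
          (hash-b key)))  ; an entirely different one otherwise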

So far as I can see, this needless indexing of the hash function
provides no benefits whatsoever.  In particular, the specific
reasons that some people have advanced for adding the second
argument do not hold up once you realize the second argument
can be (and often will be) ignored.

Clients of SRFI 126 will be required to use and to write hash
functions that accept that useless second argument, even though
implementations of SRFI 126 may never pass a second argument to
any hash function.  Indeed, the higher-quality implementations
of SRFI 126 will be the ones least likely to pass a second
argument to any hash function, just as Larceny's implementation
of SRFI 69 never passes a second argument to any hash function.
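
Concretely, a client's hash function ends up looking something
like the following sketch, where the rest parameter exists only
to satisfy the calling convention.  (This assumes an R6RS-style
one-argument string-hash; the name my-symbol-hash is invented
for illustration.)

    ;; Sketch: a client-written hash function for symbol keys.
    ;; The rest parameter accepts the bound required by SRFI 126
    ;; and then ignores it.
    (define (my-symbol-hash key . ignored-bound)
      (string-hash (symbol->string key)))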

In the other direction, implementations of SRFI 126 that do
pass a second argument to some hash functions must take care
to pass exactly the same second argument in every call to those
hash functions, unless they're ready and willing to rehash every
association in the table.
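
The hazard is visible in a sketch of the bucket computation
(hypothetical names throughout):

    ;; Sketch of why the bound must never change: if a table
    ;; derives bucket indices from (hash key bound), then a lookup
    ;; after a resize must either pass the original bound or
    ;; rehash every association, because (hash key old-bound) and
    ;; (hash key new-bound) may be unrelated.
    (define (bucket-index hash key bound nbuckets)
      (modulo (hash key bound) nbuckets))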

So requiring hash functions to accept a second argument increases
the complexity for both implementations and clients without
providing any identifiable benefit.

****************************************************************

Some unrelated issues:

On 10 September, Taylan asked:

    Do you think it would hinder adoption that the reader syntax is
    mandatory, or lead to other problems?

Yes, the mandatory reader syntax makes SRFI 126 much less likely
to be supported by implementations.  We have seen that happen with
several previous SRFIs (unrelated to hash tables).

On 13 September, John Cowan wrote:

    In any case, nobody is going to randomize the order:
    it is in fact deterministic for any particular set of keys,
    it's just that the ordering you get is complicated enough
    that users treat it as unpredictable.

The timing of garbage collections is not very deterministic, and
garbage collections can change the internal sizes and orderings
of hash tables.  That is especially likely for hash tables that
implement weak keys or similar complexities.

On 30 September, Alex Shinn wrote:

    Re: the bound argument, it's a promise that we don't
    need values larger than the bound, allowing the hash
    function to use fixnum-only arithmetic in most cases.  It
    applies not just in special cases like disk-based tables, but
    when the hash table consumes a large enough fraction
    of memory to want a bignum number of buckets.  As
    discussed on the SRFI-125 list, this can happen on
    32-bit architectures with 3 tag bits when the table takes
    around 1/4 of available memory.  It's realistic to want
    a small server where the table takes up nearly 100%
    of memory.

That reasoning fails for four separate reasons.

First, in SRFI 126, the bound argument is not a promise about
anything.
It's nothing more than an extra argument the hash procedure may
use (if it so chooses) to select a different hash function than
it would have used had a different second argument been passed,
or no second argument at all.

Secondly, a 32-bit machine implies buckets occupy at least 8 bytes
(to hold 32-bit representations of a key and a value).  With 3-bit
tags, the largest positive fixnum is likely to be 536870911.
If you had that many 8-byte buckets on a 32-bit machine, you'd
have a grand total of 8 bytes left over for other purposes.
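
A quick check of that arithmetic at any Scheme REPL:

    (* 536870911 8)                  ; => 4294967288
    (- (expt 2 32) (* 536870911 8))  ; => 8 bytes left over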

Thirdly, the number of distinct keys in a hash table is of the
same order of magnitude as the number of buckets.  Taking the
storage occupied by those distinct keys into account reduces the
number of buckets even further.

There is no realistic scenario in which you'd need a number of
buckets that exceeds the fixnum range that's available with 3-bit
tags.

Fourthly, there's no real problem with hash functions that use
bignum arithmetic.  Even in Larceny, whose bignum arithmetic is
notoriously slow, it's hard to find any use of hash tables for
which the use of bignum arithmetic in hash functions would be
significant compared to the overhead of hash tables themselves.
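
For illustration, here is a sketch (plain R7RS Scheme, nothing
Larceny-specific) of a string hash that reduces modulo 2^32 at
each step.  On a 32-bit system its intermediate values are at
worst small bignums, and that cost disappears into the table's
own overhead.

    ;; Sketch: a string hash reduced modulo 2^32 at each step.
    ;; On a 32-bit Scheme the intermediates are at worst small
    ;; bignums; the claim above is that this cost is lost in the
    ;; noise of the hash table's own bookkeeping.
    (define (string-hash-32 s)
      (let loop ((i 0) (h 0))
        (if (= i (string-length s))
            h
            (loop (+ i 1)
                  (modulo (+ (* h 31)
                             (char->integer (string-ref s i)))
                          4294967296)))))  ; 2^32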

On 16 October, Alex Shinn wrote:

    eq?/eqv? hash tables with rehashing on gc,
    as in larceny, may be doing more work than
    is needed for some common eq?/eqv? cases.

    If the keys are only symbols, it would be
    reasonable to store a hash value directly in
    the symbol object, or even arrange for all
    symbols to be stationary in the gc.  To allow
    such optimizations, I think we should add:

    make-symbol-hash-table

Alex is assuming the implementation is remarkably stupid, but
Larceny's eq?/eqv? hash tables are not that stupid.  Larceny's
symbols do contain their own hash values.  Symbols, booleans,
numbers, and various other things whose eq?/eqv? hash values
aren't affected by a gc are not rehashed following a gc.

Making an API more complicated just because it is possible to
imagine an implementation that implements the API stupidly is
not good design.  The more complex the API, the more likely it
is to be implemented stupidly.

Will