On Tue, Nov 10, 2015 at 2:25 AM, William D Clinger <xxxxxx@ccs.neu.edu> wrote:

On 30 September, Alex Shinn wrote:

    Re: the bound argument, it's a promise that we don't
    need values larger than the bound, allowing the hash
    function to use fixnum-only arithmetic in most cases.  It
    applies not just in special cases like disk-based tables, but
    when the hash table consumes a large enough fraction
    of memory to want a bignum number of buckets.  As
    discussed on the SRFI-125 list, this can happen on
    32-bit architectures with 3 tag bits when the table takes
    around 1/4 of available memory.  It's realistic to want
    a small server where the table takes up nearly 100%
    of memory.

That reasoning fails for three or four separate reasons.

In SRFI 126, the bound argument is not a promise about anything.

Yes, the SRFI 126 specification of bound is not useful.
If specified at all, it should be required, and it should guarantee
that the same bound is used consistently for all elements in any
given table (i.e. the bound changes only on a rehash).
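
For example (just a sketch with made-up names, not proposed SRFI
text), the table would pass its current bucket count as the bound on
every call, so all keys in a given table see one consistent bound,
and that bound changes only when the whole table is rehashed:

  ;; Sketch only: the table supplies its current bucket count as the
  ;; bound, so every key in a given table is hashed with the same
  ;; bound until the table is rehashed.
  (define (table-index hash-function buckets key)
    (hash-function key (vector-length buckets)))

  ;; The only place the bound changes: all keys are rehashed with the
  ;; new bucket count.
  (define (table-rehash hash-function buckets new-size)
    (let ((new-buckets (make-vector new-size '())))
      (vector-for-each
       (lambda (chain)
         (for-each
          (lambda (entry)              ; entry is a (key . value) pair
            (let ((i (hash-function (car entry) new-size)))
              (vector-set! new-buckets i
                           (cons entry (vector-ref new-buckets i)))))
          chain))
       buckets)
      new-buckets))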

Secondly, a 32-bit machine implies buckets occupy at least 8 bytes
(to hold 32-bit representations of a key and a value).  With 3-bit
tags, the largest positive fixnum is likely to be 536870911.
If you had that many 8-byte buckets on a 32-bit machine, you'd
have a grand total of 4 bytes left over for other purposes.

Did you see my computation on the SRFI 125 mailing list which
arrived at the 1/4 of available memory figure?

First, with 3-bit tags the max fixnum is 2^28-1 = 268435455,
half your value, and therefore half the memory, which already
means we have more than enough remaining memory to implement
something like memcached.

Second, while the arguments I made for disk-based tables and
more general hashing are debatable, we should consider the
simple case of hash sets.  If we can't use the same hash
functions for sets as for tables, we've failed in our hash function
specification.  With sets you need no space for values, so that
again halves the space.

With these two factors you have my 1/4 estimate.
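
Spelled out as a quick back-of-the-envelope check (rough figures,
assuming a flat 4 GB address space and 4-byte slots):

  (define max-fixnum (- (expt 2 28) 1))       ; 268435455 with 3 tag bits
  (define set-bytes (* max-fixnum 4))         ; one key slot per bucket, no values
  (exact->inexact (/ set-bytes (expt 2 32)))  ; => ~0.25, i.e. about 1/4 of memory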

You can reduce this estimate further: it's typical to aim for a load
factor of around 75%.  There is also the case of MIT Scheme, which
uses 6-bit tags (presumably such implementations would just be
required to use bignums).

Thirdly, the number of distinct keys in a hash table is of the
same order of magnitude as the number of buckets.  Taking the
storage occupied by those distinct keys into account reduces the
number of buckets even further.

The keys could be fixnums.  Since we can use negative fixnums,
we already have more possible keys than the maximum number
of buckets.

Fourthly, there's no real problem with hash functions that use
bignum arithmetic.  Even in Larceny, whose bignum arithmetic is
notoriously slow, it's hard to find any use of hash tables for
which the use of bignum arithmetic in hash functions would be
significant compared to the overhead of hash tables themselves.

I haven't run any benchmarks, but I think a simple lookup in a
hash table should not cons.
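
To make that concrete, here is the kind of hash function I have in
mind (a sketch only; the multiplier 31 is arbitrary): with an
explicit bound it can reduce modulo the bound at every step, so as
long as 31 * bound fits in a fixnum the whole computation stays in
fixnums and allocates nothing.

  ;; Sketch: a bound-respecting string hash.  Reducing modulo the
  ;; bound at each step keeps intermediate results small, so the
  ;; arithmetic stays fixnum-only whenever 31 * bound is a fixnum.
  (define (bounded-string-hash str bound)
    (let loop ((i 0) (h 0))
      (if (= i (string-length str))
          h
          (loop (+ i 1)
                (modulo (+ (* h 31) (char->integer (string-ref str i)))
                        bound)))))

  ;; e.g. (bounded-string-hash "salmon" 1021) => some fixnum below 1021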

Regardless, if there is no explicit bound then, as I said, the SRFI
would need to clearly specify the implicit bound, presumably the
bignum UINT_MAX.

On 16 October, Alex Shinn wrote:

    eq?/eqv? hash tables with rehashing on gc,
    as in larceny, may be doing more work than
    is needed for some common eq?/eqv? cases.

    If the keys are only symbols, it would be
    reasonable to store a hash value directly in
    the symbol object, or even arrange for all
    symbols to be stationary in the gc.  To allow
    such optimizations, I think we should add:

    make-symbol-hash-table

Alex is assuming the implementation is remarkably stupid, but
Larceny's eq?/eqv? hash tables are not that stupid.

Yes, I already retracted this.

I still think we should include char-hash and number-hash, but
would not argue strongly for it.
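
For concreteness, they could be as trivial as the following (sketches
only; the exact signatures, including whether they take a bound,
would be up to the SRFI):

  (define (char-hash c bound)
    (modulo (char->integer c) bound))

  ;; Naive sketch: handles real numbers only, and conflates numbers
  ;; with the same integer part.
  (define (number-hash n bound)
    (modulo (exact (floor (abs n))) bound))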

-- 
Alex