On Tue, Nov 10, 2015 at 2:25 AM, William D Clinger <xxxxxx@ccs.neu.edu> wrote:

> On 30 September, Alex Shinn wrote:
>
> > Re: the bound argument, it's a promise that we don't need values
> > larger than the bound, allowing the hash function to use fixnum-only
> > arithmetic in most cases. It applies not just in special cases like
> > disk-based tables, but when the hash table consumes a large enough
> > fraction of memory to want a bignum number of buckets. As discussed
> > on the SRFI-125 list, this can happen on 32-bit architectures with
> > 3 tag bits when the table takes around 1/4 of available memory.
> > It's realistic to want a small server where the table takes up
> > nearly 100% of memory.

> That reasoning fails for three or four separate reasons.
>
> In SRFI 126, the bound argument is not a promise about anything.

Yes, the SRFI 126 specification of bound is not useful. If specified
at all, it should be required, and should guarantee that the same
bound is used consistently for all elements in any given table
(i.e. the bound changes only with a rehash).
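One way to read that guarantee: the table passes its current bound to
the hash function, and changing the bound forces a full rehash, so at
any moment every stored element was hashed with the same bound. A
minimal sketch in Python (the names `BoundedTable` and `string_hash`
are hypothetical, not SRFI API; reducing modulo the bound at each step
mirrors how a Scheme implementation could stay in fixnum-only
arithmetic):

```python
def string_hash(s, bound):
    """Hash a string into [0, bound). Reducing modulo bound at every
    step keeps intermediate values small."""
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) % bound
    return h

class BoundedTable:
    """Toy table: one bound per table, changed only by a full rehash."""
    def __init__(self, bound=8):
        self.bound = bound
        self.buckets = [[] for _ in range(bound)]
        self.count = 0

    def set(self, key, value):
        bucket = self.buckets[string_hash(key, self.bound)]
        for entry in bucket:
            if entry[0] == key:
                entry[1] = value
                return
        bucket.append([key, value])
        self.count += 1
        if self.count * 4 > self.bound * 3:   # load factor above 75%
            self._rehash(self.bound * 2)

    def get(self, key, default=None):
        for k, v in self.buckets[string_hash(key, self.bound)]:
            if k == key:
                return v
        return default

    def _rehash(self, new_bound):
        old = [e for b in self.buckets for e in b]
        self.bound = new_bound                # new bound in force...
        self.buckets = [[] for _ in range(new_bound)]
        for k, v in old:                      # ...so re-hash everything
            self.buckets[string_hash(k, new_bound)].append([k, v])
```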

> Secondly, a 32-bit machine implies buckets occupy at least 8 bytes
> (to hold 32-bit representations of a key and a value). With 3-bit
> tags, the largest positive fixnum is likely to be 536870911.
> If you had that many 8-byte buckets on a 32-bit machine, you'd
> have a grand total of 4 bytes left over for other purposes.

Did you see my computation on the SRFI 125 mailing list which gave
the 1/4 of available memory?

First, with 3-bit tags the max fixnum is 2^28-1 = 268435455, half
your value... and half the memory, which already leaves more than
enough remaining memory to implement something like memcached.

Second, while the arguments I made for disk-based tables and more
general hashing are debatable, we should consider the simple case of
hash sets. If we can't use the same hash functions for sets as for
tables, we've failed in our hash function specification. With sets
you need no space for values, which again halves the space.

With these two factors you have my 1/4 estimate.

You can reduce this estimate further: it's typical to want a load
factor around 75%. There is also the case of MIT Scheme, which uses
6-bit tags (presumably it would just be required to use bignums).
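The arithmetic behind the estimate, spelled out (assuming a 4 GiB
address space, 32-bit slots, and the 3-tag-bit fixnum range from
above; the 75% load factor line is the further reduction mentioned):

```python
MEM = 2**32                  # bytes addressable on a 32-bit machine
MAX_FIXNUM = 2**28 - 1       # 3 tag bits plus a sign bit

# Hash table: each bucket holds a 4-byte key and a 4-byte value,
# and a fixnum bucket count allows at most ~2^28 buckets.
table_bytes = (MAX_FIXNUM + 1) * 8
print(table_bytes / MEM)     # 0.5 -- half of memory

# Hash set: keys only, 4 bytes per bucket.
set_bytes = (MAX_FIXNUM + 1) * 4
print(set_bytes / MEM)       # 0.25 -- the 1/4 estimate

# With a 75% load factor, elements overflow the fixnum bucket count
# at an even smaller fraction of memory.
print(set_bytes * 0.75 / MEM)   # 0.1875
```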

> Thirdly, the number of distinct keys in a hash table is of the
> same order of magnitude as the number of buckets. Taking the
> storage occupied by those distinct keys into account reduces the
> number of buckets even further.

The keys could be fixnums. Since we can use negative fixnums, we
already have more possible keys than the maximum number of buckets.
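Concretely, with the 3-tag-bit layout above, fixnums cover the range
[-2^28, 2^28-1], so the fixnum key space alone is twice the maximum
bucket count:

```python
MAX_FIXNUM = 2**28 - 1            # 32-bit word, 3 tag bits, signed
MIN_FIXNUM = -(2**28)
fixnum_keys = MAX_FIXNUM - MIN_FIXNUM + 1   # 2^29 distinct fixnums
print(fixnum_keys)                # 536870912
print(fixnum_keys > MAX_FIXNUM)   # True: more keys than buckets
```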

> Fourthly, there's no real problem with hash functions that use
> bignum arithmetic. Even in Larceny, whose bignum arithmetic is
> notoriously slow, it's hard to find any use of hash tables for
> which the use of bignum arithmetic in hash functions would be
> significant compared to the overhead of hash tables themselves.

I haven't run any benchmarks, but I think a simple lookup in a hash
table should not cons.

Regardless, if there is no explicit bound, then as I said the SRFI
would need to clearly specify the implicit bound, presumably the
bignum UINT_MAX.
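To illustrate why an implicit 32-bit bound leads to bignums on such
systems: a hash computed with 32-bit wraparound routinely produces
values above the fixnum range. A sketch using 32-bit FNV-1a (chosen
here only as a familiar example, not anything the SRFI mandates):

```python
FNV_OFFSET32 = 2166136261    # standard 32-bit FNV-1a parameters
FNV_PRIME32 = 16777619
MASK32 = 2**32 - 1
MAX_FIXNUM = 2**28 - 1       # 3 tag bits on a 32-bit word

def fnv1a_32(data):
    """32-bit FNV-1a: each step is reduced modulo 2^32, i.e. the
    implicit bound is 2^32 -- well above MAX_FIXNUM."""
    h = FNV_OFFSET32
    for b in data:
        h = ((h ^ b) * FNV_PRIME32) & MASK32
    return h

h = fnv1a_32(b"hash-tables")
# On a Scheme with the fixnum range above, any result exceeding
# MAX_FIXNUM would be a bignum; with a 2^32 bound that is the common
# case (roughly 15 results out of 16).
```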

> On 16 October, Alex Shinn wrote:
>
> > eq?/eqv? hash tables with rehashing on gc, as in larceny, may be
> > doing more work than is needed for some common eq?/eqv? cases.
> > If the keys are only symbols, it would be reasonable to store a
> > hash value directly in the symbol object, or even arrange for all
> > symbols to be stationary in the gc. To allow such optimizations,
> > I think we should add:
> >
> >   make-symbol-hash-table

> Alex is assuming the implementation is remarkably stupid, but
> Larceny's eq?/eqv? hash tables are not that stupid.

Yes, I already retracted this.

I still think we should include char-hash and number-hash, but would
not argue strongly for it.

--

Alex