On Tue, Nov 10, 2015 at 2:25 AM, William D Clinger <xxxxxx@ccs.neu.edu> wrote:

> On 30 September, Alex Shinn wrote:
>
> > Re: the bound argument, it's a promise that we don't need values
> > larger than the bound, allowing the hash function to use fixnum-only
> > arithmetic in most cases. It applies not just in special cases like
> > disk-based tables, but when the hash table consumes a large enough
> > fraction of memory to want a bignum number of buckets. As discussed
> > on the SRFI-125 list, this can happen on 32-bit architectures with
> > 3 tag bits when the table takes around 1/4 of available memory.
> > It's realistic to want a small server where the table takes up
> > nearly 100% of memory.

> That reasoning fails for three or four separate reasons.
>
> In SRFI 126, the bound argument is not a promise about anything.

Yes, the SRFI 126 specification of bound is not useful. If specified
at all, it should be required, and should guarantee that the same
bound is used consistently for all elements in any given table
(i.e. the bound changes only with a rehash).
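One way to read that guarantee: the table passes its current bound to
the hash function, and changing the bound forces a full rehash, so at
any moment every stored element was hashed with the same bound. A
minimal sketch in Python (the names `BoundedTable` and `string_hash`
are hypothetical, not SRFI API; reducing modulo the bound at each step
mirrors how a Scheme implementation could stay in fixnum-only
arithmetic):

```python
def string_hash(s, bound):
    """Hash a string into [0, bound). Reducing modulo bound at every
    step keeps intermediate values small."""
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) % bound
    return h

class BoundedTable:
    """Toy table: one bound per table, changed only by a full rehash."""
    def __init__(self, bound=8):
        self.bound = bound
        self.buckets = [[] for _ in range(bound)]
        self.count = 0

    def set(self, key, value):
        bucket = self.buckets[string_hash(key, self.bound)]
        for entry in bucket:
            if entry[0] == key:
                entry[1] = value
                return
        bucket.append([key, value])
        self.count += 1
        if self.count * 4 > self.bound * 3:   # load factor above 75%
            self._rehash(self.bound * 2)

    def get(self, key, default=None):
        for k, v in self.buckets[string_hash(key, self.bound)]:
            if k == key:
                return v
        return default

    def _rehash(self, new_bound):
        old = [e for b in self.buckets for e in b]
        self.bound = new_bound                # new bound in force...
        self.buckets = [[] for _ in range(new_bound)]
        for k, v in old:                      # ...so re-hash everything
            self.buckets[string_hash(k, new_bound)].append([k, v])
```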

> Secondly, a 32-bit machine implies buckets occupy at least 8 bytes
> (to hold 32-bit representations of a key and a value). With 3-bit
> tags, the largest positive fixnum is likely to be 536870911.
> If you had that many 8-byte buckets on a 32-bit machine, you'd
> have a grand total of 4 bytes left over for other purposes.

Did you see my computation on the SRFI 125 mailing list which gave
the 1/4 of available memory?

First, with 3-bit tags the max fixnum is 2^28-1 = 268435455, half
your value... and half the memory, which already leaves more than
enough remaining memory to implement something like memcached.

Second, while the arguments I made for disk-based tables and more
general hashing are debatable, we should consider the simple case of
hash sets. If we can't use the same hash functions for sets as for
tables, we've failed in our hash function specification. With sets
you need no space for values, which again halves the space.

With these two factors you have my 1/4 estimate.

You can reduce this estimate further: it's typical to want a load
factor around 75%. There is also the case of MIT Scheme, which uses
6-bit tags (presumably it would just be required to use bignums).
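The arithmetic behind the estimate, spelled out (assuming a 4 GiB
address space, 32-bit slots, and the 3-tag-bit fixnum range from
above; the 75% load factor line is the further reduction mentioned):

```python
MEM = 2**32                  # bytes addressable on a 32-bit machine
MAX_FIXNUM = 2**28 - 1       # 3 tag bits plus a sign bit

# Hash table: each bucket holds a 4-byte key and a 4-byte value,
# and a fixnum bucket count allows at most ~2^28 buckets.
table_bytes = (MAX_FIXNUM + 1) * 8
print(table_bytes / MEM)     # 0.5 -- half of memory

# Hash set: keys only, 4 bytes per bucket.
set_bytes = (MAX_FIXNUM + 1) * 4
print(set_bytes / MEM)       # 0.25 -- the 1/4 estimate

# With a 75% load factor, elements overflow the fixnum bucket count
# at an even smaller fraction of memory.
print(set_bytes * 0.75 / MEM)   # 0.1875
```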

> Thirdly, the number of distinct keys in a hash table is of the
> same order of magnitude as the number of buckets. Taking the
> storage occupied by those distinct keys into account reduces the
> number of buckets even further.

The keys could be fixnums. Since we can use negative fixnums, we
already have more possible keys than the maximum number of buckets.
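Concretely, with the 3-tag-bit layout above, fixnums cover the range
[-2^28, 2^28-1], so the fixnum key space alone is twice the maximum
bucket count:

```python
MAX_FIXNUM = 2**28 - 1            # 32-bit word, 3 tag bits, signed
MIN_FIXNUM = -(2**28)
fixnum_keys = MAX_FIXNUM - MIN_FIXNUM + 1   # 2^29 distinct fixnums
print(fixnum_keys)                # 536870912
print(fixnum_keys > MAX_FIXNUM)   # True: more keys than buckets
```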

> Fourthly, there's no real problem with hash functions that use
> bignum arithmetic. Even in Larceny, whose bignum arithmetic is
> notoriously slow, it's hard to find any use of hash tables for
> which the use of bignum arithmetic in hash functions would be
> significant compared to the overhead of hash tables themselves.

I haven't run any benchmarks, but I think a simple lookup in a hash
table should not cons.

Regardless, if there is no explicit bound, then as I said the SRFI
would need to clearly specify the implicit bound, presumably the
bignum UINT_MAX.
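To illustrate why an implicit 32-bit bound leads to bignums on such
systems: a hash computed with 32-bit wraparound routinely produces
values above the fixnum range. A sketch using 32-bit FNV-1a (chosen
here only as a familiar example, not anything the SRFI mandates):

```python
FNV_OFFSET32 = 2166136261    # standard 32-bit FNV-1a parameters
FNV_PRIME32 = 16777619
MASK32 = 2**32 - 1
MAX_FIXNUM = 2**28 - 1       # 3 tag bits on a 32-bit word

def fnv1a_32(data):
    """32-bit FNV-1a: each step is reduced modulo 2^32, i.e. the
    implicit bound is 2^32 -- well above MAX_FIXNUM."""
    h = FNV_OFFSET32
    for b in data:
        h = ((h ^ b) * FNV_PRIME32) & MASK32
    return h

h = fnv1a_32(b"hash-tables")
# On a Scheme with the fixnum range above, any result exceeding
# MAX_FIXNUM would be a bignum; with a 2^32 bound that is the common
# case (roughly 15 results out of 16).
```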

> On 16 October, Alex Shinn wrote:
>
> > eq?/eqv? hash tables with rehashing on gc, as in larceny, may be
> > doing more work than is needed for some common eq?/eqv? cases.
> > If the keys are only symbols, it would be reasonable to store a
> > hash value directly in the symbol object, or even arrange for all
> > symbols to be stationary in the gc. To allow such optimizations,
> > I think we should add:
> >
> >   make-symbol-hash-table

> Alex is assuming the implementation is remarkably stupid, but
> Larceny's eq?/eqv? hash tables are not that stupid.

Yes, I already retracted this.

I still think we should include char-hash and number-hash, but would
not argue strongly for it.

--

Alex