Re: should hash function upper bound be exclusive?

Show/hide message thread

should hash function upper bound be exclusive? Per Bothner (09 Aug 2005 17:31 UTC)

Re: should hash function upper bound be exclusive? bear (10 Aug 2005 00:14 UTC)

Re: should hash function upper bound be exclusive? Per Bothner (10 Aug 2005 04:52 UTC)

Re: should hash function upper bound be exclusive? Panu Kalliokoski (10 Aug 2005 06:36 UTC)

Re: should hash function upper bound be exclusive? Per Bothner 10 Aug 2005 04:52 UTC

bear wrote:
> On the other hand, the largest fixnum is likely to be either a power
> of two, or one less than a power of two; as hashes are conventionally
> implemented (via modulus operations) these are poor choices for a hash
> range boundary.  Numerical theory indicates  that a much better choice
> is the highest prime fixnum representable, or at least a fixnum with
> no small factors.

I'm afraid I don't see your point.  A "default bound" is a generic
hash value that has a larger range than indexes for a hash table,
which means you *will* be doing a modulo operation to convert the
generic hash value to an index.

Yes, some people recommend that that *the number of buckets* be a
prime number, but I don't how well-founded that is.  If the hash
function is properly "random", does it matter?  This faq:
http://burtleburtle.net/bob/hash/hashfaq.html says:
   Anything with a modulo prime at the end is probably bad; a decent
   hash function could use a power of two and wouldn't need to rely on
   the modulo prime to further mix anything.

> Otherwise you run into rehash problems where the
> system may be able to find no valid rehash value even though the table
> is not full.

If by "rehash" you mean linear probing (secondary hashing, IIRC),
then obviously the sequence of probes should be exhaustive,
and one easy way to do that is to have an increment that is
relatively prime to the number of buckets.  Howwever, if each bucket
contains a chain (as in the referernce implementation) that's
not an issue.

If by "rehash" you mean resizing (increasing the number of buckets),
I still don't see the relevance.

> Definitely; when you have the capability of resizing the table, you
> must be able to rehash everything in the table.  Usually you can make
> that much faster by caching the hashes.  Caching the hashes also makes
> it much faster (in the average case) to check and make sure that you
> are looking at an actual *match* rather than something else that
> hashed into the same bucket.

If we can resize (and don't want to recalculate hashes when resizing)
then the cached hash needs to be table-size-independent.  It should use
an implementation-tuned range, and a reasonable choice matches the
fixnum range (or the non-negative fixnum range).  It also makes
sense that this range match the default bound when the hash functions
are called with only one parameter.
--
	--Per Bothner
xxxxxx@bothner.com   http://per.bothner.com/