string hash function in SRFI-13
shivers@xxxxxx 16 Dec 2000 03:03 UTC
Brad Lucier pointed out that my reference implementation of string-hash
(and char-set-hash) doesn't match the comments above it. Here's an
explanation of the issue.
The basic idea for the string-hash function (this is not part of the spec;
it's just the particular implementation chosen for the reference
implementation) is taken from Java. We treat the string cn...c0 as
a radix-37 numeral (albeit one where the individual digits can be
larger then 36), and use that for the hash value:
h = (c0 + c1*37 + c2*37^2 + ...) mod bound
where BOUND is the desired range of the hash function. Now, the expression
in parens is going to be a a really big number for a long string; we'd like
to compute H without using a bignum package. We *could* do this:
h := 0
for i = n to 0 do
h := (c[i] + h*37) mod bound
That's a lot of mod operations, however. Instead, why not define MASK
to be a string of 1-bits long enough to "cover" BOUND, that is
mask := 2^(ceiling(lg bound)) - 1
and then just keep these low bits in the intermediate computation, finishing
up with a final mod op:
h := 0
for i = n to 0 do
h := (c[i] + h*37) and mask
h := h mod bound
Great -- that's much more efficient. But it ain't what we originally set out
to compute -- it isn't computing the h value I originally wrote down. That
is what Brad spotted (which I had missed).
The question is: does it matter? Is this more-quickly-computed value
any less good as a hash value?
I don't think so. I have not thought it through carefully, however.
Anybody else have anything else to say?
-Olin