string hash function in SRFI-13 shivers@xxxxxx 16 Dec 2000 03:03 UTC

Brad Lucier pointed out that my reference implementation of string-hash
(and char-set-hash) doesn't match the comments above it. Here's an
explanation of the issue.

The basic idea for the string-hash function (this is not part of the spec;
it's just the particular implementation chosen for the reference
implementation) is taken from Java. We treat the string cn...c0 as
a radix-37 numeral (albeit one where the individual digits can be
larger then 36), and use that for the hash value:

    h = (c0 + c1*37 + c2*37^2 + ...) mod bound

where BOUND is the desired range of the hash function. Now, the expression
in parens is going to be a a really big number for a long string; we'd like
to compute H without using a bignum package. We *could* do this:

    h := 0
    for i = n to 0 do
      h := (c[i] + h*37) mod bound

That's a lot of mod operations, however. Instead, why not define MASK
to be a string of 1-bits long enough to "cover" BOUND, that is
   mask := 2^(ceiling(lg bound)) - 1
and then just keep these low bits in the intermediate computation, finishing
up with a final mod op:

    h := 0
    for i = n to 0 do
      h := (c[i] + h*37) and mask
    h := h mod bound

Great -- that's much more efficient. But it ain't what we originally set out
to compute -- it isn't computing the h value I originally wrote down. That
is what Brad spotted (which I had missed).

The question is: does it matter? Is this more-quickly-computed value
any less good as a hash value?

I don't think so. I have not thought it through carefully, however.

Anybody else have anything else to say?
    -Olin