Re: Discussion about your comments
Sebastian Egner, 21 May 2005 16:20 UTC

Panu wrote:
> ... bounds or no bounds to hash functions?
>
> I don't get this argument.  For the modular polynomial method to
> work, it suffices for the mantissa and modulus to be relatively
> prime.  This can be achieved easily by picking a prime modulus, but
> also by picking a prime mantissa.

No. The modular polynomial method works if and only if the mantissa
has sufficient multiplicative order modulo m. Example: I choose the
modulus m = 120 and you are to combine five hash functions
h_0, ..., h_4 using the polynomial method into

  h = x -> h_0(x) + r h_1(x) + r^2 h_2(x) + ... + r^4 h_4(x) mod m.

How do you choose r to get a decent hash function?

The sad answer: You don't. Reason: Units(Z/120Z) = Z2^3 x Z4. There
are no elements of order 5 or more modulo 120, which means h
effectively just *adds* at least two of the component functions no
matter how you choose r.
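To make this concrete, here is a minimal sketch in plain Scheme (the
procedure powers is mine, purely for illustration); it lists
r^0, ..., r^4 mod 120 and shows that at least two coefficients always
coincide:

  (define (powers r k m)
    ;; list of (r^0 mod m, r^1 mod m, ..., r^(k-1) mod m)
    (let loop ((i 0) (p 1) (acc '()))
      (if (= i k)
          (reverse acc)
          (loop (+ i 1) (modulo (* p r) m) (cons p acc)))))

  (powers 7 5 120)   ; => (1 7 49 103 1)  r^4 = r^0: h_0, h_4 get added
  (powers 11 5 120)  ; => (1 11 1 11 1)   order 2: h_0, h_2, h_4 added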

If you want another example: Load your own reference implementation,
construct a hash table using (make-string-hash-table 37), and insert a
couple of strings that end in the same character. You will find that
they all end up in the same bin. Why? Because I have chosen a bad
modulus m (by giving the size hint 37) and string-hash uses the
modular polynomial method with r = 37, which is 0 mod m.
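To illustrate, a simplified polynomial string hash (a hypothetical
stand-in, not the exact code of the reference implementation): with
r = 37 and m = 37, every term carrying a factor r vanishes, so only
the last character survives.

  (define (poly-string-hash s r m)
    ;; Horner's rule: h = c_0 r^(n-1) + ... + c_(n-1)  (mod m)
    (let loop ((cs (string->list s)) (h 0))
      (if (null? cs)
          h
          (loop (cdr cs)
                (modulo (+ (* h r) (char->integer (car cs))) m)))))

  (poly-string-hash "cat" 37 37) ; => 5, i.e. (char->integer #\t) mod 37
  (poly-string-hash "rat" 37 37) ; => 5 as well: same bin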

> I would have thought it to be more general if the hash function gets
> to know the modulus beforehand, not more restrictive.

In the end, both approaches are equivalent. In your current setup the
target range is communicated to the hash function with every call. The
hash function needs to produce a value in that range, but it can call
helper hash functions with other ranges (e.g. a large prime). In the
other approach, there is a single large range (probably a prime close
to but not exceeding the maximal immediate integer) known to every
hash function beforehand---and it is the task of the hash table to
reduce it to the range it needs.
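A small sketch of this equivalence, with illustrative names that are
not part of SRFI-69 and integers standing in for arbitrary keys:

  ;; Style (a): the target range comes with every call.
  (define (hash-a n bound) (modulo n bound))

  ;; Style (b): one fixed, statically known prime range; the *table*
  ;; reduces further to the number of bins it actually has.
  (define *range* (- (expt 2 29) 3))       ; a large prime
  (define (hash-b n) (modulo n *range*))
  (define (bin-index n table-size)         ; done inside the hash table
    (modulo (hash-b n) table-size))

  ;; Style (a) is recovered from style (b):
  (define (a-from-b n bound) (modulo (hash-b n) bound))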

> That said, I don't have much reason for requiring the bound
> parameter.  The most important reasons for it were (1) the
> observation that the hash seems to almost always be converted to
> some range anyway, so (hash obj bound) seems cleaner than
> (modulo (hash obj) bound), and (2) the bonus of getting arbitrarily
> big hash ranges on request (with a fixed range, it is easy to reduce
> it but not extend it).

Correct, but there is also not so much reason to have no bound
parameter. In the end it is about efficiency, and easily defining good
hash functions is probably the key to that. (I am not sure (2) is a
bonus; bignums can hurt performance quite a bit here.)

> As any non-bounded hash function h1 can be made into a bound-aware
> hash function h2 by
>
> (define (h2 obj bound) (modulo (h1 obj) bound))
>
> one can't say that it makes it much more difficult to define
> bound-aware hash functions than boundless ones.  Am I missing
> something?

"Unbounded hash functions" sounds like a bad idea to
me,
I am proposing a global statically-known default range
that
is implementation-dependent but prime and sufficiently
large.

(Aside: Is it on purpose that *default-bound* in the reference
implementation is not a prime? m = 2^29 - 3 would be prime, and
r = 256592988 is a primitive element mod m with an interesting bit
pattern; r = 2 and r = 2^16 + 1 are others. Your choice is
m = 2^29 - 2 and r = 37, which only has order 1008 mod m.)
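Such order claims are easy to check for oneself; a minimal sketch,
which terminates only when r is a unit mod m:

  (define (mult-order r m)
    ;; smallest k >= 1 with r^k = 1 (mod m); requires gcd(r, m) = 1
    (let loop ((p (modulo r m)) (k 1))
      (if (= p 1)
          k
          (loop (modulo (* p r) m) (+ k 1)))))

  (mult-order 37 (- (expt 2 29) 2)) ; => 1008, per the claim above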

(Another aside: It might be worthwhile to use SRFI-16 [case-lambda] in
the reference implementation.)
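For instance, an optional bound argument could be expressed like this;
a sketch assuming SRFI-16 is loaded, with the multiplier 31 and all
names purely illustrative:

  (define *default-bound* (- (expt 2 29) 3))  ; a prime default range

  (define string-hash
    (case-lambda
      ((s)       (string-hash s *default-bound*))
      ((s bound) (let loop ((cs (string->list s)) (h 0))
                   (if (null? cs)
                       h
                       (loop (cdr cs)
                             (modulo (+ (* h 31)
                                        (char->integer (car cs)))
                                     bound)))))))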

---

> hash-table-get, -put!, -remove!:
>        These are provided as compatibility procedures to ease
>        porting of non-SRFI-69 code onto SRFI-69?  If you think this
>        is not a worthwhile goal, I can purge them from the SRFI.
>        -ref, -set!, -delete! are IMO clearly better but less used.
>        I added a "don't use this" note to the compatibility versions.

If you want to get rid of these procedures in legacy code, this SRFI
is the place to do it. It might be the only chance there will be for a
long time.

---

> order independence of -keys and -values:
>         This is not because of prospective performance merits, but
>         to discourage bad programming style.  If one wants to
>         process the association pairs of a hash table, one should
>         not be using -keys and -values.

I am afraid the pragmatics are exactly the other way around: by not
specifying -keys and -values to be in sync, you do not discourage
anybody from anything. It is more an invitation for the user to hurt
himself with some unexpected, unportable feature.

It is also not obvious to me that using -keys and -values in sync,
provided they are specified to be, is "bad programming style." What is
the harm?

"Unspecified" in a standard usually means "I would
like to give
choice to the implementors." If you really want to
discourage people
from assuming that -keys and -values are (map car
(hash-table->alist h))
and (map cdr (hash-table->alist h)) then you can
better *specify*
-keys and -values to disagree: Then programs assuming
-keys and
-values being in sync fail right away, and not at the
time they are
ported to some other implementation of this SRFI.

  Sebastian.
