On Sun, Nov 29, 2015 at 10:52 PM, Shiro Kawai <xxxxxx@gmail.com> wrote:
Regarding global seed.  If one wants to use hash function for persistent data
(here I mean the data that are exchanged beyond process boundary), we need
to explicitly set the seed.  Wrapping calls to hash fn by parameterize may do,
but passing seed argument will be much cleaner (and probably less overhead;
the seed must be at least thread-local).

One suggestion for persistence was to use an env variable to
initialize the value.

But this only gives the weak-security benefit of a seed.  It doesn't
allow any of the additional applications discussed:

  * per-table seed (for better security/performance)
  * distributed/disk-based hash tables (assuming no bounds arg)
  * dynamic perfect hash tables
  * bloom filters
  * minhash

For these we need an explicit argument.

-- 
Alex

On Sun, Nov 29, 2015 at 6:26 PM, John Cowan <xxxxxx@mercury.ccil.org> wrote:
Alex Shinn scripsit:

> This assumption is plainly wrong as it depends on the actual range
> of fixnums.

However, R6RS mandates at least 24-bit fixnums, and looking at FixnumInfo,
we see that all the widely used Schemes have at least 29-bit fixnums.
2^28-1 is a *lot* of buckets on a 32-bit machine.  There are a few
Schemes that have the same size fixnums on both 32-bit and 64-bit builds
(Elk, IronScheme, Larceny, Mosh, RScheme), but I imagine it wouldn't
be horrendously hard to change that, except perrhaps for IronScheme
because of its CLR dependency.

> What makes hash functions special is that they form a contract
> interacting with the hash table size limit, a contract which is exposed
> to third parties who can write their own hash functions.  The concern
> then is that we could inadvertently make it _impossible_ to write a
> hash table implementation which supported certain hash tables which
> would otherwise be possible, i.e. we could be forcing an artificial
> upper bound.

Well, yes.  If a user-written hash function never delivers a result
larger than 2^20, then hash tables that would naturally require more
than 2^20 buckets aren't going to work properly.  I don't see how
you can avoid this.

> but a majority of Schemes use 2 bits,

(Technically only a plurality:  14 out of 39 in 32-bit mode, and in 64-bit
only 8 out of 35.)

> But I'm serious about the seed arguments.

I am too, but I'm not convinced that an argument is the correct strategy.
I still incline to the idea of a global seed that is set randomly before
any user code executes.  This disallows hyper-paranoid implementations
with a distinct seed per hash table, but makes management of hash
functions outside of hash tables much simpler.

--
John Cowan          http://www.ccil.org/~cowan        xxxxxx@ccil.org
Híggledy-pěggledy / XML programmers
Try to escape those / I-eighteen-N woes;
Incontrovertibly / What we need more of is
Unicode weenies and / François Yergeaus.