I have compared some very basic hash table operations between the SRFI 250 sample implementation and Chez Scheme's implementation of (rnrs hashtables (6)), which is not insertion-ordered. I chose Chez Scheme's implementation because:

1. I was able to get the sample implementation running on it.
2. Its native hashtables provide a fair comparison because they are written in Scheme, not in C.
3. Chez also has a procedure for calculating the memory use of an individual object.

Gambit could, in theory, serve as another reasonably fair point of comparison for speed (though it seems to lack a way to measure memory use), but I was not able to get the sample implementation running under its native R7RS libraries implementation, nor under Gerbil's. Guile's hash table situation involves some weird wrapping of an implementation that is natively in C anyway. Hoot might be the next comparison I try.

These comparisons were done on an Apple M4. Both Chez's standard library and the version of the sample implementation I used were compiled at Chez's optimize-level 3, which disables type checks and makes the implementation unsafe; this, too, was done for fairness of comparison. Arguably it would have been fairer still to recompile Chez with its own standard libraries set to optimize-level 2 (which is safe in the sense of the R6RS), but I didn't investigate how to do this. I also did some Chez-specific tuning, specifically replacing some R7RS-compatible generic arithmetic operations with R6RS-only fixnum operations; I don't know yet whether I'll add an option for this to the shipping version of the sample implementation.

Some results:

The SRFI 250 sample implementation uses half as much memory per hash table. This result is consistent across hash tables of various sizes (the precise figure varies between roughly 47% and 52%). This counts only the memory used by the hash tables themselves, not the keys and values within.

The sample implementation's hash-table-ref is slower (for completely random access to existing entries) than Chez's hashtable-ref by about 1.3x. It varies from, at best, 1.1x slower on large hash tables (10,000 entries) to, at worst, 1.5x slower on smaller ones (100 entries); 1.3x is still about the average for tables at both extremes of size. To give an idea of how utterly 'down in the weeds' this difference is in practice: the benchmark was repeated 100,000,000 times, giving about 0.0000000155 seconds per loop (roughly 15.5 ns, including time spent in the random number generator used to pick a key to look up) with Chez's own hashtables, versus 0.0000000205 seconds (roughly 20.5 ns) with mine.

Adding new entries falls into two cases: when the capacity is correctly known and allocated in advance, and when it isn't known and the hash table has to be repeatedly resized upwards. In the former case, the SRFI 250 sample implementation is 1.8x slower than Chez's own implementation. The resizing case is a lot worse: my implementation is about 4.5x slower. However, this may be because of suboptimal growth factors on the entry vectors and/or the hash buckets: 3/2 and 4/3 respectively. The 3/2 factor for the entry vectors is fairly standard for dynamically-sized arrays in various garbage-collected languages, but 4/3 is very small for a dynamically-sized array by normal standards. I chose it simply because I copied PyPy, where the maximum load is set to 2/3 and the minimum (except when the table is brand new) to 1/2. That works out to a 4/3 growth factor: resizing at 2/3 load back down to 1/2 load means the new capacity is (2/3)/(1/2) = 4/3 times the old one.

Suboptimal calculation of the new sizes probably also plays a smaller role. The advantage of 3/2 is that it can be calculated as (old_size + (old_size >> 1)), but my implementation currently does (floor (* old-size 3/2)) – generic arithmetic loses again. The same goes for calculating the new size of the buckets array. I haven't yet experimented with changing these factors or with calculating them in more efficient ways. (Even if 4/3 is roughly right for optimal load, it might be better to try, say, 11/8 as an approximation, and calculate it as (old_size + (old_size >> 2) + (old_size >> 3)).) A sketch of the kind of calculation I mean is directly below.
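To be concrete – this is only an illustrative sketch, not the sample implementation's actual code, and the procedure names are invented for this message – the variants would look something like the following, using the R6RS fixnum operations:

  ;; Illustrative only.  grow-generic is the current style of calculation;
  ;; grow-fixnum-3/2 and grow-fixnum-11/8 are the shift-and-add alternatives.
  (import (rnrs))   ; for fx+ and fxarithmetic-shift-right

  ;; Current style: generic arithmetic, going through an exact ratio.
  (define (grow-generic old-size)
    (floor (* old-size 3/2)))

  ;; 3/2 growth with fixnums only: old + (old >> 1).
  (define (grow-fixnum-3/2 old-size)
    (fx+ old-size (fxarithmetic-shift-right old-size 1)))

  ;; 11/8 growth, also fixnums only: old + (old >> 2) + (old >> 3).
  (define (grow-fixnum-11/8 old-size)
    (fx+ old-size
         (fx+ (fxarithmetic-shift-right old-size 2)
              (fxarithmetic-shift-right old-size 3))))

  ;; e.g. (grow-generic 1000) => 1500, (grow-fixnum-3/2 1000) => 1500,
  ;;      (grow-fixnum-11/8 1000) => 1375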
Deletion from the front (i.e. of the least-recently-added entries) is indeed a *lot* slower with Hettinger tables than with classical, non-ordered hash tables. Deletion of random entries isn't as bad, and deletion of entries from the end (i.e. of the most-recently-added entries) is speedy. Someone* has worked out how to deal with the slowdown that comes from handling tombstones in Hettinger tables, though I think it will take some cleverness to use their trick in my implementation. More results to come :-)

(* sorry, I forgot where I read about it)

Conclusions: SRFI 250 hash tables have significantly better memory usage than Chez's native, unordered hash tables, without significantly worse performance. In my opinion, the improved memory use makes up for the performance difference that remains. Whether the remaining performance differences can be ironed out, I don't know; for -ref operations at least, I have given up making improvements for now.

Chez's hash tables use chaining; my implementation uses open addressing, which at these loads should be quicker according to the textbooks. But my implementation's access to the actual entries is indirected through a separate array of buckets, which isn't in the textbooks. (A rough sketch of the kind of layout I mean is at the end of this message.) In theory, that indirection should make little performance difference (especially once the CPU cache is warm), but maybe it is the cause. Alternatively, maybe there are still some changes I could make to the sample implementation that would get -ref performance into a 0.9x–1.1x-ish range. But Chez's options for profiling at this level are pretty useless, and to me it's currently not worth the trial-and-error guesswork to find out where I can squeeze another 25% of performance out.

If the performance of insertion without regular reallocation could be improved, the performance of insertion with reallocation would also improve, and likely by more than the corresponding linear factor. (Initial measurements, before insertion was optimized slightly, were 2x slower for non-growing insertions and 7x slower for growing insertions; just the improvement to 1.8x for non-growing insertions also sped growing insertions up to 4.5x slower than Chez's baseline, rather than the 5.5x–6.5x one might naïvely expect.) However, again, I feel I've run up against the limitations of Chez's profiling here. I suspect that a better strategy for dealing with or avoiding hash collisions might help with this particular benchmark (the profile I have shows that about 40% of probes led to a re-probe), for -ref and -add! performance alike.

Finally, my implementation checks the type of every entry before allowing a -ref or -add! operation, whereas R6RS's API design delegates that task exclusively to the hash function and equality check. Removing the type test made an utterly negligible performance difference on these benchmarks.

Daphne
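Here is the rough sketch of the bucket indirection mentioned above. It is not the sample implementation's code – the names and the toy linear probing are invented for illustration, and growth, deletion and custom hash/equality procedures are all left out – but it shows the general shape: keys and values live in dense vectors in insertion order, while a separate sparse vector of buckets holds indices into them, so a lookup probes the bucket vector and then takes one extra hop to reach the actual entry.

  (import (rnrs))   ; for equal-hash and mod

  (define *empty* -1)   ; marker for an unused bucket

  (define (make-toy-table bucket-count)
    (vector (make-vector bucket-count *empty*)   ; buckets: entry indices or *empty*
            (make-vector bucket-count #f)        ; keys, dense, in insertion order
            (make-vector bucket-count #f)        ; values, parallel to the keys
            0))                                  ; number of entries so far

  (define (toy-hash key bucket-count)
    (mod (equal-hash key) bucket-count))

  (define (toy-ref table key default)
    (let* ((buckets (vector-ref table 0))
           (keys    (vector-ref table 1))
           (vals    (vector-ref table 2))
           (n       (vector-length buckets)))
      (let probe ((b (toy-hash key n)))
        (let ((i (vector-ref buckets b)))
          (cond ((eqv? i *empty*) default)            ; never stored
                ((equal? (vector-ref keys i) key)     ; found: one extra hop
                 (vector-ref vals i))
                (else (probe (mod (+ b 1) n))))))))   ; linear re-probe

  (define (toy-add! table key value)
    (let* ((buckets (vector-ref table 0))
           (keys    (vector-ref table 1))
           (vals    (vector-ref table 2))
           (n       (vector-length buckets))
           (count   (vector-ref table 3)))
      (let probe ((b (toy-hash key n)))
        (let ((i (vector-ref buckets b)))
          (cond ((eqv? i *empty*)                     ; free slot: append the entry
                 (vector-set! buckets b count)
                 (vector-set! keys count key)
                 (vector-set! vals count value)
                 (vector-set! table 3 (+ count 1)))
                ((equal? (vector-ref keys i) key)     ; key already present: overwrite
                 (vector-set! vals i value))
                (else (probe (mod (+ b 1) n))))))))

  ;; e.g.:
  ;; (define t (make-toy-table 8))
  ;; (toy-add! t 'a 1)
  ;; (toy-add! t 'b 2)
  ;; (toy-ref t 'a #f)   ; => 1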