John Cowan wrote:

> > I'm basically saying the same argument holds for circular lists
> > as well as for vectors.  The argument would also hold for any
> > other structures that might be circular, such as procedures
> > whose representations involve circularity (as is often the case
> > because that's a standard way to implement recursion and scope).
>
> In that case, all the objects affected are out of scope for SRFI 128,
> and you can do whatever you like: the rules given in the SRFI don't
> apply.

The problem is that I can't do whatever I like on vectors and lists
until I have determined that the argument to default-hash is circular,
which is an expensive determination.  Making the default-hash procedure
even slower on most structured arguments is not a realistic option.

By the way, I was able to get the sample implementation of SRFI 128 to
run to completion on my benchmark by using Larceny's regional collector,
which is more resilient against near-worst-case storage allocation.
Here are the numbers measured over all bytevectors, non-circular
vectors, and non-circular lists found within Larceny's R7RS heap:

              collisions (%)   rms/bucket   nanoseconds/hash

bytevectors

                  62.0%          31.10            1209    R6RS equal-hash
                  62.0%          31.10            1206    SRFI 126 equal-hash
                  55.9%          31.06          251213    SRFI 128 default-hash
                  55.9%          31.06          241068    SRFI 125 hash-by-identity

vectors

                  36.9%           7.04             705    R6RS equal-hash
                  36.9%           7.04             719    SRFI 126 equal-hash
                  31.2%           7.02          430671    SRFI 128 default-hash
                  31.2%           7.02          436762    SRFI 125 hash-by-identity

lists

                  57.9%          20.79             552    R6RS equal-hash
                  57.9%          20.77             700    SRFI 126 equal-hash
                  48.7%          29.06          374208    SRFI 128 default-hash
                  48.7%          29.06          385643    SRFI 125 hash-by-identity

The first column shows the percentage of hash collisions, but the
second column is more important.  That second column shows the root
mean square number of objects per bucket when the number of buckets is
equal to the number of objects.
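For concreteness, the rms/bucket metric in that second column can be
computed as follows.  This is a hypothetical sketch, not the benchmark
code; the name `rms-per-bucket` and the representation of the table's
occupancies as a list of counts are my own assumptions here:

```scheme
;; Root mean square number of objects per bucket, given a list whose
;; elements are the number of objects that hashed into each bucket.
;; With n buckets holding n objects, a perfect distribution (one
;; object per bucket) yields an rms of 1.
(define (rms-per-bucket bucket-counts)
  (let ((n-buckets (length bucket-counts)))
    (sqrt (/ (apply + (map (lambda (c) (* c c)) bucket-counts))
             n-buckets))))
```

Because the counts are squared before averaging, a few badly
overloaded buckets raise the rms sharply even when the raw collision
percentage looks similar, which is why the second column is the more
informative of the two.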
(If those numbers are higher than you'd expect, please remember that
all empty vectors and bytevectors must have the same hash code, and
there are multiple copies of many other short bytevectors, vectors,
and lists lying around in Larceny's R7RS heap.)

As can be seen from the third column, the SRFI 128 and SRFI 125 sample
implementations are two to three decimal orders of magnitude slower
than SRFI 126 without delivering any meaningful improvement.  By not
guaranteeing termination on all arguments, the hash functions specified
by SRFI 128 and SRFI 125 actually deliver *less* at a far higher cost.

Some of the poor performance you're seeing in that table may be due to
the way SRFI 128's sample implementation is written, but a lot of that
poor performance is attributable to the SRFI 128 mustard.  I didn't
realize the mustard was so bad until I tried to fix the performance
problem and realized the obvious fix was blocked by SRFI 128 mustard.

> > If that's the intent, then the SRFI 128 mustard was intended to
> > preclude efficient implementation of an always-terminating version
> > of default-hash, because the standard way to achieve an efficient
> > always-terminating hash function looks at only a bounded portion
> > of its argument.
>
> Do I understand correctly that when given a non-circular vector, the
> SRFI 126 hash procedure examines only at most a fixed number of
> elements?

You are conflating specification with implementation.  SRFI 126 does
not contain any mustard that requires its equal-hash procedure to
examine every part of its argument.  SRFI 126 allows equal-hash to
examine only part of its argument (but does not require that behavior).
SRFI 126 does require equal-hash to terminate on all arguments,
however, and the standard way to accomplish that is to hash on only a
bounded portion of the argument.
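The bounded-portion technique can be sketched roughly as follows.
This is an illustrative sketch with hypothetical names
(`bounded-list-hash`, `element-hash`); it is not taken from any SRFI's
sample implementation:

```scheme
;; Hash at most `bound` elements of a possibly-circular list.  Because
;; the traversal is cut off after `bound` pairs, the procedure
;; terminates even on circular lists, with no cycle detection and no
;; hash-table bookkeeping.
(define (bounded-list-hash lst bound)
  (define (element-hash x)              ; placeholder component hash
    (cond ((exact-integer? x) x)
          ((char? x) (char->integer x))
          (else 1)))
  (let loop ((lst lst) (k bound) (h 0))
    (if (or (zero? k) (not (pair? lst)))
        h
        (loop (cdr lst)
              (- k 1)
              (modulo (+ (* 31 h) (element-hash (car lst)))
                      1000000007)))))
```

Note that the termination guarantee comes purely from the bound, not
from inspecting the structure for cycles, which is what makes the
technique cheap on non-circular arguments as well.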
As a matter of current practice, most (if not all) implementations of
(rnrs hashtables) export an implementation of equal-hash that examines
only a bounded portion of any argument that might be circular.

The sample implementation of SRFI 126 defines equal-hash to be the
same as R6RS equal-hash whenever the (rnrs hashtables) library is
available.  (It ought just to re-export that binding, but that's a
different matter.)  That means there are some systems in which the
sample implementation of SRFI 126 defines an equal-hash that examines
only a bounded portion of arguments that might be circular, but there
might also be a few systems in which the sample implementation of
SRFI 126 defines an equal-hash that uses the much slower technique of
examining every element of every nested vector, using hash tables to
avoid non-termination.

Of course, there might also be some systems that try to use the sample
implementation of SRFI 126 on top of the sample implementation of
SRFI 69, yielding an equal-hash that doesn't always terminate, but
that implementation technique is forbidden by SRFI 126 mustard:

    "Like equal?, the equal-hash procedure must always terminate,
    even if its arguments contain cycles."

> That seems to me problematic.  The original version of Java's
> String#hash function examined at most 16 characters of the string
> being hashed, which tended to cluster results badly.  Consequently,
> Java 1.2 replaced it with a function that examined all the
> characters, essentially similar to the SRFI 128 sample
> implementation.  Undoubtedly the new algorithm was slower in many
> cases, but it delivered better results.

Strings can't be circular, so SRFI 126 equal-hash can't get into
trouble by examining every element of a string.  Examining only 16
characters of a string is undoubtedly problematic, but basing the hash
upon some bounded number of characters (say 128) should work pretty
well and would run faster on extremely large strings.
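To illustrate that suggestion (a hypothetical sketch, not code from
any SRFI or from Larceny), here is a string hash that examines at most
128 characters but also mixes in the full string length, so that
strings of different lengths sharing a long common prefix still tend
to get different hash codes:

```scheme
;; Hash at most 128 characters of a string.  Strings can't be
;; circular, so the bound here is purely a speed/quality tradeoff:
;; very large strings are hashed in constant time, at the cost of
;; collisions among same-length strings that agree on their first
;; 128 characters.
(define (bounded-string-hash s)
  (let ((n (min (string-length s) 128)))
    (let loop ((i 0) (h (string-length s)))   ; seed with full length
      (if (= i n)
          h
          (loop (+ i 1)
                (modulo (+ (* 31 h) (char->integer (string-ref s i)))
                        1000000007))))))
```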
For proof, see the numbers for bytevectors in the table above:
Larceny's implementation of R6RS and SRFI 126 equal-hash examines only
a bounded number of a bytevector's elements (roughly 35 IIRC), but the
quality of its hashes is clearly on par with the hashes computed by
the sample implementations of SRFI 128 and 125.  That's beside the
point, of course, because there is no danger of non-termination when
hashing strings or bytevectors.

If you're using Java specifications as precedent, you should note that
Java does not require its hashCode() methods to examine every part of
the object being hashed, and does not require the values returned by
x.hashCode() to be a mathematical function of the hash values of x's
components.  So far as I know, that mustard is unique to SRFI 128.

I guess it's too late to repair SRFI 128, but I would urge Working
Group 2 to avoid these mistakes in R7RS-large.

We might still have time to repair SRFI 125.  Instead of strongly
encouraging use of SRFI 128 comparators, SRFI 125 could strongly
encourage use of SRFI 1xx comparators, where SRFI 1xx is a SRFI not
yet written that would be almost identical to SRFI 128 but omit the
problematic mustard.  Another alternative is for SRFI 125 to deprecate
the problematic mustard of SRFI 128, but that might be going too far
in the direction of trying to make substantive changes to a finalized
SRFI.

Will