I am implementing a few algorithms and data structures that internally hash their keys.
The requirement I have is that equal values (according to Eq) hash the same way, and unequal values hash in different ways. In other words, equal values feed the same serialized stream to a Hasher, and unequal values feed different streams. This doesn't mean that they necessarily produce different hash values, just that unequal values give the Hasher a chance to produce different values.
I am also implementing a Hasher that is specialized to my data structures. My data structures will detect hash collisions and deal with them.
However, if Hash fed the same stream for unequal keys, my data structure would go into an infinite loop, trying forever to resolve the collision without success. Alternatively, I could add extra checks so that it panics in that scenario.
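For illustration, here is a hand-written Hash impl (on a made-up Point type) that is legal under Hash's contract but feeds identical streams for unequal values, which is exactly the degenerate case described above:

```rust
use std::hash::{Hash, Hasher};

#[derive(PartialEq, Eq)]
struct Point {
    x: i32,
    y: i32,
}

// Legal per Hash's contract (equal values produce equal streams), but
// stream-degenerate: unequal points with the same x feed the Hasher
// identical byte streams.
impl Hash for Point {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.x.hash(state); // y is silently ignored
    }
}

fn main() {
    use std::collections::hash_map::DefaultHasher;

    let a = Point { x: 1, y: 2 };
    let b = Point { x: 1, y: 3 };
    assert!(a != b);

    let (mut ha, mut hb) = (DefaultHasher::new(), DefaultHasher::new());
    a.hash(&mut ha);
    b.hash(&mut hb);
    // Same stream, hence necessarily the same hash value.
    assert_eq!(ha.finish(), hb.finish());
}
```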
So Hash is almost the trait that I can use.
However, the documentation of Hash says that equal values "must" produce the same streams, but for unequal values it only says that they "should" produce different streams, so it's not a strict requirement.
So what would you recommend:
Define a different trait (UniqueHash or something) that has that stricter requirement? The downside is that I can't reuse existing implementations of Hash. And in practice for almost all types, the implementation of UniqueHash would be identical to the implementation of Hash.
Make UniqueHash a marker trait with Hash as a supertrait? Users would have to explicitly opt their key types in.
Use Hash and just document that if it doesn't have the uniqueness property then my data structure will hang forever or panic?
Right. That was my preference too. My slight worry is that technically std doesn't document that Hash actually has this property for primitive types and standard strings and collection types. But I think it does in reality.
I guess one thing you could do is add tests which make sure (for example) that hashing a &str feeds the same sequence of bytes as hashing the &[u8] you get from str::as_bytes() (or to_ne_bytes() for integers, or whatever).
As long as those sorts of tests pass, your UniqueHash implementation is valid.
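One way to write such tests is a Hasher that merely records the bytes it is fed (a sketch; RecordingHasher and stream_of are made-up names). Note that std's Hash impls may add framing such as length prefixes or terminators, so the &str stream need not be byte-for-byte identical to the &[u8] stream; the invariant to pin down is that equal values give equal streams and unequal values give different ones:

```rust
use std::hash::{Hash, Hasher};

/// A Hasher that records the exact byte stream a Hash impl feeds it.
#[derive(Default)]
struct RecordingHasher {
    bytes: Vec<u8>,
}

impl Hasher for RecordingHasher {
    fn write(&mut self, bytes: &[u8]) {
        self.bytes.extend_from_slice(bytes);
    }
    fn finish(&self) -> u64 {
        0 // only the recorded stream matters here, not a hash value
    }
}

/// Returns the full stream `value` feeds to a Hasher.
fn stream_of<T: Hash + ?Sized>(value: &T) -> Vec<u8> {
    let mut h = RecordingHasher::default();
    value.hash(&mut h);
    h.bytes
}

fn main() {
    // Equal values must produce identical streams...
    assert_eq!(stream_of("abc"), stream_of("abc"));
    // ...and for UniqueHash we additionally require distinct streams
    // for unequal values.
    assert_ne!(stream_of("abc"), stream_of("abd"));
    assert_ne!(stream_of(&1u32), stream_of(&2u32));
}
```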
The marker trait approach is certainly workable, although it's only slightly better than merely documenting the additional requirement.
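For concreteness, the marker-trait variant could look like this (a sketch; the trait name and the chosen impls are assumptions):

```rust
use std::hash::Hash;

/// Marker trait (hypothetical): implementors promise that unequal
/// values feed *different* byte streams to the Hasher, i.e.
/// a == b if and only if a and b produce identical streams.
trait UniqueHash: Hash {}

// Explicit opt-in for types believed to have the property.
impl UniqueHash for u32 {}
impl UniqueHash for u64 {}
impl UniqueHash for str {}
impl UniqueHash for String {}
// Note: a blanket impl like `impl<T: UniqueHash> UniqueHash for Vec<T>`
// needs care: a concatenation of unique streams is only unique again
// if the element streams are also prefix-free.

// Data structures can then demand the stronger bound.
fn check<K: UniqueHash + Eq>(a: &K, b: &K) -> bool {
    a == b
}

fn main() {
    assert!(check(&1u32, &1u32));
    assert!(!check(&"a".to_string(), &"b".to_string()));
}
```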
However, I think it might be worth taking a step back. It seems to me like you aren't really worried about the hashes colliding, but about the equality of the values themselves. My question is: why are you trying to constrain the sequence fed to the hasher and then effectively recover the identity of the value from it? If different values are required to produce different sequences, then there is a bijection between values and hash sequences, i.e. you don't gain anything over inspecting the value itself. Given that you also have access to the value, couldn't you simply compare the values themselves in case of a collision?
Hmm, it seems like I don't understand the problem yet. Why would you need to compare a value to all other values using linear search, i.e. how would my proposal change the implementation?
I unfortunately don't see how perfect hashing compares to your requirements. AFAIK "perfect" hashing refers to the fact that a suitably chosen hash over a known, small domain can be constructed in a way that it's guaranteed collision-free yet it compresses significantly. I'm not familiar with how that can be achieved by two different, randomly chosen hashers – after all, it's conceivable (although of quadratically low probability) that some key(s) will produce hash collisions with both.
Right. But what I'm still missing is how you profit from being aware of the uniqueness of the hashing sequence in this case? If you have to compare the sequences, it seems to me it doesn't matter that there is only ever one instance of a dumb/malicious key that produces an empty sequence; you still have to compare each of them in order to find the sequence that matches whatever happens to be produced by the key, whether it's empty or not, whether it's unique or not.
No. Wikipedia explains how to deal with collisions. If the Wikipedia explanation doesn't work for you, you can read their paper that is linked there. It's a widely cited algorithm too.
If Hash is unique, then that scenario is not possible -- you can't have a million keys that all feed the same thing to the Hasher. So you only have to compare at most one using Eq, not a million.
In general, the idea is that if the hash function doesn't work well, you select a different instance of a Hasher, until one works well. For UniqueHash you're guaranteed that you'll soon find one that works well. Otherwise, you will never find one.
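That retry step can be sketched like this (assumed names; using std's RandomState as the family of random hashers):

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash};

/// Pick fresh hasher instances until one hashes all keys without
/// collisions. If unequal keys are guaranteed to feed different
/// streams, a few attempts are expected to suffice; if two keys feed
/// identical streams, no choice of hasher can ever separate them and
/// the loop gives up after `max_attempts`.
fn find_collision_free<K: Hash>(keys: &[K], max_attempts: usize) -> Option<RandomState> {
    for _ in 0..max_attempts {
        let state = RandomState::new();
        let mut hashes: Vec<u64> = keys.iter().map(|k| state.hash_one(k)).collect();
        hashes.sort_unstable();
        if hashes.windows(2).all(|w| w[0] != w[1]) {
            return Some(state);
        }
    }
    None
}

fn main() {
    // Distinct keys: some hasher separates them almost immediately.
    assert!(find_collision_free(&["a", "b", "c"], 16).is_some());
    // Duplicate keys feed identical streams, so no hasher ever works.
    assert!(find_collision_free(&["a", "a"], 16).is_none());
}
```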
But that was just an example. There are other data structures that benefit from having unique hashes. There is universal hashing (with its own guarantees), there is dynamic perfect hashing, etc. All of these require UniqueHash to work as advertised.