How to specify my hashing requirements?

I am implementing a few algorithms and data structures that internally hash their keys.

The requirement I have is that equal values (according to Eq) hash the same way, and unequal values hash in different ways. In other words, equal values feed the same serialized stream to a Hasher, and unequal values feed different streams. This doesn't mean they necessarily produce different hash values, just that they give the Hasher a chance to produce different values.

I am also implementing a Hasher that is specialized to my data structures. My data structures will detect hash collisions and deal with them.

However, if a Hash implementation fed the same stream for unequal keys, my data structure would go into an infinite loop, forever trying to resolve the collision without success. Alternatively, I could add extra checks so that it panics in that scenario.

So Hash is almost the trait that I can use.

However, the documentation of Hash says that equal values "must" produce the same streams, but for unequal values it only says that they "should" produce different streams, so it's not a strict requirement.

So what would you recommend:

  1. Define a different trait (UniqueHash or something) that has that stricter requirement? The downside is that I can't reuse existing implementations of Hash. And in practice for almost all types, the implementation of UniqueHash would be identical to the implementation of Hash.

  2. Make UniqueHash a marker trait with Hash as a supertrait? Users would have to explicitly opt in their key types.

  3. Use Hash and just document that if it doesn't have the uniqueness property then my data structure will hang forever or panic?

That's the approach I'd take. If you implement it for most std types (strings, integers, slices/vectors of UniqueHash types, etc.) then the vast majority of key types will just work.
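A minimal sketch of that marker-trait approach might look like the following (all names here, including UniqueHash itself and the accepts helper, are hypothetical; which std impls are actually injective is an assumption you'd want to verify):

```rust
use std::hash::Hash;

/// Marker trait: implementors additionally promise that unequal
/// values feed *different* byte streams to the Hasher, on top of
/// Hash's guarantee that equal values feed identical streams.
trait UniqueHash: Hash {}

// Opt in std types whose Hash impls are believed to be injective.
impl UniqueHash for u32 {}
impl UniqueHash for u64 {}
impl UniqueHash for str {}
impl UniqueHash for String {}
impl<T: UniqueHash> UniqueHash for Vec<T> {}
impl<'a, T: UniqueHash + ?Sized> UniqueHash for &'a T {}

// A data structure can then bound its keys on the marker:
fn accepts<K: UniqueHash + ?Sized>(_key: &K) -> bool {
    true
}

fn main() {
    assert!(accepts("key"));
    assert!(accepts(&42u64));
    assert!(accepts(&vec![1u32, 2, 3]));
}
```

Since the trait carries no methods, users opt a type in with a one-line `impl UniqueHash for MyKey {}`, and the compiler rejects keys that never opted in.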

Right. That was my preference too. My slight worry is that technically std doesn't document that Hash actually has this property for primitive types and standard strings and collection types. But I think it does in reality.

I guess one thing you could do is add tests which make sure that (for example) hashing a &str feeds the same sequence of bytes as hashing the &[u8] you get from str::as_bytes() (or from to_ne_bytes() for integers, and so on).

As long as those sorts of tests pass, your UniqueHash implementation is valid.
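One way to write such tests is with a Hasher that records the raw bytes it is fed (Recorder below is a hypothetical name). One caveat: in current std the &str and &[u8] streams need not match verbatim (str hashing appends a terminator byte, and slices get a length prefix), so this sketch only asserts the properties that actually matter here: equal values feed identical streams, and for these types unequal values feed different ones.

```rust
use std::hash::{Hash, Hasher};

/// A Hasher that records the raw byte stream it is fed,
/// so tests can inspect exactly what a Hash impl writes.
#[derive(Default)]
struct Recorder(Vec<u8>);

impl Hasher for Recorder {
    fn write(&mut self, bytes: &[u8]) {
        self.0.extend_from_slice(bytes);
    }
    fn finish(&self) -> u64 {
        0 // irrelevant; we only care about the recorded stream
    }
}

/// Capture the full stream a Hash impl feeds to the Hasher.
fn stream_of<T: Hash + ?Sized>(value: &T) -> Vec<u8> {
    let mut r = Recorder::default();
    value.hash(&mut r);
    r.0
}

fn main() {
    // Equal values must feed identical streams...
    assert_eq!(stream_of("abc"), stream_of(&*String::from("abc")));
    // ...and for these types, unequal values feed different streams.
    assert_ne!(stream_of("abc"), stream_of("abd"));
    assert_ne!(stream_of(&1u32), stream_of(&2u32));
}
```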

The marker trait approach is certainly workable, although it's only slightly better than merely documenting the additional requirement.

However, I think it might be worth taking a step back. It seems to me like you aren't really worried about the hashes colliding, but about the equality of the values themselves. My question is: why are you trying to constrain the sequence fed to the hasher and then effectively recover the identity of the value from it? If different values are required to produce different sequences, then there is a bijection between values and hash sequences, i.e. you don't gain anything over inspecting the value itself. Given that you also have access to the value, couldn't you simply compare the values themselves in case of a collision?

For performance. I don't want to linearly scan for my keys using Eq.

Instead, I use another random instance of my Hasher for such colliding keys, which ends up having performance guarantees (O(1) access time). Think: perfect hashing.

Hmm, it seems like I don't understand the problem yet. Why would you need to compare a value to all other values using linear search, i.e. how would my proposal change the implementation?

I unfortunately don't see how perfect hashing compares to your requirements. AFAIK "perfect" hashing refers to the fact that a suitably chosen hash over a known, small domain can be constructed in a way that it's guaranteed collision-free yet it compresses significantly. I'm not familiar with how that can be achieved by two different, random hashers – after all, it's conceivable (although of quadratically low probability) that some key(s) will produce hash collisions with both.

Well imagine somebody gives you a Hash implementation that doesn't feed anything to the Hasher. How do you find a key quickly among a million such keys?

It can be done for UniqueHash but not for such a poor Hash.
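To make the degenerate case concrete (Opaque is a hypothetical type name), here is a perfectly legal Hash impl that feeds nothing to the Hasher, so every value produces the identical empty stream and no Hasher can ever distinguish two keys:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// A legal but useless Hash impl: it feeds nothing to the Hasher.
struct Opaque(u64);

impl Hash for Opaque {
    fn hash<H: Hasher>(&self, _state: &mut H) {
        // deliberately writes no bytes
    }
}

/// Hash a value with a fresh DefaultHasher.
fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

fn main() {
    // Every Opaque value hashes identically, under *any* Hasher.
    assert_eq!(hash_of(&Opaque(1)), hash_of(&Opaque(2)));
}
```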

There are several such algorithms. One of the simplest is called "FKS hashing". But all these methods only work for UniqueHash, not for any old Hash.

To quote the Wikipedia article:

Ah, so they essentially defined away the problem :sweat_smile:

Right. But what I'm still missing is how you profit from being aware of the uniqueness of the hashing sequence in this case. If you have to compare the sequences, it seems to me it doesn't matter that there is only ever one instance of a dumb/malicious key that produces an empty sequence; you still have to compare each of them in order to find the sequence that matches whatever happens to be produced by the key, whether it's empty or not, whether it's unique or not.

No. Wikipedia explains how to deal with collisions. If the Wikipedia explanation doesn't work for you, you can read their paper that is linked there. It's a widely cited algorithm too.

If Hash is unique, then that scenario is not possible -- you can't have a million keys that all feed the same thing to the Hasher. So you only have to compare at most one key using Eq, not a million.

In general, the idea is that if the hash function doesn't work well, you select a different instance of a Hasher, until one works well. For UniqueHash you're guaranteed that you'll soon find one that works well. Otherwise, you will never find one.

Right, so it's probabilistic all the way down, and potentially more than 2 levels/hashers are allowed. That does make sense.

Actually no, the FKS structure has only 2 levels. Yes, it's a randomized algorithm, but at lookup time it's deterministic: O(1) time in the worst case.
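To make the two levels concrete, here's a stripped-down sketch in the spirit of FKS (my own illustrative code, not taken from the paper; it omits the outer-level retry that bounds total space, so memory here isn't guaranteed O(n)). The outer hasher partitions n keys into n buckets, and each bucket of size b gets a table of size b² plus its own random hasher, retried until collision-free. The inner retry loop is exactly where a non-unique Hash could spin forever:

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

/// Map a key into one of m slots using a given random hasher.
fn slot<K: Hash + ?Sized>(state: &RandomState, key: &K, m: usize) -> usize {
    let mut h = state.build_hasher();
    key.hash(&mut h);
    (h.finish() as usize) % m
}

struct Bucket<K> {
    state: RandomState,
    slots: Vec<Option<K>>, // size b^2 for a bucket holding b keys
}

struct Fks<K> {
    outer: RandomState,
    buckets: Vec<Bucket<K>>,
}

impl<K: Hash + Eq + Clone> Fks<K> {
    fn build(keys: &[K]) -> Self {
        let n = keys.len().max(1);
        let outer = RandomState::new();
        // Level 1: partition keys into n buckets.
        let mut groups: Vec<Vec<K>> = vec![Vec::new(); n];
        for k in keys {
            groups[slot(&outer, k, n)].push(k.clone());
        }
        // Level 2: per bucket, retry random hashers until injective.
        // With a non-unique Hash, this loop might never terminate.
        let buckets = groups
            .into_iter()
            .map(|group| loop {
                let state = RandomState::new();
                let m = (group.len() * group.len()).max(1);
                let mut slots: Vec<Option<K>> = vec![None; m];
                let mut ok = true;
                for k in &group {
                    let i = slot(&state, k, m);
                    if slots[i].is_some() {
                        ok = false;
                        break;
                    }
                    slots[i] = Some(k.clone());
                }
                if ok {
                    break Bucket { state, slots };
                }
            })
            .collect();
        Fks { outer, buckets }
    }

    fn contains(&self, key: &K) -> bool {
        let b = &self.buckets[slot(&self.outer, key, self.buckets.len())];
        // At most one Eq comparison: lookup is deterministic O(1).
        b.slots[slot(&b.state, key, b.slots.len())].as_ref() == Some(key)
    }
}

fn main() {
    let keys: Vec<u64> = (0..100).map(|i| i * 7 + 3).collect();
    let fks = Fks::build(&keys);
    assert!(keys.iter().all(|k| fks.contains(k)));
    assert!(!fks.contains(&1));
}
```

Each bucket's b²-sized table makes a random hasher injective on that bucket with constant probability (a birthday-bound argument), so the expected number of retries per bucket is O(1).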

You can also read about it on my blog.

But that was just an example. There are other data structures that benefit from having unique hashes. There is universal hashing (with its own guarantees), there is dynamic perfect hashing, etc. All of these require UniqueHash to work as advertised.
