Make Hash return the same value whatever the order of the elements of a tuple

I'm trying to implement a tuple struct with two elements so that MyTuple(a, b) and MyTuple(b, a) are considered to be the same when used in a HashSet.
I'm wondering how I can implement the Hash trait to return the same value whatever the order of the elements.

Are they orderable? You could hash the lesser value first. Make sure MyTuple(a, b) and MyTuple(b, a) compare equal under ==, too.

Playground.
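
For concreteness, here is a minimal sketch of the "hash the lesser value first" approach, assuming the element type is Ord + Hash (my own illustration, not necessarily what the linked playground shows):

use std::hash::{Hash, Hasher};

struct MyTuple<T>(T, T);

impl<T: Ord + Hash> Hash for MyTuple<T> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Hash the lesser element first, so MyTuple(a, b) and
        // MyTuple(b, a) feed identical data to the hasher.
        let (lo, hi) = if self.0 <= self.1 {
            (&self.0, &self.1)
        } else {
            (&self.1, &self.0)
        };
        lo.hash(state);
        hi.hash(state);
    }
}

// Eq must agree with Hash: MyTuple(a, b) == MyTuple(b, a).
impl<T: Ord> PartialEq for MyTuple<T> {
    fn eq(&self, other: &Self) -> bool {
        (self.0 == other.0 && self.1 == other.1)
            || (self.0 == other.1 && self.1 == other.0)
    }
}

impl<T: Ord> Eq for MyTuple<T> {}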

Alternatively you could use privacy and always store them ordered as a logical invariant, and then just derive these traits.
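
A sketch of that privacy-based alternative, where a constructor (the name new is mine) establishes the ordering invariant once:

// Fields are private; the constructor sorts them, so the derived
// PartialEq/Eq/Hash are order-insensitive from the caller's view.
#[derive(PartialEq, Eq, Hash)]
pub struct MyTuple<T: Ord>(T, T);

impl<T: Ord> MyTuple<T> {
    pub fn new(a: T, b: T) -> Self {
        if a <= b { MyTuple(a, b) } else { MyTuple(b, a) }
    }
}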


This might be hard (or impossible) to do. Is your tuple struct generic or is this about concrete types? In the latter case, you could sort the two values first, then hash in sorted order.

The fact that it's hard to make hashing ignore order is related to why it's hard to implement Hash for a HashMap itself. (The standard library just doesn't do it at all.) Note that in order to use one map as a key for another map, you can use BTreeMaps instead, since they rely on Ord and implement Ord themselves, too.

If your struct is generic, implementing Ord for a MyTuple struct in a way that order doesn't matter is probably way easier than implementing Hash; you can just use the "sort the two fields first" approach, since you do have an Ord bound already (and then use lexicographic order on the sorted tuple). So if Hash seems like too much of a hassle, or neither you nor anyone in this thread has a good idea how to do it, consider implementing Ord like this and using a BTreeMap :wink:
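
A sketch of what such an Ord implementation could look like for a generic struct (again my own illustration, assuming T: Ord):

use std::cmp::Ordering;

struct MyTuple<T>(T, T);

impl<T: Ord> MyTuple<T> {
    // Borrow the fields in sorted order.
    fn sorted(&self) -> (&T, &T) {
        if self.0 <= self.1 {
            (&self.0, &self.1)
        } else {
            (&self.1, &self.0)
        }
    }
}

impl<T: Ord> PartialEq for MyTuple<T> {
    fn eq(&self, other: &Self) -> bool {
        self.sorted() == other.sorted()
    }
}

impl<T: Ord> Eq for MyTuple<T> {}

impl<T: Ord> PartialOrd for MyTuple<T> {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl<T: Ord> Ord for MyTuple<T> {
    fn cmp(&self, other: &Self) -> Ordering {
        // Lexicographic order on the sorted pair.
        self.sorted().cmp(&other.sorted())
    }
}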


I think it's correct to hash the two fields independently and combine the resulting hashes with an xor.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct MyTuple(i32, i32);

fn calculate_hash<T: Hash>(t: &T) -> u64 {
    let mut s = DefaultHasher::new();
    t.hash(&mut s);
    s.finish()
}

impl Hash for MyTuple {
    fn hash<H: Hasher>(&self, state: &mut H) {
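        // xor is commutative, so MyTuple(a, b) and MyTuple(b, a) hash the same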
        let h = calculate_hash(&self.0) ^ calculate_hash(&self.1);
        state.write_u64(h);
    }
}

fn main() {
    dbg!(calculate_hash(&MyTuple(12, 99)));
    dbg!(calculate_hash(&MyTuple(99, 12)));
}

This violates the principle that Hash should write different things to Hasher when values are not equal. For instance, MyTuple(3, 3) would write the same thing as MyTuple(5, 5).


This also cheats the user of Hash into believing they choose the Hasher when in reality you chose hash_map::DefaultHasher for them.


If the two elements implement both Ord and Hash, then it's easy: you first call hash on the smaller element, then on the larger.

Edit: @quinedot already said essentially that.

For anyone interested, look e.g. at the discussion in this past (closed) PR that tried adding such an implementation.

Without Ord, I can do something that almost works correctly:

  1. Feed a and b to a strong, cryptographic Hasher.
  2. Sort the crypto hashes.
  3. Feed the sorted hashes to the given Hasher.

I say it "almost" works, because it relies on the crypto Hasher never having collisions, which is only an approximation, of course. For instance, imagine the user-provided Hasher is an even better crypto Hasher of their choice -- again they would be cheated about the guarantees.


Coincidentally, I made an example of exactly that in https://doc.rust-lang.org/std/hash/trait.BuildHasher.html#method.hash_one.


Where is this principle stated? You have to write the same things to the Hasher if values are equal, but I don't see why it would be a requirement to write different things when they are different.

You can ask for Hasher + Clone instead. DefaultHasher is Clone, and I suspect most others should be Clone, too.

How does it rely on a lack of collisions? Collisions are perfectly fine in such a scheme; you only run into trouble when a Hasher produces different results for the same input, but then it's not a hasher at all.

Why can't you use user-provided Hasher?

AFAICT, the documentation states an even stronger principle here:

Prefix collisions

Implementations of hash should ensure that the data they pass to the Hasher are prefix-free. That is, unequal values should cause two different sequences of values to be written, and neither of the two sequences should be a prefix of the other.

For example, the standard implementation of Hash for &str passes an extra 0xFF byte to the Hasher so that the values ("ab", "c") and ("a", "bc") hash differently.
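
You can observe the effect of that terminator with BuildHasher::hash_one (RandomState is HashMap's default build hasher); the assertion below holds with overwhelming probability:

use std::collections::hash_map::RandomState;
use std::hash::BuildHasher;

fn main() {
    let s = RandomState::new();
    // Without the 0xFF terminator, both tuples would feed the
    // bytes "abc" to the hasher; with it, the streams differ.
    assert_ne!(s.hash_one(("ab", "c")), s.hash_one(("a", "bc")));
}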


You can't. If you implement Hash::hash, you must introduce the same kind of generic H: Hasher parameter as in the trait declaration; there is no way to specify additional bounds.

The reason is the same guarantee I cited above: You should not pass the same data to the Hasher for values that aren't equal.

I assume this question refers to an approach using something like Hasher + Clone again (which doesn't work, as explained above)? If not, you'll probably need to specify more clearly how else you imagine a user-provided hasher could be used.


That prefix-free "should" is still just a suggestion though, not a requirement. Hash still works if you do a poor job of avoiding collisions, as long as it's consistent with Eq.


Well if we ignore the collision avoidance property / suggestion, there is an easy solution to the original problem that satisfies all the remaining requirements:

impl Hash for MyTuple {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // done
    }
}

This seems tongue in cheek, but I don't think it's reasonable to try to avoid some collisions and ignore others, unless you are targeting a specific input data distribution (and then you have to know what that distribution is).

Specifically, you might use @olivren's xor approach that I was commenting on, which technically satisfies all the hard requirements; but if your data often has a == b, then it's pretty much as good as the empty solution (see the demonstration below).

I.e. no matter how sophisticated the HashMap implementation is and how good a Hasher you supply to it, you'll get a lot of collisions and O(n) rather than O(1) HashMap behavior.
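
To make that concrete, reusing calculate_hash and the xor-based MyTuple from the example above (assuming those definitions are in scope):

fn main() {
    // H(x) ^ H(x) == 0 for any x, so every MyTuple(x, x)
    // writes the same u64 (zero) to the hasher.
    assert_eq!(
        calculate_hash(&MyTuple(3, 3)),
        calculate_hash(&MyTuple(5, 5)),
    );
}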


Right.

But even if it were possible somehow, there is another issue: by using a second Hasher inside Hash, even if it's the same type of Hasher, you've just doubled the collision probability that the Hasher provides. You now get a collision if either copy generates a collision.

This might be fine for some use cases, but not for others, so probably not a good idea for a general library.

Nope. The probability hasn't changed, because one collision is not enough. You either need two collisions, or you need a collision after mixing them up.

Of course you can imagine hashes where such a combination is not desirable, but if the hash is good, the probability shouldn't change.

The fact that we are dealing with Rust and not C++ is a bit of a bummer, though; the simple scheme with Hasher + Clone is indeed not possible in Rust.

Imagine you're computing hash(a, b) = H(H(a) | H(b)) (where | is the concatenation operator).

Let's say your two values are (1, 2) and (1, 3).

hash(1, 2) = hash(1, 3) if either one of the following is true:
a) H(2) = H(3), or
b) H(c) = H(d), where c ≠ d, c = H(1) | H(2), d = H(1) | H(3).

So, double the probability.

If you think of H as a random function, and you're hoping for a collision, you have two attempts: first you check if H(2) = H(3), and if that fails, you get another chance at H(c) = H(d).

Using xor instead of concatenation only makes it worse. It's better to pass both hashes (sorted) than to xor them, because xoring loses information: e.g. it makes hash(1, 1) and hash(2, 2) collide with 100% probability, regardless of H.

Right. But you are comparing it to the probability of collision for a "normal", "regular" hash, which doesn't produce the same values for MyTuple(a, b) and MyTuple(b, a)!

Any hash with this property would have an "increased probability" of collisions! Consider a "perfect random hash" (essentially: assign a random value to every bit sequence and remember it; it's the best you can ever do). If you only have, e.g., 20 or 30 bits of input, then there would be 2²⁰ or 2³⁰ outputs; but if you now add the requirement that Hash(MyTuple(a, b)) = Hash(MyTuple(b, a)), then there would be only 2¹⁹ or 2²⁹ distinct inputs, and the probability of collisions would be 2x that of a hash without such a requirement.

IOW: a 2x increase in the probability of collisions only happens when you either have a bad hash function or too few possible inputs, but this is a property of the original task, not a property of the H(a) ^ H(b) scheme.

XOR is also not a problem, just a red herring: any combination of two hash values loses information, no matter whether you use XOR or anything else. If the hashes are cryptographic hashes, then XOR is more than enough. What happens with weak hashes is an interesting question; my gut feeling is that calling the hash three times would be better than two calls plus a XOR, but it's hard to say, because it would depend on the properties of the hash function.

I simplified the problem a bit on purpose, to demonstrate how using two (or in this case, three) Hashers increases the probability. So I was really analyzing the ordered-pairs version.

But the same analysis applies to the version that matches the original problem with unordered pairs:

hash(a, b) = H(min(H(a), H(b)) | max(H(a), H(b)))

gives double the probability of collision compared to the one with just one hasher and Ord:

hash(a, b) = H(min(a, b) | max(a, b))

For instance, if you have a fully random hash, or some crypto hash, with 64-bit outputs, the former gives a 2^-63 probability bound, the latter 2^-64.

The one with xor:
hash(a, b) = H(H(a) xor H(b))
only gives a 2^-0 = 1 probability bound (example: (1, 1) and (2, 2)).

This makes no sense. Collision probability of a fully random hash doesn't depend on the size of the input space. It depends on the size of the output space.

You're taking two different elements of the input space, and asking what is the probability that the random hashes are same. The answer is: 1 divided by the size of the output space. So if the random hash is a 64-bit hash, the collision probability is 2^-64, regardless of how many bits the inputs have.

If you apply the random hash twice in a row, H(H(x)), the answer is: 2 divided by the size of the output space (approximately).
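
In symbols, with 64-bit outputs: for x1 ≠ x2, the probability that H(H(x1)) = H(H(x2)) is 2^-64 + (1 - 2^-64) * 2^-64 ≈ 2^-63, i.e. approximately 2 divided by 2^64.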
