I'm trying to implement a tuple struct with two elements so that `MyTuple(a, b)` and `MyTuple(b, a)` are considered to be the same when used in a `HashSet`.

I'm wondering how I can implement the `Hash` trait to return the same value whatever the order of the elements.

Are they orderable? You could hash the lesser value first. Make sure `MyTuple(a, b)` and `MyTuple(b, a)` compare equal under `==`, too.

Alternatively you could use privacy and always store them ordered as a logical invariant, and then just derive these traits.
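That invariant-based approach could look something like this (a minimal sketch; the constructor name and generic bound are illustrative, and in a real library the fields would be kept private to the module so the invariant can't be bypassed):

```rust
use std::collections::HashSet;

// Sketch of the "store ordered as an invariant" idea: the only constructor
// sorts the fields, so the derived Eq/Hash automatically ignore the order
// the caller used.
#[derive(PartialEq, Eq, Hash, Debug)]
struct MyTuple<T: Ord>(T, T);

impl<T: Ord> MyTuple<T> {
    fn new(a: T, b: T) -> Self {
        if a <= b { MyTuple(a, b) } else { MyTuple(b, a) }
    }
}

fn main() {
    let mut set = HashSet::new();
    set.insert(MyTuple::new(1, 2));
    set.insert(MyTuple::new(2, 1)); // same pair, stored as (1, 2)
    assert_eq!(set.len(), 1);
}
```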

This might be hard (or impossible) to do. Is your tuple struct generic or is this about concrete types? In the latter case, you could sort the two values first, then hash in sorted order.

The fact that it's hard to make hashing ignore order is related to why it's hard to implement `Hash` for a `HashMap` itself. (The standard library just doesn't do it at all.) Note that in order to use one `Map` as a key for another `Map`, you can use `BTreeMap`s instead, since they rely on `Ord` and implement `Ord` themselves, too.

If your struct is generic, implementing `Ord` for a `MyTuple` struct in a way that order doesn't matter is probably way easier than implementing `Hash`; you can just use the "sort the two fields first" approach, since you *do* have an `Ord` bound already (and then use lexicographic order on the sorted tuple). So if `Hash` seems like too much of a hassle, or no one in this thread has a good idea how to do it, consider implementing `Ord` like this and using a `BTreeMap`.
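A sketch of that idea, assuming a generic two-field tuple struct (all names illustrative): normalize the field order inside the comparison, then compare lexicographically:

```rust
use std::cmp::Ordering;
use std::collections::BTreeSet;

struct MyTuple<T>(T, T);

impl<T: Ord> MyTuple<T> {
    // View the two fields in sorted order, so comparisons ignore field order.
    fn sorted(&self) -> (&T, &T) {
        if self.0 <= self.1 { (&self.0, &self.1) } else { (&self.1, &self.0) }
    }
}

impl<T: Ord> PartialEq for MyTuple<T> {
    fn eq(&self, other: &Self) -> bool {
        self.sorted() == other.sorted()
    }
}
impl<T: Ord> Eq for MyTuple<T> {}
impl<T: Ord> PartialOrd for MyTuple<T> {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl<T: Ord> Ord for MyTuple<T> {
    // Lexicographic order on the sorted pair.
    fn cmp(&self, other: &Self) -> Ordering {
        self.sorted().cmp(&other.sorted())
    }
}

fn main() {
    let mut set = BTreeSet::new();
    set.insert(MyTuple(1, 2));
    set.insert(MyTuple(2, 1)); // compares equal to (1, 2), so not inserted
    assert_eq!(set.len(), 1);
}
```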

I think it's correct to hash the two fields independently and combine the resulting hashes with an xor.

```
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct MyTuple(i32, i32);

fn calculate_hash<T: Hash>(t: &T) -> u64 {
    let mut s = DefaultHasher::new();
    t.hash(&mut s);
    s.finish()
}

impl Hash for MyTuple {
    fn hash<H: Hasher>(&self, state: &mut H) {
        let h = calculate_hash(&self.0) ^ calculate_hash(&self.1);
        state.write_u64(h);
    }
}

fn main() {
    dbg!(calculate_hash(&MyTuple(12, 99)));
    dbg!(calculate_hash(&MyTuple(99, 12)));
}
```

This violates the principle that `Hash` should write different things to the `Hasher` when values are not equal. For instance, `MyTuple(3, 3)` would write the same thing as `MyTuple(5, 5)`.

This also cheats the user of `Hash` into believing *they* choose the `Hasher`, when in reality you chose `hash_map::DefaultHasher` for them.

If the two elements implement **both** `Ord` and `Hash`, then it's easy: you first call `hash` on the smaller element, then on the larger.
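A minimal sketch of that approach for a generic tuple struct (assuming a matching order-insensitive `Eq` implementation exists alongside it):

```rust
use std::hash::{Hash, Hasher};

struct MyTuple<T>(T, T);

// Hash the smaller field first, then the larger, so MyTuple(a, b) and
// MyTuple(b, a) write identical data to the caller's Hasher.
impl<T: Ord + Hash> Hash for MyTuple<T> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        let (min, max) = if self.0 <= self.1 {
            (&self.0, &self.1)
        } else {
            (&self.1, &self.0)
        };
        min.hash(state);
        max.hash(state);
    }
}

fn main() {
    use std::collections::hash_map::DefaultHasher;
    fn h<T: Hash>(t: &T) -> u64 {
        let mut s = DefaultHasher::new();
        t.hash(&mut s);
        s.finish()
    }
    assert_eq!(h(&MyTuple(12, 99)), h(&MyTuple(99, 12)));
}
```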

Edit: @quinedot already said essentially that.

For anyone interested, look e.g. at the discussion in this past (closed) PR that tried adding such an implementation.

Without `Ord`, I can do something that **almost** works correctly:

- Feed a and b to a strong, cryptographic Hasher.
- Sort the crypto hashes.
- Feed the sorted hashes to the given Hasher.
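Those three steps could be sketched like this, with `DefaultHasher` standing in for the "strong, cryptographic Hasher" (it is *not* cryptographic; it just keeps the sketch self-contained):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct MyTuple<T>(T, T);

// Stand-in for the "strong, cryptographic Hasher" in the steps above.
// DefaultHasher is NOT cryptographic.
fn inner_hash<T: Hash>(t: &T) -> u64 {
    let mut s = DefaultHasher::new();
    t.hash(&mut s);
    s.finish()
}

impl<T: Hash> Hash for MyTuple<T> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        let (a, b) = (inner_hash(&self.0), inner_hash(&self.1));
        // Sort the inner hashes, then feed them to the given Hasher.
        let (lo, hi) = if a <= b { (a, b) } else { (b, a) };
        state.write_u64(lo);
        state.write_u64(hi);
    }
}

fn main() {
    fn h<T: Hash>(t: &T) -> u64 {
        let mut s = DefaultHasher::new();
        t.hash(&mut s);
        s.finish()
    }
    assert_eq!(h(&MyTuple("a", "b")), h(&MyTuple("b", "a")));
}
```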

I say it "almost" works, because it relies on the crypto Hasher never having collisions, which is only an approximation, of course. For instance, imagine the user-provided Hasher is an even better crypto Hasher of their choice -- again they would be cheated about the guarantees.

Coincidentally, I made an example of exactly that in https://doc.rust-lang.org/std/hash/trait.BuildHasher.html#method.hash_one.

Where is this principle stated? You have to write the same things to the `Hasher` if values are equal, but I don't see why it would be a requirement to write different things when they are different.

You can ask for `Hasher` + `Clone` instead. `DefaultHasher` is `Clone`, and I suspect most others should be `Clone`, too.

How does it rely on lack of collisions? Collisions are perfectly fine in such a scheme; you may only have trouble when the Hasher produces different results for the same input, but then it's not a hasher at all.

Why can't you use a user-provided Hasher?

AFAICT, it states an even stronger principle here:

> ## Prefix collisions
>
> Implementations of `hash` should ensure that the data they pass to the `Hasher` are prefix-free. That is, unequal values should cause two different sequences of values to be written, and neither of the two sequences should be a prefix of the other.
>
> For example, the standard implementation of `Hash` for `&str` passes an extra `0xFF` byte to the `Hasher` so that the values `("ab", "c")` and `("a", "bc")` hash differently.
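The separator's effect can be observed directly; a small check (the inequality is only overwhelmingly likely, not guaranteed, since any finite hash can collide):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn h<T: Hash>(t: &T) -> u64 {
    let mut s = DefaultHasher::new();
    t.hash(&mut s);
    s.finish()
}

fn main() {
    // ("ab", "c") writes b"ab\xff" then b"c\xff", while ("a", "bc") writes
    // b"a\xff" then b"bc\xff" -- different byte sequences thanks to the
    // 0xFF separator, so they (almost surely) hash differently.
    assert_ne!(h(&("ab", "c")), h(&("a", "bc")));
}
```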

> > steffahn: This also cheats the user of `Hash` into believing *they* choose the `Hasher`, when in reality you chose `hash_map::DefaultHasher` for them.
>
> You can ask for `Hasher` + `Clone` instead.

You can't. If you implement `Hash::hash`, you must introduce the same kind of generic `H: Hasher` parameter as in the trait declaration; there is no way to specify additional bounds.

> tczajka: I say it "almost" works, because it relies on the crypto Hasher never having collisions, which is only an approximation, of course.

> How does it rely on lack of collisions? Collisions are perfectly fine in such a scheme; you may only have trouble when the Hasher produces different results for the same input, but then it's not a hasher at all.

The reason is the same guarantee I cited above: you should not pass the same data to the `Hasher` for values that aren't equal.

> tczajka: For instance, imagine the user-provided Hasher is an even better crypto Hasher of their choice -- again they would be cheated about the guarantees.

> Why can't you use a user-provided Hasher?

I assume this question assumes an approach using something like `Hasher + Clone` again (which doesn't work, as explained above)? If not, you'll probably need to specify more clearly how else you imagine a user-provided hasher could be used.

That prefix-free "should" is still just a suggestion though, not a requirement. `Hash` still works if you do a poor job of avoiding collisions, as long as it's consistent with `Eq`.

Well if we ignore the collision avoidance property / suggestion, there is an easy solution to the original problem that satisfies all the remaining requirements:

```
impl Hash for MyTuple {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // done
    }
}
```

This seems tongue in cheek, but I don't think it's reasonable to try to avoid some collisions and ignore others, **unless** you are targeting a specific input data distribution (and then you have to know what that distribution is).

For instance, specifically, you might do @olivren's xor thing that I was commenting on, which technically satisfies all the hard requirements, but if your data often has a == b, then that's pretty much as good as the empty solution.

I.e. no matter how sophisticated the `HashMap` implementation is and how good a `Hasher` you supply to it, you'll get a lot of collisions and O(n) rather than O(1) `HashMap` behavior.

> VorfeedCanal:
>
> > steffahn: This also cheats the user of `Hash` into believing *they* choose the `Hasher`, when in reality you chose `hash_map::DefaultHasher` for them.
>
> You can ask for `Hasher` + `Clone` instead.

> You can't. If you implement `Hash::hash`, you must introduce the same kind of generic `H: Hasher` parameter as in the trait declaration; there is no way to specify additional bounds.

Right.

But even if it was possible somehow, there is another issue: by using a second `Hasher` inside `Hash`, even if it's the same type of `Hasher`, you've just doubled the collision probability that the `Hasher` provides. You now get a collision if either copy generates a collision.

This might be fine for some use cases, but not for others, so probably not a good idea for a general library.

> You now get a collision if either copy generates a collision.

Nope. The probability hasn't changed, because one collision is not enough: you either need two, or you need a collision after mixing them up.

Of course you can imagine hashes where such a combination is not desirable, but if the hash is good, the probability shouldn't change.

The fact that we are dealing with Rust and not C++ is a bit of a bummer, though; the simple scheme with `Hash + Clone` is indeed not possible in Rust.

> Nope. The probability hasn't changed, because one collision is not enough: you either need two, or you need a collision after mixing them up.

Imagine you're computing hash(a, b) = H(H(a) | H(b)) (where | is the concatenation operator).

Let's say your two values are (1, 2) and (1, 3). Then hash(1, 2) = hash(1, 3) if either one of the following is true:

a) H(2) = H(3), or

b) H(c) = H(d), where c ≠ d, c = H(1) | H(2), d = H(1) | H(3).

So, double the probability.

If you think of H as a random function and you're hoping for a collision, you have two attempts: first you check whether H(2) = H(3), and if that fails, you get another chance at H(c) = H(d).

Using xor instead of concatenation only makes it worse. Better to pass both hashes (sorted) than to xor them, because xoring loses information: e.g. it makes hash(1, 1) and hash(2, 2) collide with 100% probability, regardless of H.
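The degenerate xor case is easy to demonstrate; a small check using `DefaultHasher` as the inner hash H:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn h<T: Hash>(t: &T) -> u64 {
    let mut s = DefaultHasher::new();
    t.hash(&mut s);
    s.finish()
}

// xor-combined hash of an unordered pair: H(a) ^ H(b)
fn xor_hash(a: i32, b: i32) -> u64 {
    h(&a) ^ h(&b)
}

fn main() {
    // Whenever a == b, the two identical inner hashes cancel out to 0,
    // so every such pair collides, regardless of H.
    assert_eq!(xor_hash(1, 1), 0);
    assert_eq!(xor_hash(1, 1), xor_hash(2, 2));
}
```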

Right. But you are comparing it to the probability of collision for just a “normal”, “regular” hash which **doesn't** produce the same values for MyTuple(a, b) and MyTuple(b, a)!

**Any** hash with this property would have an “increased probability” of collisions! Consider a “perfect random hash” (essentially: assign a random value to every bit sequence and remember it; it's the best you can ever do). If you only have, e.g., 20 or 30 bits of input, then there would be 2²⁰ or 2³⁰ outputs, but if you now add the requirement that Hash(MyTuple(a, b)) = Hash(MyTuple(b, a)), then there would be 2¹⁹ or 2²⁹ inputs, and the probability of collisions would be 2x that of the hash without such a requirement.

IOW: the 2x increase in the probability of collisions only happens when you either have a bad hash function or too few possible inputs, but this is a property of the **original task**, not a property of the H(a) ^ H(b) scheme.

XOR is also not a problem, only a red herring: any combination of two hash values loses information; it doesn't matter whether you use XOR or anything else. If the hashes are cryptographic, then XOR is more than enough. What happens with weak hashes is an interesting question; my gut feeling is that calling the hash three times would be better than two calls plus XOR, but it's hard to say, because it would depend on the properties of the hash function.

> But you are comparing it to the probability of collision for just a “normal”, “regular” hash which **doesn't** produce the same values for MyTuple(a, b) and MyTuple(b, a)!

I simplified the problem a bit on purpose to demonstrate how using two (or in this case, three) Hashers increases probability. So I was really analyzing the ordered pairs version.

But the same analysis applies to the version that matches the original problem with unordered pairs:

hash(a, b) = H(min(H(a), H(b)) | max(H(a), H(b)))

gives double the probability of collision compared to the version with just one hasher and Ord:

hash(a, b) = H(min(a, b) | max(a, b))

For instance, if you have a fully random hash, or some crypto hash with 64-bit outputs, the former gives a 2^-63 probability bound, the latter 2^-64.

The one with xor:

hash(a, b) = H(H(a) xor H(b))

only gives a 2^-0 probability bound (example: (1, 1) and (2, 2)).

> Consider a “perfect random hash” (essentially: assign a random value to every bit sequence and remember it; it's the best you can ever do). If you only have, e.g., 20 or 30 bits of input, then there would be 2²⁰ or 2³⁰ outputs, but if you now add the requirement that Hash(MyTuple(a, b)) = Hash(MyTuple(b, a)), then there would be 2¹⁹ or 2²⁹ inputs, and the probability of collisions would be 2x that of the hash without such a requirement.

This makes no sense. The collision probability of a fully random hash doesn't depend on the size of the input space; it depends on the size of the output space.

You're taking **two** different elements of the input space and asking what the probability is that their random hashes are the same. The answer is: 1 divided by the size of the output space. So if the random hash is a 64-bit hash, the collision probability is 2^-64, regardless of how many bits the inputs have.

If you apply the random hash twice in a row, H(H(x)), the answer is: 2 divided by the size of the output space (approximately).