Is this custom Eq/Hash impl "correct"?

#1
  1. This is a followup to Hashing a struct that is all ref

  2. I have a number of structs: Cat, Dog, Bird, that do NOT have Eq/Hash implemented on them.

  3. I want to be able to do Eq/Hash on an “Animal” struct, by comparing the value of the Ref AS POINTERS (rather than by value).

  4. Does the following impl satisfy the above? Are there any gotchas I need to watch out for?

use std::hash::Hash;
use std::hash::Hasher;


// note: none of these have Eq/Hash defined
pub struct Cat{}
pub struct Dog{}
pub struct Bird{}


// for this class, I want to define Eq/Hash based
// on the "pointer value" of the ref
pub enum Animal<'a> {
    Cat(&'a Cat),
    Dog(&'a Dog),
    Bird(&'a Bird),
}


impl<'a> PartialEq for Animal<'a> {
    fn eq(&self, other: &Animal<'a>) -> bool {
        match (self, other) {
            (Animal::Cat(lhs), Animal::Cat(rhs)) => (*lhs as *const Cat) == (*rhs as *const Cat) ,
            (Animal::Dog(lhs), Animal::Dog(rhs)) => (*lhs as *const Dog) == (*rhs as *const Dog) ,
            (Animal::Bird(lhs), Animal::Bird(rhs)) => (*lhs as *const Bird) == (*rhs as *const Bird) ,
            _ => false
        }
    }
}

impl<'a> Hash for Animal<'a> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        match self {
            Animal::Cat(v) => {
                state.write_i32(1);
                state.write_usize(v as *const Cat as usize);
                state.finish(); },
            Animal::Dog(v) => {
                state.write_i32(2);
                state.write_usize(v as *const Dog as usize);
                state.finish(); },
            Animal::Bird(v) => {
                state.write_i32(3);
                state.write_usize(v as *const Dog as usize);
                state.finish(); },
        }
    }
}
1 Like
#2

Looking at the docs, it’s seen that you usually don’t call .finish() in the method itself… And why add the extra .write_i32()? Let the pointer speak for itself and you don’t have to worry about the specific enum type it is in the hash.
Otherwise number 3 is fine…

1 Like
#3

@OptimisticPeach : Good points. It is not possible (under normal circumstances) for a single pointer address to be referred to as both a Cat AND a DOG.

Is this what you have in mind?


impl<'a> Hash for Animal<'a> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        match self {
            Animal::Cat(v) => { state.write_usize(v as *const Cat as usize); },
            Animal::Dog(v) => { state.write_usize(v as *const Dog as usize); },
            Animal::Bird(v) => { state.write_usize(v as *const Dog as usize); },
        }
    }
}
1 Like
#4

Yes that’s exactly what I was referring to
Edit: well perhaps not with the typo lol

1 Like
#5

I would not try to be clever by assuming a &Cat and &Dog cannot refer to the same address. This is not necessarily true when Cat or Dog is a zero-sized type, as in the example. In your real code I imagine none of these are zero-sized, but I would still stick with the extra write_i32 in your original draft over the possibility of a future addition making your code subtly wrong.

2 Likes
#6

@trentj: LOL, I think you’re right. In theory, it’s possible to have zero-sized classes, and thus we should add an i32 “tag”

#7

Given that you are casting Bird as Dog and it’s even compiling at all should raise a red flag.

I’m pretty sure this is unsound strictly because unless the reference has a static lifetime, you have no guarantee that a memory address is never reused for a different type.

1 Like
#8
  1. You’re right, the Bird line should be Bird, not Dog

  2. I’m not convinced there is a lifetime issue here at all. If we are using this in a HashSet / HashMap, the borrow checker is going to ensure that the lifetime of the inserted keys is atleast as long as the lifetime of the HashSet/HashMap.

I don’t see any possibility of a “dangling pointer” to an object that has been freed - though if you could construct a counter example, I would greatly appreciate it.

#9

Well, it doesn’t seem to compile after all.

error[E0606]: casting `&&Cat` as `*const Cat` is invalid
  --> src/main.rs:26:51
   |
26 |             Animal::Cat(v) => { state.write_usize(v as *const Cat as usize); },
   |                                                   ^^^^^^^^^^^^^^^

But with a few changes: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=030d36a83eadfaf64502552ecef7ceef

Cat Hash is f491762009bba41f!
Dog Hash is f491762009bba41f!

Looks like it fails the most basic test. :neutral_face:

1 Like
#10

Let me be clear, the example “fails” because both references use the same address after Cat is no longer needed. The pointer is valid over the lifetime of Cat, and therefore the hash for it is also valid over the same lifetime. But it is unsound to reuse the hash outside of the lifetime, which is trivial (and perfectly safe) as you can see from the example.

#11

This is only because they are zero sized types, if you add non-zero sized fields to them, then this problem disappears. This has already been noted here

It is concerning, and I think this is one reason why this isn’t the default implementation.

1 Like
#12
  1. As stated in

I agree that write_i32 should be added.

  1. @parasyte The “counter example” can be fixed by adding a write_i32, as noted in https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=030d36a83eadfaf64502552ecef7ceef … and as stated by @KrishnaSannasi and @trentj above.

  2. This “counter example” is caused by 0-sized structs, not by “dangling pointers.”

#13

The previous specific example uses zero-sized structs, yes. But other counter examples can be produced: https://play.rust-lang.org/?version=stable&mode=release&edition=2018&gist=c70e14158c007e3f4cef40dc697550f0 (Note: I never claimed the issue was related to dangling pointers. The issue is more related to using an index in place of a pointer. Which is well-documented to workaround lifetime issues at the expense of running afoul with the ABA problem.)

Adding extra entropy to the hash helps make it collision resistant, just not collision-proof.

2 Likes
#14
  1. This is a very neat/impressive example where by setting up alpha/beta to have the same stack representation, you can assure that Cat & Dog are allocated at the same address (across different calls).

  2. For my use case (which I admit was never stated in the problem statement), I’m stuffing these objects into a HashSet / HashMap. I don’t think this attack works on that.

Because although you create two objects with the same hash, these objects do not co-exist in time. One is dropped before the other is created, so if they can’t be inserted into the same HashSet.

1 Like
#15

Yep, this is perfectly reasonable and guaranteed safe by the borrow rules.

For this specific use case, adding the extra entropy probably hurts more than it helps. You’re already statically guaranteeing that each item has a unique address in memory (and guaranteeing that they each outlive the HashSet). The extra entropy could lead to a birthday problem where different addresses (for different data types) hash to the same value. In addition to the extra CPU used to add entropy needlessly.

1 Like
#16

PS: I found this discussion really enlightening. The back & forth examples really helped me understand precisely what conditions we need for this to work / not work.

2 Likes
#17

This is possible, but not relevant. It’s just as possible that two different addresses hash to the same value alone, but adding a flag to distinguish the types gives them different hashes. Since this scenario is equally likely to yours, collisions are equally likely to happen whether or not you add the extra i32. (Unless the hash function is pathological, but again, the hash function can be pathological either way, so this isn’t an argument for or against adding it.)

The only concrete reason to avoid adding a discriminant is to avoid the cost of calculating it. But it seems unlikely that this would make a measurable difference. I’d rather have code that handles ZSTs correctly. YMMV.

By the way, speaking of discriminants reminds me that std::mem::discriminant exists, and Discriminant<T> implements Hash. So you don’t have to write 1, 2, 3 into your program by hand; you can just write this:

impl<'a> Hash for Animal<'a> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        std::mem::discriminant(self).hash(state);
        match *self {
            Animal::Cat(v) => { state.write_usize(v as *const _ as usize); },
            Animal::Dog(v) => { state.write_usize(v as *const _ as usize); },
            Animal::Bird(v) => { state.write_usize(v as *const _ as usize); },
        }
    }
}
1 Like
#18

Given the discussion, I’d almost consider the following as well:

#[derive(Hash, PartialEq)]
pub enum Animal<'a> {
    Cat(usize),
    Dog(usize),
    Bird(usize),
    #[doc(hidden)]
    _Phantom(&'a ()),
}

The usize payloads would be the ptr value, and then add accessors that coerce that value back to the appropriate reference.

:crazy_face:

1 Like
#19

There is more to say about hash collisions. An obvious place to start is the size of the hash compared to the size of the input. E.g. a 32-bit hash over 64-bit virtual addresses would make collisions very common. A 64-bit hash would mostly eliminate collisions, but then you would want to just compare the addresses directly instead of hashing.

#20

Something that nobody has mentioned is that it’s perfectly valid to write a noop Hash implementation that just returns the same thing every time. Users of the hash are not allowed to trust the hash and are supposed to check Eq if equality matters.

Similarly, you shouldn’t have to worry about reusing locations, as it’s impossible by definition to have two alive in the same position that shouldn’t be Eq (unless they’re ZST).

Eq should probably use ptr::eq rather than _ as *const _ as usize == _ as *const _ as usize. (And in the future, ptr::hash.)

If you care about identity for Eq though, you need to make sure all of the things you point to aren’t ZST, otherwise Eq will be true incorrectly.

(I wonder if it’s possible to write a eq that optimizes to just compare the ref and not look at the discriminant at all, since if the types aren’t ZSTs, they would have to be at different addresses if you have a ref to both. The current probably optimizes to comparing the discriminant and the ref as a 2xusize, which, because SIMD, might not even be slower :sweat_smile:)

1 Like