Hash prefix collisions

Greetings!

There's one thing I don't get. I saw this in the standard library:

    impl Hash for str {
        #[inline]
        fn hash<H: Hasher>(&self, state: &mut H) {
            state.write(self.as_bytes());
            state.write_u8(0xff) // here
        }
    }

From the official doc:

For example, the standard implementation of Hash for &str passes an extra 0xFF byte to the Hasher so that the values ("ab", "c") and ("a", "bc") hash differently.

Okay, maybe this makes sense, but I can't reproduce the collision. I tried code like this:

    use hashbrown::hash_map::DefaultHashBuilder; // hashbrown's builder, an alias for ahash's RandomState (see replies below)
    use std::hash::{BuildHasher, Hasher};

    fn calc_hash1(builder: &DefaultHashBuilder, data: &(&str, &str)) -> u64 {
        let mut s = builder.build_hasher();
        s.write(data.0.as_bytes());
        s.write(data.1.as_bytes());
        s.finish()
    }

    fn main() {
        let hash_builder = DefaultHashBuilder::default();

        let data0 = ("ab", "c");
        let data1 = ("a", "bc");

        println!("{}", calc_hash1(&hash_builder, &data0));
        println!("{}", calc_hash1(&hash_builder, &data1));
    }

and got different hashes.

What would code that actually produces such a collision look like?

Whether you get a collision in this case or not depends on the implementation of Hasher::write.

I get a collision with DefaultHasher:
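Something along these lines reproduces it (a minimal sketch, not the original playground snippet); std's DefaultHasher (currently SipHash-1-3) only looks at the concatenated byte stream, so both lines print the same hash:

    use std::collections::hash_map::DefaultHasher;
    use std::hash::Hasher;

    fn calc_hash(data: &(&str, &str)) -> u64 {
        let mut s = DefaultHasher::new();
        s.write(data.0.as_bytes());
        s.write(data.1.as_bytes());
        s.finish()
    }

    fn main() {
        // Both calls feed the hasher the byte stream b"abc", so the
        // printed values are identical.
        println!("{}", calc_hash(&("ab", "c")));
        println!("{}", calc_hash(&("a", "bc")));
    }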

The prefix requirement is a bit ambiguous. E.g. are two calls to Hasher::write_u32(1) the same as one call to Hasher::write_u64(0x100000001)? It's not well specified.
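For example (a quick sketch with std's DefaultHasher, added for illustration):

    use std::collections::hash_map::DefaultHasher;
    use std::hash::Hasher;

    fn main() {
        let mut a = DefaultHasher::new();
        a.write_u32(1);
        a.write_u32(1);

        let mut b = DefaultHasher::new();
        b.write_u64(0x1_0000_0001);

        // Whether these two agree is exactly the unspecified part: the default
        // write_u32/write_u64 methods forward native-endian bytes to write(),
        // but a Hasher is free to override them and do something else.
        println!("{} {}", a.finish(), b.finish());
    }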

It looks like you aren't using std types, are you? There is no DefaultHashBuilder in std. The closest I could find is hashbrown::DefaultHashBuilder, which is just an alias for RandomState. This means that if you build two hashers, they will (with overwhelming probability) hash the very same value into different hashes. You are likely running into this.

But why do you expect them to be different? The implementation you're linking to is impl Hash for str, but you aren't using that implementation; you hash str.as_bytes() directly, so the 0xff "padding byte" never gets written.

The OP is trying to replicate a case where hashing ("ab", "c") collides with ("a", "bc"). For that purpose they are intentionally hashing only the raw bytes, to have exact control over the sequence of bytes written to the hasher, and they expected the same hash but got two different ones.

Ah. Sorry, misread as "can't do ... and get different".

No: "a particular instance RandomState will create the same instances of Hasher, but the hashers created by two different RandomState instances are unlikely to produce the same result for the same values" (quoted word for word from RandomState's documentation). The OP looks to be passing a reference to the same hash_builder though, so they should get a collision. It would be helpful if the OP shared a playground link that actually shows no collision taking place; otherwise we're just speculating.

You are right. The problem lies elsewhere, namely in <AHasher as Hasher>::write():

    fn write(&mut self, input: &[u8]) {
        let mut data = input;
        let length = data.len();
        self.add_in_length(length as u64);
        // etc.
    }

In other words, the write() method factors the length of the byte slice into the hash, so ([u8; 2], [u8; 1]) and ([u8; 1], [u8; 2]) hash differently even though they concatenate to the same bytes.

This is, in fact, currently redundant, since impl Hash for [T] (in current Rust) hashes the length too:

    impl<T: Hash> Hash for [T] {
        #[inline]
        fn hash<H: Hasher>(&self, state: &mut H) {
            self.len().hash(state);
            Hash::hash_slice(self, state)
        }
    }

Hmm, it's not clear to me that this isn't a bug.

It feels to me like the wrong place for this -- it's not Hasher::write that should be doing this, but rather [T]::hash. I wouldn't expect that I'd need to give exactly the same sequence of write calls to the Hasher, so long as I give the same bytes.

And if nothing else, it's doing more work than it needs to. Hash is already adding the length for slices, so doing it again here is a performance loss for no good reason: https://doc.rust-lang.org/1.58.0/src/core/hash/mod.rs.html#754-760

I'll go open a bug to clarify expectations. https://doc.rust-lang.org/std/hash/trait.Hash.html#method.hash_slice does a good job of talking about the "whole unit" behaviour, but https://doc.rust-lang.org/std/hash/trait.Hasher.html#tymethod.write is silent on the question.

EDIT: https://github.com/rust-lang/rust/issues/94026

I don't think it does. It just says its behavior is "left unspecified", but it would be good to specify the requirements. At the very least you'd want the property that calling hash_slice on equal slices results in the same calls to the Hasher; currently it doesn't even say that.

Even for Hash::hash it currently doesn't really say that. Instead it says that the final hash must be the same for equal keys, but computing the hash is the Hasher's job. The Hash documentation should just talk about what Hash implementations must do, not about what a Hasher must do.

I have added a comment to the ticket proposing the requirements for Hash::hash.

For hash_slice I think it would be good to have a decision one way or the other. Either:

  1. We treat hash_slice(&[1]); hash_slice(&[2]); as equivalent to hash_slice(&[1, 2]); (see the sketch after this list), in which case it should be stated as a requirement on Hash implementations. Or:
  2. We treat them as different things, in which case the default implementation needs to change.
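
To make option 1 concrete, here is a small sketch of the question (added for illustration, using std's DefaultHasher); u8's hash_slice forwards the raw bytes to the hasher in a single write call, so a purely byte-streaming hasher cannot tell the two sequences apart:

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    fn main() {
        // Two separate hash_slice calls covering [1] and then [2]...
        let mut a = DefaultHasher::new();
        u8::hash_slice(&[1u8], &mut a);
        u8::hash_slice(&[2u8], &mut a);

        // ...versus one hash_slice call covering [1, 2].
        let mut b = DefaultHasher::new();
        u8::hash_slice(&[1u8, 2u8], &mut b);

        // DefaultHasher (SipHash) only sees the byte stream [1, 2] in both
        // cases, so these print the same value; a hasher that mixes in the
        // length of every write call (like ahash's AHasher above) will
        // generally treat them differently.
        println!("{} {}", a.finish(), b.finish());
    }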

Hm, OK, so it's based on the fact that 0xFF is not a valid UTF-8 byte and can't appear in a String without unsafe (and unsafe is unsafe)?

Yes, that's the idea behind the str Hash implementation. It makes the bytes written to the Hasher a prefix-free encoding of strings; using the 0xff byte as a delimiter achieves that since it never appears in valid UTF-8. The prefix-free encoding property is necessary for good randomized hashers (such as the default SipHasher) that seek to provide a low probability of collision for all keys.
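
To see the prefix-free encoding concretely, here is a sketch (added for illustration) with a hypothetical byte-recording Hasher; running both tuples through the real Hash impls shows the 0xff terminator after each string:

    use std::hash::{Hash, Hasher};

    // Hypothetical helper for illustration: a "hasher" that just records the
    // bytes it is asked to write.
    struct Recorder(Vec<u8>);

    impl Hasher for Recorder {
        fn write(&mut self, bytes: &[u8]) {
            self.0.extend_from_slice(bytes);
        }
        fn finish(&self) -> u64 {
            0 // irrelevant here; we only care about the recorded bytes
        }
    }

    fn recorded(value: &impl Hash) -> Vec<u8> {
        let mut r = Recorder(Vec::new());
        value.hash(&mut r);
        r.0
    }

    fn main() {
        // Prints [97, 98, 255, 99, 255] vs [97, 255, 98, 99, 255]:
        // the 0xff after each str is what keeps the two streams distinct.
        println!("{:?}", recorded(&("ab", "c")));
        println!("{:?}", recorded(&("a", "bc")));
    }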

However, this assumes that:

    hasher.write(&[1, 2, 3]);
    hasher.write_u8(0xff);

is equivalent to:

    hasher.write(&[1, 2, 3, 0xff]);

which is not specified in the docs.

The ticket referenced above by @scottmcm seeks to clarify this.
