Hash prefix collisions

Greetings!

There's one thing I don't get. I saw this in the standard library:

    impl Hash for str {
        #[inline]
        fn hash<H: Hasher>(&self, state: &mut H) {
            state.write(self.as_bytes());
            state.write_u8(0xff) // here
        }
    }

From the official doc:

For example, the standard implementation of Hash for &str passes an extra 0xFF byte to the Hasher so that the values ("ab", "c") and ("a", "bc") hash differently.

Okay, maybe this makes sense, but I can't reproduce the collision. I tried code like this:

    use hashbrown::hash_map::DefaultHashBuilder; // hashbrown's builder, an alias for ahash's RandomState (see replies below)
    use std::hash::{BuildHasher, Hasher};

    fn calc_hash1(builder: &DefaultHashBuilder, data: &(&str, &str)) -> u64 {
        let mut s = builder.build_hasher();
        s.write(data.0.as_bytes());
        s.write(data.1.as_bytes());
        s.finish()
    }

    fn main() {
        let hash_builder = DefaultHashBuilder::default();

        let data0 = ("ab", "c");
        let data1 = ("a", "bc");

        println!("{}", calc_hash1(&hash_builder, &data0));
        println!("{}", calc_hash1(&hash_builder, &data1));
    }

and got different hashes.

What would code that actually produces such a collision look like?

Whether you get a collision in this case or not depends on the implementation of Hasher::write.

I get a collision with DefaultHasher:
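Something along these lines reproduces it (a minimal sketch, not the original playground snippet); std's DefaultHasher (currently SipHash-1-3) only looks at the concatenated byte stream, so both lines print the same hash:

    use std::collections::hash_map::DefaultHasher;
    use std::hash::Hasher;

    fn calc_hash(data: &(&str, &str)) -> u64 {
        let mut s = DefaultHasher::new();
        s.write(data.0.as_bytes());
        s.write(data.1.as_bytes());
        s.finish()
    }

    fn main() {
        // Both calls feed the hasher the byte stream b"abc", so the
        // printed values are identical.
        println!("{}", calc_hash(&("ab", "c")));
        println!("{}", calc_hash(&("a", "bc")));
    }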

The prefix requirement is a bit ambiguous. E.g. are two calls to Hasher::write_u32(1) the same as one call to Hasher::write_u64(0x100000001)? It's not well specified.
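For example (a quick sketch with std's DefaultHasher, added for illustration):

    use std::collections::hash_map::DefaultHasher;
    use std::hash::Hasher;

    fn main() {
        let mut a = DefaultHasher::new();
        a.write_u32(1);
        a.write_u32(1);

        let mut b = DefaultHasher::new();
        b.write_u64(0x1_0000_0001);

        // Whether these two agree is exactly the unspecified part: the default
        // write_u32/write_u64 methods forward native-endian bytes to write(),
        // but a Hasher is free to override them and do something else.
        println!("{} {}", a.finish(), b.finish());
    }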

It looks like you aren't using std types, are you? There is no DefaultHashBuilder in std. The closest I could find is hashbrown::DefaultHashBuilder, which is just an alias for RandomState. This means that if you build two hashers, they will (with overwhelming probability) hash the very same value into different hashes. You are likely running into this.

But why do you expect them to be different? The implementation you're linking to is impl Hash for str, but you aren't using that implementation; you hash str.as_bytes() directly, so the 0xff "padding byte" never gets written.

The OP is trying to replicate a case where hashing ("ab", "c") collides with ("a", "bc"). For that purpose they are intentionally hashing only the raw bytes, to have exact control over the sequence of bytes written to the hasher, and they expected the same hash but got two different ones.

Ah. Sorry, misread as "can't do ... and get different".

No: "a particular instance RandomState will create the same instances of Hasher, but the hashers created by two different RandomState instances are unlikely to produce the same result for the same values" (quoted word for word from RandomState's documentation). The OP looks to be passing a reference to the same hash_builder though, so they should get a collision. It would be helpful if the OP shared a playground link that actually shows no collision taking place; otherwise we're just speculating.

You are right. The problem lies elsewhere, namely in <AHasher as Hasher>::write():

    fn write(&mut self, input: &[u8]) {
        let mut data = input;
        let length = data.len();
        self.add_in_length(length as u64);
        // etc.
    }

In other words, the write() method factors the length of the byte slice into the hash, so ([u8; 2], [u8; 1]) and ([u8; 1], [u8; 2]) hash differently even though they concatenate to the same bytes.

This is, in fact, currently redundant, since impl Hash for [T] (in current Rust) hashes the length too:

    impl<T: Hash> Hash for [T] {
        #[inline]
        fn hash<H: Hasher>(&self, state: &mut H) {
            self.len().hash(state);
            Hash::hash_slice(self, state)
        }
    }

Hmm, it's not clear to me that this isn't a bug.

It feels to me like the wrong place for this -- it's not Hasher::write that should be doing this, but rather [T]::hash. I wouldn't expect that I'd need to give exactly the same sequence of write calls to the Hasher, so long as I give the same bytes.

And if nothing else, it's doing more work than it needs to. Hash is already adding the length for slices, so doing it again here is a performance loss for no good reason: https://doc.rust-lang.org/1.58.0/src/core/hash/mod.rs.html#754-760

I'll go open a bug to clarify expectations. https://doc.rust-lang.org/std/hash/trait.Hash.html#method.hash_slice does a good job of talking about the "whole unit" behaviour, but https://doc.rust-lang.org/std/hash/trait.Hasher.html#tymethod.write is silent on the question.

EDIT: https://github.com/rust-lang/rust/issues/94026

I don't think it does. It just says its behavior is "left unspecified", but it would be good to specify the requirements. At the very least you'd want the property that calling hash_slice on equal slices results in the same calls to the Hasher; currently it doesn't even say that.

Even for Hash::hash it currently doesn't really say that. Instead it says that the final hash must be the same for equal keys, but computing the hash is the Hasher's job. The Hash documentation should just talk about what Hash implementations must do, not about what a Hasher must do.

I have added a comment to the ticket proposing the requirements for Hash::hash.

For hash_slice I think it would be good to have a decision one way or the other. Either:

  1. We treat hash_slice(&[1]); hash_slice(&[2]); as equivalent to hash_slice(&[1, 2]); (see the sketch after this list), in which case it should be stated as a requirement on Hash implementations. Or:
  2. We treat them as different things, in which case the default implementation needs to change.
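
To make option 1 concrete, here is a small sketch of the question (added for illustration, using std's DefaultHasher); u8's hash_slice forwards the raw bytes to the hasher in a single write call, so a purely byte-streaming hasher cannot tell the two sequences apart:

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    fn main() {
        // Two separate hash_slice calls covering [1] and then [2]...
        let mut a = DefaultHasher::new();
        u8::hash_slice(&[1u8], &mut a);
        u8::hash_slice(&[2u8], &mut a);

        // ...versus one hash_slice call covering [1, 2].
        let mut b = DefaultHasher::new();
        u8::hash_slice(&[1u8, 2u8], &mut b);

        // DefaultHasher (SipHash) only sees the byte stream [1, 2] in both
        // cases, so these print the same value; a hasher that mixes in the
        // length of every write call (like ahash's AHasher above) will
        // generally treat them differently.
        println!("{} {}", a.finish(), b.finish());
    }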

Hm, OK, so it's based on the fact that 0xFF is not a valid UTF-8 byte and can't appear in a String without unsafe (and unsafe is unsafe)?

Yes, that's the idea behind the str Hash implementation. It makes the bytes written to the Hasher a prefix-free encoding of strings; using the 0xff byte as a delimiter achieves that since it never appears in valid UTF-8. The prefix-free encoding property is necessary for good randomized hashers (such as the default SipHasher) that seek to provide a low probability of collision for all keys.
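
To see the prefix-free encoding concretely, here is a sketch (added for illustration) with a hypothetical byte-recording Hasher; running both tuples through the real Hash impls shows the 0xff terminator after each string:

    use std::hash::{Hash, Hasher};

    // Hypothetical helper for illustration: a "hasher" that just records the
    // bytes it is asked to write.
    struct Recorder(Vec<u8>);

    impl Hasher for Recorder {
        fn write(&mut self, bytes: &[u8]) {
            self.0.extend_from_slice(bytes);
        }
        fn finish(&self) -> u64 {
            0 // irrelevant here; we only care about the recorded bytes
        }
    }

    fn recorded(value: &impl Hash) -> Vec<u8> {
        let mut r = Recorder(Vec::new());
        value.hash(&mut r);
        r.0
    }

    fn main() {
        // Prints [97, 98, 255, 99, 255] vs [97, 255, 98, 99, 255]:
        // the 0xff after each str is what keeps the two streams distinct.
        println!("{:?}", recorded(&("ab", "c")));
        println!("{:?}", recorded(&("a", "bc")));
    }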

However, this assumes that:

    hasher.write(&[1, 2, 3]);
    hasher.write_u8(0xff);

is equivalent to:

    hasher.write(&[1, 2, 3, 0xff]);

which is not specified in the docs.

The ticket referenced above by @scottmcm seeks to clarify this.
