I can't get one thing. I saw this in standard library:
impl Hash for str {
#[inline]
fn hash<H: Hasher>(&self, state: &mut H) {
state.write(self.as_bytes());
state.write_u8(0xff) // here
}
}
From the official doc:
For example, the standard implementation of Hash for &str passes an extra 0xFF byte to the Hasher so that the values ("ab", "c") and ("a", "bc") hash differently.
Okay, this maybe have sense, but I can't do such code:
fn calc_hash1(builder: &DefaultHashBuilder, data: &(&str, &str)) -> u64 {
let mut s = builder.build_hasher();
s.write(data.0.as_bytes());
s.write(data.1.as_bytes());
s.finish()
}
let data0 = ("ab", "c");
let data1 = ("a", "bc");
println!("{}", calc_hash1(&hash_builder, &data0));
println!("{}", calc_hash1(&hash_builder, &data1));
Whether you get a collision in this case or not depends on the implementation of Hasher::write.
I get a collision with DefaultHasher:
The prefix requirement is a bit ambiguous. E.g. are two calls to Hasher::write_u32(1) the same as one call to Hasher::write_u64(0x100000001)? It's not well specified.
It looks like you aren't using std types, are you? There is no DefaultHashBuilder in std. The closest I could find is hashbrown::DefaultHashBuilder, which is just an alias for RandomState. This means that if you build two hashers, they will (with overwhelming probability) hash the very same value into different hashes. You are likely running into this.
But why do you expect them to be different? The implementation you're linking to is impl Hash for str, but you don't use this implementation, you hash str.as_bytes() directly, so this 0xff "padding byte" is unused.
OP is trying to replicate a case where hashing ("ab", "c") results in a collision with ("a", "bc"), and for that purpose, s/he is intentionally hashing the bytes only, to have exact control over the sequence of bytes being written, and expecting the same hash, but got two different ones.
No, a particular instance RandomState will create the same instances of Hasher, but the hashers created by two different RandomState instances are unlikely to produce the same result for the same values (quoted word by word from RandomState's documentation). Looks like OP is passing a reference to the same hash_builder though, so it should get a collision. It would be helpful if OP shared a playground link that actually shows no collision taking place, otherwise we're just talking about speculations.
It feels to me like the wrong place for this -- it's not Hasher::write that should be doing this, but rather [T]::hash. I wouldn't expect that I'd need to give exactly the same sequence of write calls to the Hasher, so long as I give the same bytes.
I don't think it does. It just says its behavior is "left unspecified" but it would be good to specify the requirements. At the very least you got to have the property that calling hash_slice on equal slices results in the same calls to Hasher, currently it doesn't even say that.
Even for Hash::hash it currently doesn't really say that. Instead it says that the final hash must be same for equal keys, but calculating a hash is a Hasher's job. Hash documentation should just talk about what Hash implementations must do, not what a Hasher must do.
I have added a comment to the ticket proposing the requirements for Hash::hash.
For hash_slice I think it would be good to have a decision one way or the other. Either:
We treat hash_slice(&[1]); hash_slice(&[2]); as equivalent to hash_slice(&[1, 2]);, in which case it should be stated as a requirement on Hash implementations. Or:
We treat them as different things, in which case the default implementation needs to change.
Yes that's the idea behind the strHash implementation. This makes it so that the bytes written to the Hasher are a prefix-free encoding of strings, and using the 0xff byte as a delimiter achieves that since it never appears in UTF-8. The prefix-free encoding property of Hash is necessary for good, randomized hashers (such as the default SipHasher) that seek to provide low probability of collision for all keys.
However, this assumes that:
hasher.write(&[1, 2, 3]);
hasher.write_u8(0xff);
is equivalent to:
hasher.write(&[1, 2, 3, 0xff]);
which is not specified in the docs.
The ticket referenced above by @scottmcm seeks to clarify this.