In implementing a custom instance of Hasher, I noticed that the results of my Rust port didn't match up with the original C. Upon inspecting the actual bytes passed to Hasher::write by Hash::hash, I noticed an extra naughty byte not present in the original input (itself a str). Sure enough, upon checking the source for impl Hash for str, we see:
Formally speaking, what this does is make what is passed to the hasher a prefix-free encoding of the input. Without the 0xff it would not be prefix-free. This is a necessary and sufficient property for making sequence encodings unique.
I think this should be documented as a requirement of Hash.
Indeed! But, fwiw, this only works because str must be valid UTF-8 and thus cannot contain the byte 0xff anywhere. Just wanted to point that out in case someone were tempted to apply this "append 0xff" trick to, say, something like [u8].
To me it feels that while it's OK for an implementation of Hasher to generate any collisions it wants on the data it's fed (you can have better or worse hashers; it's a quality issue), it should not be OK for unequal data to feed the same bytes to the Hasher, thus giving it no chance to succeed.
I.e. it seems to me that collisions should only happen in Hasher, not in Hash, since the latter is avoidable for at least 99% of types, and the types where it's not possible should probably not implement Hash.
If you make Hash collision-free (best achieved by the prefix-free requirement), it's possible to implement a Hasher that mathematically guarantees a certain collision probability bound (using universal hashing).
If you don't have that guarantee on Hash, it becomes impossible to implement such a universal Hasher.
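As a sketch of what such a universal Hasher looks like, here is the multiply-shift family: for a random odd key a, any two fixed distinct u64 inputs collide with probability at most 2/2^L on an L-bit output. The bound is only meaningful if unequal values are guaranteed to feed unequal inputs in the first place (the constant below stands in for a random key):

```rust
// Multiply-shift universal hashing: h_a(x) = (a * x mod 2^64) >> (64 - L),
// for a random odd 64-bit key `a` and an L-bit output.
fn multiply_shift(a: u64, x: u64, l: u32) -> u64 {
    a.wrapping_mul(x) >> (64 - l)
}

fn main() {
    // Odd constant standing in for a key drawn at random per table.
    let a: u64 = 0x9E37_79B9_7F4A_7C15;
    let h1 = multiply_shift(a, 1, 16);
    let h2 = multiply_shift(a, 2, 16);
    // Outputs fit in 16 bits.
    assert!(h1 < (1 << 16) && h2 < (1 << 16));
    println!("{h1} {h2}");
}
```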
Hashing a slice hashes the length first, basically for the same reason: the length acts as framing that keeps the encoding prefix-free. In that light, hashing a single terminator byte for a str instead of the length is a small feature that saves a "lot" of work (in some circumstances).