Why does `str.hash(...)` pass an extra byte to the Hasher?

While implementing a custom `Hasher`, I noticed that the results of my Rust port didn't match the original C. Upon inspecting the actual bytes passed to `Hasher::write` by `Hash::hash`, I noticed an extra, naughty byte not present in the original input (itself a `str`). Sure enough, checking the source for `impl Hash for str`, we see:

    #[stable(feature = "rust1", since = "1.0.0")]
    impl Hash for str {
        #[inline]
        fn hash<H: Hasher>(&self, state: &mut H) {
            state.write(self.as_bytes());
            state.write_u8(0xff)
        }
    }

Does anyone know what that state.write_u8(0xff) is doing there?

2 Likes

I can't be sure, but I think it might be to ensure a predictable, reproducible, non-zero hash for empty strings.

1 Like

I think this is to ensure that something like ("foo", "bar") hashes differently than ("foobar", ""). Some hashers already include the length in write(&[u8]), but they don't have to.
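This is easy to see with a small byte-recording `Hasher` (a sketch; `Recorder` and `fed_bytes` are made-up names, not std API). It captures exactly what `Hash::hash` feeds in, and the trailing `0xff` is what keeps the two tuples apart:

```rust
use std::hash::{Hash, Hasher};

// Sketch: a Hasher that records every byte it is fed, so we can
// inspect exactly what a `Hash` impl passes to the hasher.
#[derive(Default)]
struct Recorder(Vec<u8>);

impl Hasher for Recorder {
    fn write(&mut self, bytes: &[u8]) {
        self.0.extend_from_slice(bytes);
    }
    fn finish(&self) -> u64 {
        0 // not a real hash; we only care about the recorded bytes
    }
}

fn fed_bytes<T: Hash>(value: &T) -> Vec<u8> {
    let mut r = Recorder::default();
    value.hash(&mut r);
    r.0
}

fn main() {
    // Each str contributes its bytes plus a 0xff terminator:
    assert_eq!(&fed_bytes(&("foo", "bar"))[..], &b"foo\xffbar\xff"[..]);
    assert_eq!(&fed_bytes(&("foobar", ""))[..], &b"foobar\xff\xff"[..]);
    // Without the terminator, both tuples would feed the bytes of "foobar".
}
```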

12 Likes

Wouldn't that be up to the individual Hasher, though? There's often a non-zero seed that gets built upon; it would never be zero anyway.

My understanding is that Hash impls decide what to hash, and the Hasher decides how to hash it, i.e., which hashing algorithm to use.

In that sense, the code suggests that without the extra byte an empty string would hash nothing at all (or, more accurately, the same as an empty byte slice) on the first write call.

7 Likes

Ah, right, they're overlapping in their "emptiness", and so to disambiguate, an extra byte is thrown in there in impl Hash for str.

For reference, a slice does this:

    impl<T: Hash> Hash for [T] {
        #[inline]
        fn hash<H: Hasher>(&self, state: &mut H) {
            self.len().hash(state);
            Hash::hash_slice(self, state)
        }
    }

    fn hash_slice<H: Hasher>(data: &[Self], state: &mut H)
    where
        Self: Sized,
    {
        for piece in data {
            piece.hash(state);
        }
    }
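The length prefix can be observed directly with a small byte-recording Hasher (a sketch; `Recorder` is a made-up name). One caveat, assuming the current std defaults: the integer write methods emit native byte order, hence `to_ne_bytes` below:

```rust
use std::hash::{Hash, Hasher};

// Sketch: record the raw bytes a `Hash` impl feeds to the hasher.
#[derive(Default)]
struct Recorder(Vec<u8>);

impl Hasher for Recorder {
    fn write(&mut self, bytes: &[u8]) {
        self.0.extend_from_slice(bytes);
    }
    fn finish(&self) -> u64 {
        0 // unused; we only inspect the recorded bytes
    }
}

fn main() {
    let mut r = Recorder::default();
    let slice: &[u8] = &[1, 2];
    slice.hash(&mut r);

    // The slice's length is hashed first (as a native-endian usize
    // under the default Hasher methods), then the elements follow:
    let mut expected = 2usize.to_ne_bytes().to_vec();
    expected.extend([1u8, 2u8]);
    assert_eq!(r.0, expected);
}
```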

I dug up the commit that introduced the 0xFF byte. It is precisely to make ("a", "b") hash differently from ("ab", "").

26 Likes

Thanks for looking that up.

Formally speaking, what this does is make what is passed to the hasher a prefix-free encoding of the input. Without the 0xff it would not be prefix-free. This is a necessary and sufficient property for making sequence encodings unique.

I think this should be documented as a requirement of Hash.

12 Likes

Indeed, the sequence property is quite important, since otherwise it would be the responsibility of every combinator to encode a "field" separator within the hash.

An example of this issue can be seen in BTreeMap, which forgets to encode its end in any form (e.g., by hashing its length before or after the entries). This leads to:

    assert_eq!(
        deterministic_hash(
            (
                map! {
                    "a" => 42,
                },
                map! {
                    "b" => 27,
                },
            )
        ),
        deterministic_hash(
            (
                map! {
                    "a" => 42,
                    "b" => 27,
                },
                map!(),
            )
        ),
    );
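The collision doesn't depend on the hasher at all. A minimal runnable sketch: `NoLen` is a hypothetical wrapper modeling a container whose Hash impl hashes only its elements (neither length nor terminator), and two unequal tuples then feed the identical byte stream:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical: models a container that forgets to encode its extent,
// hashing only its elements.
struct NoLen(Vec<u8>);

impl Hash for NoLen {
    fn hash<H: Hasher>(&self, state: &mut H) {
        for b in &self.0 {
            b.hash(state); // elements only: no length, no terminator
        }
    }
}

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

fn main() {
    // Both tuples feed the identical byte stream [1, 2, 3], so they
    // collide no matter how good the hashing algorithm is.
    assert_eq!(
        hash_of(&(NoLen(vec![1]), NoLen(vec![2, 3]))),
        hash_of(&(NoLen(vec![1, 2]), NoLen(vec![3]))),
    );
}
```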

Indeed! But, fwiw, this only works because a str must be valid UTF-8, and thus can never contain the byte 0xff anywhere inside it. Just wanted to point that out in case someone were tempted to apply this "append 0xff" trick to, say, something like [u8].
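To illustrate the pitfall (a sketch; `Terminated` is a hypothetical wrapper, not anything in std): for raw bytes, the terminator byte can occur in the data itself, so the encoding is no longer prefix-free and unequal tuples can feed the identical stream:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical: a byte-slice wrapper that (wrongly) copies str's trick.
struct Terminated<'a>(&'a [u8]);

impl Hash for Terminated<'_> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write(self.0);
        state.write_u8(0xff); // fine for UTF-8 text, unsound for raw bytes
    }
}

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

fn main() {
    // ([0xff], []) feeds (ff ff)(ff) and ([], [0xff]) feeds (ff)(ff ff):
    // the same three bytes either way, so the hashes are identical.
    assert_eq!(
        hash_of(&(Terminated(&[0xff]), Terminated(&[]))),
        hash_of(&(Terminated(&[]), Terminated(&[0xff]))),
    );
}
```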

5 Likes

Filed:

It's not necessarily wrong if such cases hash to the same thing -- it's just an avoidable collision.

5 Likes

To me it feels like, while it's OK for an implementation of Hasher to generate whatever collisions it wants on the data it's fed (you can have better or worse hashers; it's a quality issue), it should not be OK for unequal data to feed the same input to the Hasher, thus giving it no chance to succeed.

I.e., it seems to me that collisions should only happen in the Hasher, not in Hash, since the latter is avoidable for at least 99% of types, and the types where it's not avoidable should probably not implement Hash.

If you make Hash collision-free (best achieved by the prefix-free requirement), it's possible to implement a Hasher that mathematically guarantees a certain collision probability bound (using universal hashing).

If you don't have that guarantee on Hash, it becomes impossible to implement such a universal Hasher.
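As a sketch of what "universal" means here (the function name and constant below are mine, not from the thread, and this is one textbook family, multiply-shift, not *the* construction): a randomly chosen odd multiplier gives a provable collision-probability bound for distinct inputs, but the bound only helps if unequal data was first encoded injectively, which is exactly what a prefix-free Hash encoding provides.

```rust
// Sketch of an approximately 2-universal family (multiply-shift):
// h_a(x) = (a * x mod 2^64) >> (64 - m), with a random odd 64-bit a.
// For x != y, Pr over a of [h_a(x) == h_a(y)] <= 2 / 2^m. This says
// nothing about unequal data that was already encoded to equal x.
fn multiply_shift(a: u64, m: u32, x: u64) -> u64 {
    assert!(a % 2 == 1 && (1..=64).contains(&m));
    a.wrapping_mul(x) >> (64 - m)
}

fn main() {
    // `a` would be drawn at random per hasher instance; fixed here
    // for the sake of the example (any odd constant works).
    let a = 0x9e37_79b9_7f4a_7c15;
    let h = multiply_shift(a, 16, 42);
    assert!(h < 1 << 16); // outputs fit in m = 16 bits
}
```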

6 Likes

Hmm, okay, that's fair. (And let's switch to the issue if there's any more to say.)

Hashing a slice hashes the length there, basically for the same reason: the length acts as a terminator. In that light, hashing a single terminator byte for a str instead of the length is a small feature that saves a "lot" of work (in some circumstances).
