Why does `str.hash(...)` pass an extra byte to the Hasher?

While implementing a custom `Hasher`, I noticed that the results of my Rust port didn't match the original C. Upon inspecting the actual bytes passed to `Hasher::write` by `Hash::hash`, I noticed an extra, naughty byte not present in the original input (itself a `str`). Sure enough, checking the source for `impl Hash for str`, we see:

    #[stable(feature = "rust1", since = "1.0.0")]
    impl Hash for str {
        #[inline]
        fn hash<H: Hasher>(&self, state: &mut H) {
            state.write(self.as_bytes());
            state.write_u8(0xff)
        }
    }

Does anyone know what that state.write_u8(0xff) is doing there?

2 Likes

I can't be sure, but I think it might be to ensure a predictable, reproducible, non-zero hash for empty strings.

1 Like

I think this is to ensure that something like ("foo", "bar") hashes differently than ("foobar", ""). Some hashers already include the length in write(&[u8]), but they don't have to.
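This is easy to see with a small byte-recording `Hasher` (a sketch; `Recorder` and `fed_bytes` are made-up names, not std API). It captures exactly what `Hash::hash` feeds in, and the trailing `0xff` is what keeps the two tuples apart:

```rust
use std::hash::{Hash, Hasher};

// Sketch: a Hasher that records every byte it is fed, so we can
// inspect exactly what a `Hash` impl passes to the hasher.
#[derive(Default)]
struct Recorder(Vec<u8>);

impl Hasher for Recorder {
    fn write(&mut self, bytes: &[u8]) {
        self.0.extend_from_slice(bytes);
    }
    fn finish(&self) -> u64 {
        0 // not a real hash; we only care about the recorded bytes
    }
}

fn fed_bytes<T: Hash>(value: &T) -> Vec<u8> {
    let mut r = Recorder::default();
    value.hash(&mut r);
    r.0
}

fn main() {
    // Each str contributes its bytes plus a 0xff terminator:
    assert_eq!(&fed_bytes(&("foo", "bar"))[..], &b"foo\xffbar\xff"[..]);
    assert_eq!(&fed_bytes(&("foobar", ""))[..], &b"foobar\xff\xff"[..]);
    // Without the terminator, both tuples would feed the bytes of "foobar".
}
```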

12 Likes

Wouldn't that be up to the individual Hasher, though? There's often a non-zero seed that gets built upon; it would never be zero anyway.

My understanding is that Hash impls decide what to hash, and the Hasher decides how to hash it, i.e., which hashing algorithm to use.

In that sense, the code suggests that without the extra byte an empty string would hash nothing at all (or, more accurately, the same as an empty byte slice) on the first write call.

7 Likes

Ah, right, they're overlapping in their "emptiness", and so to disambiguate, an extra byte is thrown in there in impl Hash for str.

For reference, a slice does this:

    impl<T: Hash> Hash for [T] {
        #[inline]
        fn hash<H: Hasher>(&self, state: &mut H) {
            self.len().hash(state);
            Hash::hash_slice(self, state)
        }
    }

    fn hash_slice<H: Hasher>(data: &[Self], state: &mut H)
    where
        Self: Sized,
    {
        for piece in data {
            piece.hash(state);
        }
    }
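The length prefix can be observed directly with a small byte-recording Hasher (a sketch; `Recorder` is a made-up name). One caveat, assuming the current std defaults: the integer write methods emit native byte order, hence `to_ne_bytes` below:

```rust
use std::hash::{Hash, Hasher};

// Sketch: record the raw bytes a `Hash` impl feeds to the hasher.
#[derive(Default)]
struct Recorder(Vec<u8>);

impl Hasher for Recorder {
    fn write(&mut self, bytes: &[u8]) {
        self.0.extend_from_slice(bytes);
    }
    fn finish(&self) -> u64 {
        0 // unused; we only inspect the recorded bytes
    }
}

fn main() {
    let mut r = Recorder::default();
    let slice: &[u8] = &[1, 2];
    slice.hash(&mut r);

    // The slice's length is hashed first (as a native-endian usize
    // under the default Hasher methods), then the elements follow:
    let mut expected = 2usize.to_ne_bytes().to_vec();
    expected.extend([1u8, 2u8]);
    assert_eq!(r.0, expected);
}
```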

I dug up the commit that introduced the 0xFF byte. It is precisely to make ("a", "b") hash differently from ("ab", "").

26 Likes

Thanks for looking that up.

Formally speaking, what this does is make what is passed to the hasher a prefix-free encoding of the input. Without the 0xff it would not be prefix-free. This is a necessary and sufficient property for making sequence encodings unique.

I think this should be documented as a requirement of Hash.

12 Likes

Indeed, the sequence property is quite important, since otherwise it would be the responsibility of every combinator to encode a "field" separator within the hash.

An example of this issue can be seen in BTreeMap, which forgets to encode its end in any form (e.g., by hashing its length before or after the entries). This leads to:

    assert_eq!(
        deterministic_hash(
            (
                map! {
                    "a" => 42,
                },
                map! {
                    "b" => 27,
                },
            )
        ),
        deterministic_hash(
            (
                map! {
                    "a" => 42,
                    "b" => 27,
                },
                map!(),
            )
        ),
    );
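The collision doesn't depend on the hasher at all. A minimal runnable sketch: `NoLen` is a hypothetical wrapper modeling a container whose Hash impl hashes only its elements (neither length nor terminator), and two unequal tuples then feed the identical byte stream:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical: models a container that forgets to encode its extent,
// hashing only its elements.
struct NoLen(Vec<u8>);

impl Hash for NoLen {
    fn hash<H: Hasher>(&self, state: &mut H) {
        for b in &self.0 {
            b.hash(state); // elements only: no length, no terminator
        }
    }
}

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

fn main() {
    // Both tuples feed the identical byte stream [1, 2, 3], so they
    // collide no matter how good the hashing algorithm is.
    assert_eq!(
        hash_of(&(NoLen(vec![1]), NoLen(vec![2, 3]))),
        hash_of(&(NoLen(vec![1, 2]), NoLen(vec![3]))),
    );
}
```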

Indeed! But, fwiw, this only works because a str must be valid UTF-8, and thus can never contain the byte 0xff anywhere inside it. Just wanted to point that out in case someone were tempted to apply this "append 0xff" trick to, say, something like [u8].
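To illustrate the pitfall (a sketch; `Terminated` is a hypothetical wrapper, not anything in std): for raw bytes, the terminator byte can occur in the data itself, so the encoding is no longer prefix-free and unequal tuples can feed the identical stream:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical: a byte-slice wrapper that (wrongly) copies str's trick.
struct Terminated<'a>(&'a [u8]);

impl Hash for Terminated<'_> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write(self.0);
        state.write_u8(0xff); // fine for UTF-8 text, unsound for raw bytes
    }
}

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

fn main() {
    // ([0xff], []) feeds (ff ff)(ff) and ([], [0xff]) feeds (ff)(ff ff):
    // the same three bytes either way, so the hashes are identical.
    assert_eq!(
        hash_of(&(Terminated(&[0xff]), Terminated(&[]))),
        hash_of(&(Terminated(&[]), Terminated(&[0xff]))),
    );
}
```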

5 Likes

Filed:

It's not necessarily wrong if such cases hash to the same thing -- it's just an avoidable collision.

5 Likes

To me it feels like, while it's OK for an implementation of Hasher to generate whatever collisions it wants on the data it's fed (you can have better or worse hashers; it's a quality issue), it should not be OK for unequal data to feed the same input to the Hasher, thus giving it no chance to succeed.

I.e., it seems to me that collisions should only happen in the Hasher, not in Hash, since the latter is avoidable for at least 99% of types, and the types where it's not avoidable should probably not implement Hash.

If you make Hash collision-free (best achieved by the prefix-free requirement), it's possible to implement a Hasher that mathematically guarantees a certain collision probability bound (using universal hashing).

If you don't have that guarantee on Hash, it becomes impossible to implement such a universal Hasher.
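As a sketch of what "universal" means here (the function name and constant below are mine, not from the thread, and this is one textbook family, multiply-shift, not *the* construction): a randomly chosen odd multiplier gives a provable collision-probability bound for distinct inputs, but the bound only helps if unequal data was first encoded injectively, which is exactly what a prefix-free Hash encoding provides.

```rust
// Sketch of an approximately 2-universal family (multiply-shift):
// h_a(x) = (a * x mod 2^64) >> (64 - m), with a random odd 64-bit a.
// For x != y, Pr over a of [h_a(x) == h_a(y)] <= 2 / 2^m. This says
// nothing about unequal data that was already encoded to equal x.
fn multiply_shift(a: u64, m: u32, x: u64) -> u64 {
    assert!(a % 2 == 1 && (1..=64).contains(&m));
    a.wrapping_mul(x) >> (64 - m)
}

fn main() {
    // `a` would be drawn at random per hasher instance; fixed here
    // for the sake of the example (any odd constant works).
    let a = 0x9e37_79b9_7f4a_7c15;
    let h = multiply_shift(a, 16, 42);
    assert!(h < 1 << 16); // outputs fit in m = 16 bits
}
```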

6 Likes

Hmm, okay, that's fair. (And let's switch to the issue if there's any more to say.)

Hashing a slice hashes the length there, basically for the same reason: the length acts as a terminator. In that light, hashing a single terminator byte for a str instead of the length is a small feature that saves a "lot" of work (in some circumstances).
