What happens if `Hash` and `PartialEq` don't match when using `HashMap`?

The documentation for HashMap and HashSet states that k1 == k2 -> hash(k1) == hash(k2) must hold when implementing PartialEq, Eq, and Hash.

  1. Is this an "implies" or an "if and only if"? My guess is "implies", because that would mean hash(k1) == hash(k2) does not imply k1 == k2, which follows from the existence of hash collisions.
  2. Why is this a requirement?
  3. What happens if this is not the case?
    Say we have:
#[derive(Eq)]
struct S {
    a: u8,
    b: u8,
}

impl std::hash::Hash for S {
    fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
        self.a.hash(state);
        self.b.hash(state);
    }
}

impl PartialEq for S {
    fn eq(&self, other: &Self) -> bool {
        self.a == other.a
    }
}

what happens? I would assume that the <S as Hash> implementation is used to find the bucket to place the item in, <S as PartialEq> is used to check that there isn't a hash collision, and the element in the bucket is only replaced if it has the same `a` value; if `a` doesn't match, it is chained/re-probed. It seems to me this would only cause some elements with an equivalent `a` to be placed in different buckets (because they have different `b`) without causing any logic issues. I assume I'm missing something?

1 Like

This is the logic issue. It breaks the rule that a hash table should contain no duplicate keys as defined by Eq. And then if it is resized, the elements might or might not then end up in different buckets, so the hash table isn't even consistent with itself over time.

In general, if Hash hashes more than Eq compares, then hash table elements may end up effectively invisible, at random, which isn't a property you want in your data structures.

2 Likes

If the Eq and Hash implementations for a type are inconsistent, the hash table is allowed to arbitrarily misbehave, so long as it doesn't trigger UB.

(It can return wrong things, it can fail to return things, it can panic, etc. Just so long as it stays safe.)

3 Likes

Ok, so in the case of resizing, the issue is: if A and B hash to different buckets, but after the resize they hash to the same bucket, then the second to be inserted into the newly sized table replaces the first if A == B.
This isn't an issue if Hash is less specific than Eq, because two items won't replace each other regardless of a collision.
If the hash table doesn't resize is there still an issue?

Resizing was only an example of an internally detectable failure, where the hash table's own algorithms could notice a problem. Even without that, the hash table will not be behaving consistently according to its documented properties. It stops being a collection in which you can reliably look things up.

The problem is not that the hash table will necessarily fail in some definite way, like a panic. The problem is that it can't be a useful collection under these conditions.

3 Likes

So this is considered a precondition of the data structure? Not following this condition means no behavior is guaranteed?
What are the documented properties you refer to? Looking at the documentation in std, it seems to mostly rely on already knowing what a hash table is. My guess from this discussion is that it's something like "for all keys x, y: map[x] is map[y] if and only if x == y", and hash is more like an implementation detail.

These are the documented properties. As you may have already found, they are mentioned in the HashMap documentation.


It is required that the keys implement the Eq and Hash traits, although this can frequently be achieved by using #[derive(PartialEq, Eq, Hash)]. If you implement these yourself, it is important that the following property holds:

k1 == k2 -> hash(k1) == hash(k2)

In other words, if two keys are equal, their hashes must be equal.

1 Like

Yes. Hash tables need correct hashes.

Not following this condition means no behavior is guaranteed?

Yes. As the documentation says:

The behavior resulting from such a logic error is not specified, but will be encapsulated to the HashMap that observed the logic error and not result in undefined behavior. This could include panics, incorrect results, aborts, memory leaks, and non-termination.

When I previously said “In general, if Hash hashes more than Eq compares, then hash table elements may end up effectively invisible, at random, which isn't a property you want in your data structures” — I was describing a specific example of how the table might misbehave. HashMap reserves the right to misbehave other ways too.

My guess from this discussion is that it's something like "for all keys x, y: map[x] is map[y] if and only if x == y"

That doesn't quite hold — the user is free to insert equal values under different keys, if that's what they want to get.

hash is more like an implementation detail.

Sort of, yes. The rule is: if your type correctly implements Hash, then you can use it as a key in a HashMap and get results that conform to the documentation of the operations on HashMap. Similarly, if your type correctly implements Ord, then you can use it in a BTreeMap. The exact way the container type uses the trait implementation, and therefore the consequences of an incorrect implementation, are implementation details. When the implementation is correct, then these containers can be used without needing to think about the fact that they are based on hashing (or ordered comparison).

Yes. As the documentation says:

The behavior resulting from such a logic error is not specified, but will be encapsulated to the HashMap that observed the logic error and not result in undefined behavior. This could include panics, incorrect results, aborts, memory leaks, and non-termination.

The documentation says this about keys being modified while in the map:

It is a logic error for a key to be modified in such a way that the key’s hash, as determined by the Hash trait, or its equality, as determined by the Eq trait, changes while it is in the map...The behavior resulting from such a logic error...

It does not say this about violating the k1 == k2 -> hash(k1) == hash(k2) property. If that is meant to apply to both, then the documentation needs updating.

That doesn't quite hold — the user is free to insert equal values under different keys, if that's what they want to get.

Right. I used "is" as "corresponds to the same value location as", not ==. In which case it sounds like that description is accurate: "for all keys x, y: map[x] corresponds to the same value location as map[y] if and only if x == y"

Yeah, that's true. The documentation doesn't say it, but all the other things apply to having an inconsistent Hash and Eq implementation as well. The standard library makes no guarantees about behavior besides safety, and there are no missing guarantees (safety is guaranteed by the absence of the unsafe keyword). Maybe the documentation could be updated to include inconsistent Hash and Eq in the list of things that can cause logic errors.

Realistically, you may see that, when inserting two values where Hash is different but Eq is the same, the map ends up with both values. This could result in one value only being deletable with clear or retain. The map may still work right now for a certain task, but a non-breaking, internal change to the standard library could make it not work anymore. What makes this complicated is that the main difference between hash table implementations is what happens when a hash collision is found, so the actual behavior you see will be dependent on the kind of hash table, as well as the random state of the hasher and the capacity of the map. Good hash table implementations will make collisions result in a range of behaviors in order to decrease the chance that patterns in the data ruin the O(1) time complexity of map operations.

It's good advice to learn what a hash table is anyway, since it's a common and useful data structure. The standard library does mostly assume you are at least somewhat familiar, and if you understand it well, it will be easy to figure out the kinds of problems that an inconsistent Hash and Eq could cause. However, you don't need to know anything about the specific type of hash table. The API only consists of things common to basically any type of hash table, and the standard library can (and has in the past) change the type of hash table it uses.

Other languages can make it impossible to have inconsistent Hash and Eq, like by always hashing and eq'ing all the primitives in the key. So Rust gives more flexibility in exchange for some user requirements.

Well, I would say the docs are ambiguous. Let's look at the whole part:

It is required that the keys implement the Eq and Hash traits, although this can frequently be achieved by using #[derive(PartialEq, Eq, Hash)]. If you implement these yourself, it is important that the following property holds:

k1 == k2 -> hash(k1) == hash(k2)

In other words, if two keys are equal, their hashes must be equal.

It is a logic error for a key to be modified in such a way that the key’s hash, as determined by the Hash trait, or its equality, as determined by the Eq trait, changes while it is in the map.

While the documentation doesn't explicitly say that the violation of k1 == k2 -> hash(k1) == hash(k2) is a "logic error" with the subsequently explained consequences, it still says that this property is "important". It doesn't say that HashMap or HashSet will function correctly if that property is violated. From context, it's clear that errors can happen if you violate this "important" property, and the documentation doesn't tell you which errors can happen. So basically anything (reasonable?) is possible except undefined behavior (UB) due to Rust's safety guarantees.

If you think the documentation could/should be improved in that matter, you could open a documentation issue, I guess.

There was a previous discussion on the docs in that matter, which you can read about on IRLO: Using std::collections and unsafe: anything can happen. That discussion resulted in PR #97316.

I would say it's correct for it to not give any more details than this. Going that way -- saying too much about an implementation -- is how you end up with things like unordered_map in C++ that promised too much and thus can't be implemented as efficiently as Rust's HashMap.

2 Likes

The thing is that PR #97316 gives an explanation about which errors may happen, but only in regard to modified keys, and not in regard to other erroneous implementations of Eq and Hash (in particular the violation of k1 == k2 -> hash(k1) == hash(k2)).

I think that's what @taperedDominoes pointed out, if I understand right.


However, if Eq and Hash are implemented in malicious ways, then of course HashSet will also be malicious. So I think the docs are good as they are now.

Yes. My confusion was caused by the description indicating the effects of logic errors from in-place-modified keys without indicating that it also applies to an invalid Hash/Eq implementation.

The authors of the documentation are humans, and they expect that users interpret the documentation sensibly and in good faith. Hence, it's obvious that you need to uphold all guarantees of the map, not just immutability of keys, in order to get correct behavior.

If you want to intentionally misrepresent the documentation, you'll likely be able to, since it's not a formal, mechanistically-checkable specification. (That would be called "code", and it's not useful for the purpose that documentation is written for.)

1 Like
  • As indicated in #97316, confusion/clarification on this matter is not unreasonable.
  • The docs are a description of intention even in the event that the code itself changes. The docs distinguish what is vs. what is not an implementation detail.
  • Even if I fully understood all aspects of the code that's mostly implementation details and could be changed in a future version. If I only cared about what happened at this particular point in time with this particular version of tools I would just write code, test with whatever level of rigor is required, and if it works no further thought necessary.
Hence, it's obvious that you need to uphold all guarantees of the map

It was not originally obvious to me that this was a required precondition and not a "be extra sure you know what you're doing/that it works if you do this" condition.

I understand now that behavior relying on invalid Hash/Eq implementations may break in minor version changes in the future, and the question posed in this topic has been answered.

2 Likes

While I don't see a big problem, I think it's perfectly fine to criticize Rust's documentation. Rust is known to lack normative specifications, and this includes unsafe Rust and when UB will happen.

Some examples:

Finally, this book is not normative. It may include details that are specific to rustc itself, and should not be taken as a specification for the Rust language. We intend to produce such a book someday, and until then, the reference is the closest thing we have to one.

Rust does not yet have a defined memory model. Various academics and industry professionals are working on various proposals, but for now, this is an under-defined place in the language.

  • There are several competing concepts in regard to the aliasing rules (at least two: Stacked Borrows and Tree Borrows), which have been discussed in other threads.
  • It has been discussed whether it's okay to have a mutable reference to a slice with uninitialized memory.

All these examples circle around the boundary between safe and unsafe Rust. Documentation could and should be improved, especially considering that safety is one of Rust's core features.

I think it's perfectly reasonable to stumble upon the documentation here because when wanting to avoid critical errors, one should be very cautious to not get caught by surprises and end up writing unsound code.

Ideally, a programming language has a specification that is normative and unambiguous. It's not always possible in practice to achieve that goal 100%, but it should be pursued nonetheless. In my opinion, pointing out errors or ambiguities (or even parts that could be misunderstood) isn't bad faith but an opportunity to improve Rust and to close some of the many holes we can still witness in the documentation (examples given above).

Of course, care must be taken not to do a disservice by overcomplicating things and making the documentation de-facto less clear to read. I think it's a tricky exercise.

I personally appreciate observations like the one made by the OP. :slightly_smiling_face:

Last but not least, I would like to add that criticizing Rust or its documentation doesn't mean criticizing the people who wrote it. All involved people have done an amazing job!

2 Likes

They also all involve the not-yet-existent specification of Rust the language, whereas this thread is primarily concerned with a std library type's documentation and behavior. std's documentation is generally considered normative for the library.[1]

IMO all that's really needed for the OP is some slight rewording to make more clear that the "anything except UB may happen" section applies to all "illogical" Hash/Eq situations.


  1. I'm sure someone could track down exceptions, but breaking promises made in documentation is not something done lightly. ↩︎

2 Likes

This thread really isn't about UB, so maybe bringing up examples that involve the lack of a formal language specification is a bit misleading here (but it still shows how some fundamental parts of the docs could use some improvements).

However, with HashSet and HashMap there was the previous problem that "anything can happen but UB" is a difficult statement (as has been discussed on IRLO and the linked PR), so the whole issue is somehow linked to Rust's safety guarantees and to when UB may happen.


I think the concept of "logic errors" is pretty much related to how Rust's docs (whether std or the language reference) attempt to differentiate between errors that may lead to UB and errors that may not lead to UB. The reference on logic errors (link above) still says:

Violating such constraints is not considered unsafe, yet the program is considered erroneous and its behavior unpredictable.

Overall, it's a difficult task for a programmer to understand this concept of "logic errors" and what "unpredictable behavior" means. It would generally be nice to be unambiguous regarding when such behavior arises and what it includes. The latter part has been tackled by #97316, but the former part seems to be incomplete, as pointed out by the OP. (If I understand right.)

1 Like

It means misbehaving arbitrarily as long as it's not UB. So your program will potentially have bugs, namely that it operates on incorrect or inconsistent data and produces wrong results, but it will not violate memory safety.