Intersection of multiple HashSet's?

While looking for an answer to the question in the title, I found this Reddit thread. I've tried to implement a function that always returns a fresh set using the fold / reduce suggestion there:

use std::collections::HashSet;
use std::hash::Hash;

fn intersections<T>(sets: &Vec<HashSet<T>>) -> HashSet<T>
where
    T: Clone + Eq + Hash,
{
    match sets.len() {
        0 => HashSet::new(),
        1 => sets[0].clone(),
        _ => sets.iter().reduce(|acc, set| acc & set).unwrap(),
    }
}

Unfortunately, the snippet above does not compile:

error[E0308]: mismatched types
  --> src/lib.rs:11:44
   |
11 |         _ => sets.iter().reduce(|acc, set| acc & set).unwrap(),
   |                                            ^^^^^^^^^
   |                                            |
   |                                            expected `&HashSet<T>`, found struct `HashSet`
   |                                            help: consider borrowing here: `&(acc & set)`
   |
   = note: expected reference `&HashSet<T>`
                 found struct `HashSet<T>`

I have trouble understanding why a borrow is expected there. Of course the help is bogus, since I can't borrow on a temporary value there. Can someone explain?

In addition, can someone explain why I need to be explicit about T supporting Eq + Hash? Shouldn't that be implied by the fact that I'm using HashSet<T>?


I have trouble understanding why a borrow is expected there

Roughly, any Collection<T>::iter() will be an Iterator<Item = &T>, since it only takes a &Collection<T> and therefore can't move any of the contents. There's normally a Collection<T>::into_iter() which takes the collection by value and therefore can move out the items.

However, in this case BitAnd is implemented for &HashSet, so you are fine; its suggestion is accurate (and it will return a new HashSet)

In addition, can someone explain why I need to be explicit about T supporting Eq + Hash? Shouldn't that be implied by the fact that I'm using HashSet?

That is a general thing with generics, unfortunately: you need to be explicit about every bound because it's a promise you're making as part of your public interface for the function. Often types don't even have constraints on the type itself, but only on specific impl blocks that need them, so it wouldn't even help in many cases.

Often you can take advantage of more interesting constraints, though: for example in this case you could use something like where &HashSet<T>: BitAnd
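
For what it's worth, here is one way that could look — a sketch (assuming the default hasher) where the function never names Eq + Hash directly; the higher-ranked bound mirrors std's BitAnd impl for &HashSet:

```rust
use std::collections::HashSet;
use std::ops::BitAnd;

// Sketch: demand only "references to these sets can be intersected
// with `&`" (plus Clone for the seed value), instead of spelling out
// Eq + Hash on T ourselves.
fn intersections<T>(sets: &[HashSet<T>]) -> HashSet<T>
where
    HashSet<T>: Clone,
    for<'a> &'a HashSet<T>: BitAnd<&'a HashSet<T>, Output = HashSet<T>>,
{
    match sets.split_first() {
        None => HashSet::new(),
        Some((first, rest)) => rest.iter().fold(first.clone(), |acc, set| &acc & set),
    }
}
```

In practice this is mostly a curiosity — the caller still has to satisfy the same std impl, which itself requires T: Eq + Hash + Clone — but it shows the bound expressing exactly the operation the body uses.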


@simonbuchan is right about the iterator part. reduce infers the type of acc from the iterator item type, which, in this case, is &HashSet<T>, and reduce cannot accumulate the result in that. However, the compiler's suggestion is indeed bogus and simplistic (though imagine what rustc could do if it had AI built in :grin:). Also, the & operator creates a new HashSet on every individual intersection, so that would result in a bunch of excessive allocations when you intersect multiple sets. What works in your case is:

use std::collections::HashSet;
use std::hash::Hash;

// Note: don't need a Vec here, a slice is sufficient.
// An iterator would be more generic, of course.
// Also, you don't need to consume (move) the input sets,
// hence a slice of `&HashSet<T>`.
fn intersections<T>(sets: &[&HashSet<T>]) -> HashSet<T>
where
    T: Clone + Eq + Hash,
{
    match sets.len() {
        0 => HashSet::new(),
        _ => sets[1..].iter().fold(sets[0].clone(), |mut acc, set| {
            acc.retain(|item| set.contains(item));
            acc
        }),
    }
}

#[test]
fn test_intersections() {
    let a: HashSet<_> = [1, 2, 3, 4].into();
    let b: HashSet<_> = [2, 3, 4, 5].into();
    let c: HashSet<_> = [3, 4, 5, 6].into();
    assert_eq!(intersections::<i32>(&[]), HashSet::new());
    assert_eq!(intersections(&[&[1].into()]), [1].into());
    assert_eq!(intersections(&[&a, &b, &c]), [3, 4].into());
    assert_eq!(intersections(&[&[1,2].into(), &[3,4].into()]), HashSet::new());
}

As to why you have to specify the Eq + Hash bound, that's only because they're required for & to work:

impl<T, S> BitAnd<&HashSet<T, S>> for &HashSet<T, S>
where
    T: Eq + Hash + Clone,
    S: BuildHasher + Default,

but HashSet struct itself has no bounds on it, so, for example, this function compiles just fine without the need to specify any bounds:

fn not_intersections<T>(_sets: &[HashSet<T>]) -> HashSet<T> {
    HashSet::new()
}

That would be all, but as a bonus, here's a version that accepts an iterator:

use std::collections::HashSet;
use std::hash::Hash;
use std::iter::{empty, once};

// Note: this version takes an iterator instead of a slice,
// which is more generic. It still only borrows the input sets,
// hence `Item = &'a HashSet<T>`.
fn intersections<'a, T>(mut sets: impl Iterator<Item = &'a HashSet<T>>) -> HashSet<T>
where
    T: Clone + Eq + Hash + 'a,
{
    match sets.next() {
        Some(first) => sets.fold(first.clone(), |mut acc, set| {
            acc.retain(|item| set.contains(item));
            acc
        }),

        None => HashSet::new(),
    }
}

#[test]
fn test_intersections() {
    let a: HashSet<_> = [1, 2, 3, 4].into();
    let b: HashSet<_> = [2, 3, 4, 5].into();
    let c: HashSet<_> = [3, 4, 5, 6].into();
    assert_eq!(intersections::<i32>(empty()), HashSet::new());
    assert_eq!(intersections(once(&[1].into())), [1].into());
    assert_eq!(intersections([a, b, c].iter()), [3, 4].into());
    assert_eq!(
        intersections(once(&[1, 2].into()).chain(once(&[3, 4].into()))),
        HashSet::new()
    );
}

Thanks for the explanation of the difference between iter and into_iter. IIUC, the error happens because the closure roughly has the signature Fn(T, T) -> T, but my implementation does Fn(&HashSet<T>, &HashSet<T>) -> HashSet<T>, right?

So, Collection<T>::into_iter() will then give me Iterator<Item = T>, right? If yes, why do I get the same error using that? Shouldn't my closure now be Fn(HashSet<T>, HashSet<T>) -> HashSet<T> and thus be compliant with the required signature for reduce?

Not sure I understand. Are you saying that I should follow the suggestion by the compiler, i.e. borrow like &(acc & set)? This then fails, because the branch inside the match now returns &HashSet<T> while the others return HashSet<T>. Trying to dereference here expectedly fails with

error[E0515]: cannot return reference to temporary value
  --> src/lib.rs:11:45
   |
11 |         _ => *sets.iter().reduce(|acc, set| &(acc & set)).unwrap(),
   |                                             ^-----------
   |                                             ||
   |                                             |temporary value created here
   |                                             returns a reference to data owned by the current function

The overarching question here is: why does it (presumably) work for the people in the Reddit thread? I mean the thread is 6 years old, so some things might have changed. Was this valid syntax in previous Rust versions?

So this is a "documentation" thing, right? Because by using HashSet<T> as annotation, T: Eq + Hash is already required by HashSet and the compiler checks this anyway.

Damn, that is smooth. Thanks a lot!


Basically, yes.

Because of a rather confusing combination of features in this case. into_iter() is a method on IntoIterator, which is also what's used to loop over collections, and therefore there's also an IntoIterator impl for &Collection<T> for the collections in std, which all return borrows of their items (&T) instead. This is why you can use a for loop over a collection directly (without any &) only once, and need to add a & to avoid moving it:

let items = vec![1, 2, 3];
let items_ref = &items;
for item in items_ref { // uses IntoIterator for &Vec<T>
  // item is &i32 here.
}
for item in items { // no &, uses IntoIterator for Vec<T>
  // item is i32, not &i32 here!
}
// error, items has been moved from:
for item in items { }

You are seeing the equivalent issue without the sugar of the for loop: you can't take the items of the argument sets by value, because that would mean moving them out of the collection; you are only borrowing the collection and can't destroy it like that.

This is made more confusing by another Rust feature: foo.bar() will automatically get turned into (&foo).bar() if it needs it, so early on in Rust you're conditioned to expect methods not to destroy their self even without a (visible) &. To ease this confusion, there's a convention that into_ methods like into_iter, into_string, into_box etc. will consume their self, but here the self being "destroyed" is only the reference. In short, you should avoid using into_iter on references; it's confusing! (I should have mentioned this in my last reply: I meant if you had sets without a reference.)
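
To make that concrete, here's a sketch of a variant that takes the Vec by value (an assumption — your original borrowed it). Now into_iter() yields owned sets, and the reduce shape from the original snippet type-checks:

```rust
use std::collections::HashSet;
use std::hash::Hash;

// By-value variant: into_iter() yields owned HashSet<T>, so the
// closure really is FnMut(HashSet<T>, HashSet<T>) -> HashSet<T>.
// Note that `&acc & &set` still allocates a fresh set per step.
fn intersections<T>(sets: Vec<HashSet<T>>) -> HashSet<T>
where
    T: Clone + Eq + Hash,
{
    sets.into_iter()
        .reduce(|acc, set| &acc & &set)
        .unwrap_or_default()
}
```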

Sorry, no, I misremembered what the suggestion was: for some reason I thought you had a HashSet<T> and a &HashSet<T>, which would make the correction look like &acc & set. You can see this with unioning pretty easily:

items.fold(HashSet::new(), |acc, item| &acc | item)

but since you're intersecting, you need the more fiddly:

// "if let" is essentially a shorter "match", if you haven't seen it yet
if let Some((first, rest)) = items.split_first() {
  rest.iter().fold(first.clone(), |acc, item| &acc & item)
} else {
  HashSet::new()
}

But as @burjui says, you don't want to be creating a new HashSet on each iteration anyways.

I don't see a version using reduce there? Six years ago is only a little after Rust 1.0, which shipped in 2015. Since then, the policy is effectively "don't make breaking changes" (with a couple of footnotes), so it's generally unlikely for old code to not compile, though there are "editions" that can affect things in some cases (generally just warnings becoming errors). Certainly you should never expect anything like Python 2 to 3!

I think @burjui covered this? Let me know if you're still confused.


No, it's about writing code that will compile for anyone if it compiled for you.

Rust generics are not C++ templates. Generic type parameters act as actual, semantic types. They are not, however, concrete types. There is no information as to their capabilities unless you specify trait bounds.

If this requirement of specifying trait bounds weren't in place, then you could write generic code that happens to work with some types you tried, but breaks when someone else substitutes types that you didn't foresee, but should nonetheless work. By requiring the trait bounds, and via the compiler rejecting code that relies on anything but the guarantees specified by those bounds, you get code that will work with exactly the set of types you envisioned.
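
A tiny illustration of that point, using a hypothetical function (not from the thread): the bound is what licenses the operation, independent of which concrete types callers happen to pass.

```rust
// Without `T: PartialOrd`, the comparison `x > best` below is rejected
// at definition time, even if every caller's concrete T would support it.
fn largest<T: PartialOrd>(items: &[T]) -> Option<&T> {
    let mut iter = items.iter();
    let mut best = iter.next()?;
    for x in iter {
        if x > best {
            best = x;
        }
    }
    Some(best)
}
```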


Nitpick: HashSet itself actually does not require T: Eq + Hash. Doc


This is still always cloning all elements of the first HashSet, which might be expensive, especially if the values are larger data structures. If the implementation of the & operator on HashSet is anything to go by, it could be beneficial to create an iterator of &Value, clone each value that remains at the end, and collect the result into a new HashSet. The only downside is that the structure of the first HashSet couldn't be re-used, but that structure might be of limited use anyway in cases where the resulting set is significantly smaller, and creating a HashSet from an iterator should be roughly linear-time anyway.

There's a second trick that & does, using the intersection method: that method in turn is implemented by first choosing the smaller of the two sets, and then filtering an iterator over that one.

This leads to the following approach:

  • first look for the smallest hash set in the list
  • then, filter a by-reference iterator over that set by whether the elements are contained in all the other sets
  • finally, clone and collect:

Here’s an implementation

fn intersections<T>(sets: &[&HashSet<T>]) -> HashSet<T>
where
    T: Clone + Eq + Hash,
{
    sets.iter()
        .enumerate()
        .min_by_key(|&(_, s)| s.len())
        .map(|(smallest_set_index, _)| {
            let (other_sets_left, [smallest_set, other_sets_right @ ..]) =
                sets.split_at(smallest_set_index) else { unreachable!() };
            let other_sets = || other_sets_left.iter().chain(other_sets_right);
            smallest_set
                .iter()
                .filter(|item| other_sets().all(|o| o.contains(item)))
                .cloned()
                .collect()
        })
        .unwrap_or_default()
}

Yep, I missed the opportunity with the smallest HashSet. Thanks for the tip. The funny thing is I implemented it this way before in one of my projects and had completely forgotten about that.

It does not? The second sentence in the docs reads:

As with the HashMap type, a HashSet requires that the elements implement the Eq and Hash traits.

Or are you saying that creating a new HashSet does not require this? If yes, I agree, but I feel like this is somewhat unrelated, since I'm talking about the elements in this set.

Maybe, I was not clear enough about my point. By annotating anything with HashSet<T>, the compiler knows that T needs to implement Eq + Hash, no? Otherwise, T could never be placed in the HashSet in the first place. Meaning, if I have a function like

fn intersections<T>(sets: &Vec<HashSet<T>>) -> HashSet<T> {...}

the compiler already knows that T needs to implement these, no? In other words adding these bounds explicitly should technically (I know nothing about the internals) not be required.

I don't dispute that adding this explicitly is a good thing. Furthermore, it also seems to be the case that the compiler requires them to be explicitly set. Maybe a better question would have been: "Is my mental model of how this works still flawed, or is there no theoretical limit to letting the compiler infer them automatically?"

I really appreciate all the answers here. Thanks for sticking around and helping me understand this better.

Yes, that's exactly it. The definition of the HashSet type doesn't have the Eq + Hash bounds.

No. The compiler knows nothing about the semantics of types. For all it knows, since the definition of HashSet does not imply the aforementioned bounds, you might as well be creating an empty set. Or you might only be using the set as a marker type, without ever interacting with any instances of it – the function signature doesn't say anything about what you will do with values of that type.

The compiler is not a human programmer that performs "common sense" reasoning, and for reasons of uniformity and clarity, that's probably for the better.

So no, your mental model is not right, and the compiler couldn't infer this (short of examining the body, but that would be a violation of abstraction/encapsulation, so it's intentionally avoided. Furthermore, it's not always possible to statically analyze the body and infer such constraints, this is a mathematical limitation.)


It's easy enough to see this isn't the case.

There's a concept related to your intuition called implied bounds. With implied bounds, and if the Eq + Hash requirements were on the HashSet definition itself, you wouldn't need to repeat the bounds everywhere -- the presence of HashSet<T> would imply T: Eq + Hash, similar to how supertrait bounds work, or how the presence of &'a T implies T: 'a.

However,

  • It's not stabilized yet (or implemented in the current compiler as far as I know)
  • It interferes with inference, which may present challenges moving forward
  • It makes the removal of bounds a breaking change where it wasn't before
  • It is already a breaking change to add bounds to existing types, so HashSet likely won't be able to take advantage of this anyway

It would also probably be a sea-change in Rust design -- a reason to avoid bounds on structs today is to avoid repeating them everywhere, whereas with implied bounds, the norm may shift to adding the bounds so you don't have to state them elsewhere. (I'm personally not sure which is better or worse; I think implied bounds will make it easier to paint yourself into a corner semver-wise, and to make code a lot less obvious, as functions will potentially have a ton of unstated bounds imported from far away.)
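
The supertrait mechanism mentioned above can be sketched today (a toy example, unrelated to HashSet): a bound declared on the trait itself is implied at every use site, so it never needs restating.

```rust
use std::fmt::Debug;

// `Pretty: Debug` means every implementor is known to be Debug,
// so users of `T: Pretty` never restate the Debug bound.
trait Pretty: Debug {
    fn pretty(&self) -> String {
        format!("{self:?}")
    }
}

#[derive(Debug)]
struct Point(i32, i32);
impl Pretty for Point {}

// No explicit `T: Debug` needed here: it is implied by `T: Pretty`.
fn show<T: Pretty>(value: &T) -> String {
    value.pretty()
}
```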


Oh wow, thanks a lot for the counter example @quinedot. This basically renders my points above moot. Thanks to others as well being patient and providing some additional points!


Maybe, as a (hopefully) final question: why does your counter example work though? Going by the docs, T should implement Eq + Hash. Since we are clearly not just creating a new set, but actually inserting into one, why don't we need to put these bounds explicitly?

Looking at the source, insert(&mut self, value: T) is part of the impl<T, S> HashSet<T, S> where T: Eq + Hash, ... {...} block. So, for me it seems set.insert(t) should require T: Eq + Hash for t: T. What am I missing here?

Did you click run? It gives a compile error saying that the method exists, but you can't use it because you didn't require the bounds on your function declaration.

If you comment out the insert, everything compiles, because you're not doing anything that requires T: Eq + Hash or any other (non-implicit) bounds.
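
A minimal demonstration of that split (hypothetical function name): operations that live in the unbounded impl blocks compile with a bare T, while insert would not.

```rust
use std::collections::HashSet;

// Compiles with no bounds on T: `retain` (like `new`) lives in an
// impl block that doesn't require `T: Eq + Hash`.
fn keep_all<T>(mut set: HashSet<T>) -> HashSet<T> {
    set.retain(|_| true);
    set
    // `set.insert(value)` would fail to compile here, since `insert`
    // sits in the impl block bounded by `T: Eq + Hash, S: BuildHasher`.
}
```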


Let me just quote others above.


Yeah, I think I got it now. Even though each element T of HashSet<T> needs to support Eq + Hash, a function does not need to have these bounds unless it is using a method of HashSet that requires them. For example, neither new() nor .retain() requires them, whereas from_iter and insert do. In other words, these bounds are only required in case I actually want to insert elements some way or another into the set. If I just create an empty one or remove elements and so on, these bounds are irrelevant.

Is that correct now?


Almost. It's not about inserting vs. removing elements, it's whether the hash set internally needs to perform operations specific to a hash set. So, in particular, .remove() still absolutely does need the Hash + Eq bounds, because hashing and equality comparison is how you locate the specific element to be removed.

Filtering is not "removal", and it doesn't need the bounds because filtering is not an operation specific to hash sets. Filtering works with any iterable, and .retain() works regardless of the key type, since these operations look at every element anyway (so no hashing-based optimization occurs), and the predicate is provided externally (so items don't need to be intrinsically equality-aware). Furthermore, removal of an element of which the address or index in the hash table is known (since we just got it by iterating over the set itself) doesn't require further disambiguation, hashing, or equality checks.


Thanks a lot for the last puzzle pieces. Everything above makes much more sense now!

Yes, these decisions aren't ad-hoc – they have good (technical or stylistic) reasons. Generally, it's idiomatic to only constrain types as much as absolutely necessary so as to make generic types and functions easier to use. This in turn means that bounds are often basically "technical", i.e., necessary for the functioning of the code, not slapped on "just to be sure". (The former does occur though, but it's mostly in code that tries to perform type-level computation or abstract away less-strongly-typed, lower level details.)
