Intersecting multiple HashSet<String>

Hi!

I'm trying to get the intersection of multiple sets of strings, and I'm wondering if there's a more elegant way than what I'm using so far:

let binaries: Vec<HashSet<String>> = ...
let common_binaries: HashSet<&str> = binaries[0]
    .iter()
    .filter(|b| binaries[1..].iter().all(|set| set.contains(*b)))
    .map(|b| &b[..])
    .collect();

I'm computing the intersection manually via "all" and "contains" because there doesn't seem to be an intersection function that takes multiple sets (https://github.com/rust-lang/rfcs/issues/2023). However, the explicit accesses to the first and the remaining elements of the vector don't seem very elegant...

Then I'm not sure about String vs str: the String instances in "binaries" outlive the intersection, so I'd like to use references to avoid copying the strings. I guess an alternative to the above would be to use "&String" instead of "&str", which would allow me to drop the call to "map" - would that be idiomatic?

Thanks for your help!
Sebastian

If you are bound to use HashSet, I don't see any other way. The problem of this solution is not its complexity. If we take N to be the number of sets, and M to be number of elements in the biggest set, the complexity seems to be like O(N * M), and I don't think it could be done in better complexity.

The real problem is that it seems to have terrible cache eficiency for big sets. I wonder, if the better solution would be collect the elements to vectors, sort them, and after all perform merging like in last step of merge sort, but dropping repeating values in the process. My intuition is, that it would be much better for cache for small number of big sets (which tends to be a more common case than big number of small sets to merge).

It terms of elegance - your solution seems pretty optimal for me.

I think you can do it with repeated intersection. Unfortunately, my attempt is a bit ugly:

    let intersect = binaries
        .iter()
        .fold(None, |acc: Option<HashSet<&str>>, hs| {
            let hs = hs.iter().map(|s| s.deref()).collect(); // horrible
            acc.map(|a| a.intersection(&hs).map(|s| *s).collect())
                .or(Some(hs))
        });

Playground

I like the fold with Option but the ref/deref cycle is annoying.

Seems like a little function to turn those HashSet<String> into HashSet<&str> would make a solution a lot cleaner, like:

fn hash_to_ref(h: &HashSet<String>) -> HashSet<&str> {
    h.iter().map(|s| s.as_str()).collect()
}

let common_binaries: HashSet<&str> = if let Some((first, rest)) = binaries.split_first() {
    rest.iter().fold(hash_to_ref(first), |acc, i| {
        acc.intersection(&hash_to_ref(i)).copied().collect()
    })
} else {
    HashSet::new()
};

though constructing a lot of temporary HashSets may not provide optimal performance.

I think that the most idiomatic solution would be to use map(String::as_str) at the end of the chain. I would definitely not consider creating a lot of temporary hash sets idiomatic in this case.

The fold is nice indeed! I tried it before but didn't come up with the idea of using Option<_> for the accumulator...

Right, I'm not concerned about the algorithmic complexity of my solution - it's more the feeling that I'm trying to do something relatively simple but writing a lot of code to express it...

Regarding cache efficiency, I believe both C++ and D favor your approach of walking over sorted vectors. Maybe that's an indication that someone evaluated it and found it faster than hash-based approaches...

1 Like

A less functional brute-force approach might be to just do a flat_map and count the binaries, which should perform well enough but not look as nice:

let mut binaries_count: HashMap<&str, usize> = HashMap::new();
for s in binaries.iter().flat_map(|h| h.iter().map(String::as_str)) {
    *binaries_count.entry(s).or_insert(0) += 1;
}
let common_binaries: HashSet<&str> = binaries_count
    .iter()
    .filter(|(_, &count)| count == binaries.len())
    .map(|(&s, _)| s)
    .collect();
1 Like

Yet another approach (should be more cache efficient because each the first HashSet is queried, then the second one and so on):

let mut common_binaries = binaries.pop().unwrap().iter();
for current_set in binaries {
    common_binaries = common_binaries.filter(|it| current_set.contains(*it));
}

But I get an error because there seem to be different iter types. This works:

let mut common_binaries: HashSet<&str> = binaries.pop().unwrap();
for current_set in binaries {
    common_binaries = common_binaries.iter().filter(|it| current_set.contains(*it)).cloned().collect();
}

but now I have a lot of iter() and collect().

Also, it is rather imperative; should be possible to do this via fold, though.

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=c813341598b2a1f3842718864d165b2b

1 Like

Assuming a single lookup takes O(log n) time, the intersection of 2 sets is O(n log n). This can be accomplished e.g. by sorting the 2 sets into Vec's, and then scanning the 2 Vecs in parallel to each other. This latter part keeps the "is value in set" checks at O(n) rather than a naive check which would be O(n²).

The above would make repeated intersection O(n² log n) because the operation is repeated n-1 times i.e. calculating the intersection of N sets is a fairly expensive operation.

If lookup takes O(1), then set intersection is O(n) and repeated set intersection ends up at O(n²).

Hash sets have O(1) lookup.

Then this part holds :slight_smile:

This might be an implementation for your idea @hashedone

And, thanks to @alice, another correct solution: playground

Note that I don't recommend boxing a whole lot of iterators like this. The main idea in the given playground is the use of Option::into_iter, which definitely does not require the use of that kind of deep indirection.