Intersecting multiple HashSet<String>

sebastianpoeplau · October 31, 2019, 10:31am

Hi!

I'm trying to get the intersection of multiple sets of strings, and I'm wondering if there's a more elegant way than what I'm using so far:

let binaries: Vec<HashSet<String>> = ...
let common_binaries: HashSet<&str> = binaries[0]
    .iter()
    .filter(|b| binaries[1..].iter().all(|set| set.contains(*b)))
    .map(|b| &b[..])
    .collect();

I'm computing the intersection manually via "all" and "contains" because there doesn't seem to be an intersection function that takes multiple sets (Support for Intersection of multiple Sets at the same time · Issue #2023 · rust-lang/rfcs · GitHub). However, the explicit accesses to the first and the remaining elements of the vector don't seem very elegant...

Then I'm not sure about String vs str: the String instances in "binaries" outlive the intersection, so I'd like to use references to avoid copying the strings. I guess an alternative to the above would be to use "&String" instead of "&str", which would allow me to drop the call to "map" - would that be idiomatic?

Thanks for your help!
Sebastian

hashedone · October 31, 2019, 11:51am

If you are bound to use HashSet, I don't see any other way. The problem of this solution is not its complexity. If we take N to be the number of sets, and M to be number of elements in the biggest set, the complexity seems to be like O(N * M), and I don't think it could be done in better complexity.

The real problem is that it seems to have terrible cache eficiency for big sets. I wonder, if the better solution would be collect the elements to vectors, sort them, and after all perform merging like in last step of merge sort, but dropping repeating values in the process. My intuition is, that it would be much better for cache for small number of big sets (which tends to be a more common case than big number of small sets to merge).

It terms of elegance - your solution seems pretty optimal for me.

newbie101 · October 31, 2019, 12:37pm

I think you can do it with repeated intersection. Unfortunately, my attempt is a bit ugly:

    let intersect = binaries
        .iter()
        .fold(None, |acc: Option<HashSet<&str>>, hs| {
            let hs = hs.iter().map(|s| s.deref()).collect(); // horrible
            acc.map(|a| a.intersection(&hs).map(|s| *s).collect())
                .or(Some(hs))
        });

Playground

I like the fold with Option but the ref/deref cycle is annoying.

tuffy · October 31, 2019, 1:41pm

Seems like a little function to turn those HashSet<String> into HashSet<&str> would make a solution a lot cleaner, like:

fn hash_to_ref(h: &HashSet<String>) -> HashSet<&str> {
    h.iter().map(|s| s.as_str()).collect()
}

let common_binaries: HashSet<&str> = if let Some((first, rest)) = binaries.split_first() {
    rest.iter().fold(hash_to_ref(first), |acc, i| {
        acc.intersection(&hash_to_ref(i)).copied().collect()
    })
} else {
    HashSet::new()
};

though constructing a lot of temporary HashSets may not provide optimal performance.

alice · October 31, 2019, 1:57pm

I think that the most idiomatic solution would be to use map(String::as_str) at the end of the chain. I would definitely not consider creating a lot of temporary hash sets idiomatic in this case.

sebastianpoeplau · October 31, 2019, 2:02pm

The fold is nice indeed! I tried it before but didn't come up with the idea of using Option<_> for the accumulator...

sebastianpoeplau · October 31, 2019, 2:11pm

Right, I'm not concerned about the algorithmic complexity of my solution - it's more the feeling that I'm trying to do something relatively simple but writing a lot of code to express it...

Regarding cache efficiency, I believe both C++ and D favor your approach of walking over sorted vectors. Maybe that's an indication that someone evaluated it and found it faster than hash-based approaches...

tuffy · October 31, 2019, 2:40pm

A less functional brute-force approach might be to just do a flat_map and count the binaries, which should perform well enough but not look as nice:

let mut binaries_count: HashMap<&str, usize> = HashMap::new();
for s in binaries.iter().flat_map(|h| h.iter().map(String::as_str)) {
    *binaries_count.entry(s).or_insert(0) += 1;
}
let common_binaries: HashSet<&str> = binaries_count
    .iter()
    .filter(|(_, &count)| count == binaries.len())
    .map(|(&s, _)| s)
    .collect();

Hocuri · October 31, 2019, 4:45pm

Yet another approach (should be more cache efficient because each the first HashSet is queried, then the second one and so on):

let mut common_binaries = binaries.pop().unwrap().iter();
for current_set in binaries {
    common_binaries = common_binaries.filter(|it| current_set.contains(*it));
}

But I get an error because there seem to be different iter types. This works:

let mut common_binaries: HashSet<&str> = binaries.pop().unwrap();
for current_set in binaries {
    common_binaries = common_binaries.iter().filter(|it| current_set.contains(*it)).cloned().collect();
}

but now I have a lot of iter() and collect().

Also, it is rather imperative; should be possible to do this via fold, though.

jjpe · October 31, 2019, 5:10pm

Assuming a single lookup takes O(log n) time, the intersection of 2 sets is O(n log n). This can be accomplished e.g. by sorting the 2 sets into Vec's, and then scanning the 2 Vecs in parallel to each other. This latter part keeps the "is value in set" checks at O(n) rather than a naive check which would be O(n²).

The above would make repeated intersection O(n² log n) because the operation is repeated n-1 times i.e. calculating the intersection of N sets is a fairly expensive operation.

If lookup takes O(1), then set intersection is O(n) and repeated set intersection ends up at O(n²).

alice · October 31, 2019, 5:21pm

Hash sets have O(1) lookup.

jjpe · October 31, 2019, 5:22pm

Then this part holds

Hocuri · November 3, 2019, 7:08pm

This might be an implementation for your idea @hashedone

Hocuri · November 3, 2019, 7:23pm

And, thanks to @alice, another correct solution: playground

alice · November 3, 2019, 7:37pm

Note that I don't recommend boxing a whole lot of iterators like this. The main idea in the given playground is the use of Option::into_iter, which definitely does not require the use of that kind of deep indirection.

system · February 1, 2020, 7:37pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
HashSet<String> intersection HashSet<&str> help	10	4159	August 18, 2019
Intersection of multiple HashSet's? help	21	1419	March 6, 2023
Intersection of hashsets help	4	2452	December 9, 2019
Efficiently computing set intersection and difference help	7	614	May 2, 2020
Returning HashSet intersection from function help	5	416	March 19, 2023

Intersecting multiple HashSet<String>

Related Topics