Then I'm not sure about String vs str: the String instances in "binaries" outlive the intersection, so I'd like to use references to avoid copying the strings. I guess an alternative to the above would be to use "&String" instead of "&str", which would allow me to drop the call to "map" - would that be idiomatic?
If you are bound to use HashSet, I don't see any other way. The problem of this solution is not its complexity. If we take N to be the number of sets, and M to be number of elements in the biggest set, the complexity seems to be like O(N * M), and I don't think it could be done in better complexity.
The real problem is that it seems to have terrible cache eficiency for big sets. I wonder, if the better solution would be collect the elements to vectors, sort them, and after all perform merging like in last step of merge sort, but dropping repeating values in the process. My intuition is, that it would be much better for cache for small number of big sets (which tends to be a more common case than big number of small sets to merge).
It terms of elegance - your solution seems pretty optimal for me.
I think that the most idiomatic solution would be to use map(String::as_str) at the end of the chain. I would definitely not consider creating a lot of temporary hash sets idiomatic in this case.
Right, I'm not concerned about the algorithmic complexity of my solution - it's more the feeling that I'm trying to do something relatively simple but writing a lot of code to express it...
Regarding cache efficiency, I believe both C++ and D favor your approach of walking over sorted vectors. Maybe that's an indication that someone evaluated it and found it faster than hash-based approaches...
Yet another approach (should be more cache efficient because each the first HashSet is queried, then the second one and so on):
let mut common_binaries = binaries.pop().unwrap().iter();
for current_set in binaries {
common_binaries = common_binaries.filter(|it| current_set.contains(*it));
}
But I get an error because there seem to be different iter types. This works:
let mut common_binaries: HashSet<&str> = binaries.pop().unwrap();
for current_set in binaries {
common_binaries = common_binaries.iter().filter(|it| current_set.contains(*it)).cloned().collect();
}
but now I have a lot of iter() and collect().
Also, it is rather imperative; should be possible to do this via fold, though.
Assuming a single lookup takes O(log n) time, the intersection of 2 sets is O(n log n). This can be accomplished e.g. by sorting the 2 sets into Vec's, and then scanning the 2 Vecs in parallel to each other. This latter part keeps the "is value in set" checks at O(n) rather than a naive check which would be O(n²).
The above would make repeated intersection O(n² log n) because the operation is repeated n-1 times i.e. calculating the intersection of N sets is a fairly expensive operation.
If lookup takes O(1), then set intersection is O(n) and repeated set intersection ends up at O(n²).
Note that I don't recommend boxing a whole lot of iterators like this. The main idea in the given playground is the use of Option::into_iter, which definitely does not require the use of that kind of deep indirection.