Pearl: Folding an `Iterator<Item = &HashSet<T>>`

std provides HashSet::intersection, but it comes with some caveats:

  1. Both self and the other HashSet must be borrowed.
  2. The Intersection iterator yields Item = &'a T.
  3. It only operates over two HashSets at a time.
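For example, even the simple two-set case borrows both sets and hands back references:

```rust
use std::collections::HashSet;

fn main() {
    let a: HashSet<i32> = [1, 2, 3].into_iter().collect();
    let b: HashSet<i32> = [2, 3, 4].into_iter().collect();

    // Both sets are borrowed, and the iterator yields &i32, not i32:
    let common: HashSet<&i32> = a.intersection(&b).collect();
    assert!(common.len() == 2 && common.contains(&2) && common.contains(&3));
}
```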

I've just hit a case again where I have Iterator<Item = &HashSet<T>> and I want to Iterator::fold (or reduce) these according to intersection logic. However, those functions require some owned acc to be passed through, and it's not clear to me how to achieve that, given that:

  • The zero value can't start as an empty HashSet or else intersection won't work.
  • A la reduce, I can't have the zero be &HashSet (say, the first element) because there's no way to produce one from the intersection of two HashSets in later iterations.
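To make the first bullet concrete, here's a sketch of the dead end: if the fold's zero value is an empty HashSet, every intersection along the way is empty.

```rust
use std::collections::HashSet;

fn main() {
    let sets: Vec<HashSet<i32>> = vec![
        [1, 2, 3].into_iter().collect(),
        [2, 3, 4].into_iter().collect(),
    ];
    // intersect(empty, S) is always empty, so this folds down to nothing:
    let wrong: HashSet<i32> = sets
        .iter()
        .fold(HashSet::new(), |acc, s| acc.intersection(s).copied().collect());
    assert!(wrong.is_empty()); // not the {2, 3} we wanted
}
```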

I'm obviously trying to produce a final HashSet (or Iterator) of values common to all the parent HashSets. How should I go about it?

Thank you kindly.

I couldn't think of a way to fold iterators that didn't involve nested boxes as deep as the number of hash sets (to erase a recursive type of unknown depth), so I figure collecting the sets is more palatable.

// "iter-section", get it?
fn iter_section<'hs, I, T>(mut iter: I) -> impl Iterator<Item = &'hs T>
where
    I: Iterator<Item=&'hs HashSet<T>>,
    T: 'hs + Eq + Hash,
{
    let root = iter.next().into_iter().flat_map(|hs| hs.iter());
    let rest: Vec<_> = iter.collect();
    root.filter(move |item| {
        rest.iter().all(|hs| hs.contains(item))
    })
}

Playground.

If you could take a slice of HashSet, it could be nicer (?).

3 Likes

The Intersection iterator doesn't do anything fancy: it just loops over the first set, calling contains on the second set for every value. So we can extend that logic to more than 2 sets using HashSet::retain() (Rust Playground):

use std::collections::HashSet;
use std::hash::Hash;

fn intersection<'a, T, I>(sets: I) -> HashSet<T>
where
    T: Clone + Eq + Hash + 'a,
    I: IntoIterator<Item = &'a HashSet<T>>,
{
    let mut sets = sets.into_iter();
    let mut result = match sets.next() {
        Some(set) => set.clone(),
        None => return HashSet::new(),
    };
    for set in sets {
        result.retain(|x| set.contains(x));
        if result.is_empty() {
            break;
        }
    }
    result
}

Effectively, this solution collects all the values together, while @quinedot's solution collects all the sets together. Which is better depends on how you are generating the sets.

3 Likes

Thanks for those formulations. In my own attempts so far I haven't been able to formulate something that avoids internal allocations either, even though it feels like it should be morally possible.

I'll sleep on it.

The problem is, each candidate element from the first set must be checked against every other set to determine if it should be retained. If you iterate over elements one at a time, then all of the sets must be collected, since otherwise the Iterator of sets would be entirely consumed by the second element. (Unless your Iterator can be cheaply restarted, that is.) So if you want to iterate over sets one at a time, then all of the elements must be collected, so that each new set can be tested against every element before being dropped. There's no way around this AFAIK.
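The "cheaply restarted" escape hatch can be sketched like this, assuming the set-producing iterator happens to be Clone (the function name is made up for illustration):

```rust
use std::collections::HashSet;
use std::hash::Hash;

// Sketch: if the iterator of sets is Clone (i.e. cheap to restart),
// we can re-scan it once per candidate element instead of collecting
// either the sets or the elements.
fn intersection_restartable<'a, T, I>(mut sets: I) -> impl Iterator<Item = &'a T>
where
    T: Eq + Hash + 'a,
    I: Iterator<Item = &'a HashSet<T>> + Clone,
{
    let first = sets.next();
    let rest = sets;
    first
        .into_iter()
        .flat_map(|hs| hs.iter())
        // Restart the remaining-sets iterator for every candidate.
        .filter(move |item| rest.clone().all(|hs| hs.contains(*item)))
}

fn main() {
    let a: HashSet<i32> = [1, 2, 3].into_iter().collect();
    let b: HashSet<i32> = [2, 3].into_iter().collect();
    let mut v: Vec<i32> = intersection_restartable([&a, &b].iter().copied())
        .copied()
        .collect();
    v.sort();
    assert_eq!(v, vec![2, 3]);
}
```

Of course, this trades the allocation for re-running the iterator O(elements) times, which is exactly the recomputation cost described above.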

2 Likes

Or, to put it this way: the result of the intersection operation can be represented most minimally as a bitset, which indicates for each candidate element whether it is in every set. Since none of the individual sets contains this information, extra bits need to be created somewhere to store the result. Unless you are willing to recompute this information on the fly for every element you are interested in, there is no way to store that much data without it consuming a dynamic amount of space.
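That minimal representation can be made literal, as a rough sketch (function name invented, two-argument shape chosen just to keep it short): one bool per candidate element of the first set, winnowed by each subsequent set.

```rust
use std::collections::HashSet;
use std::hash::Hash;

// Sketch: the intermediate state is one "still alive" bit per
// candidate element of the first set.
fn intersection_bits<'a, T: Eq + Hash>(
    first: &'a HashSet<T>,
    rest: &[&HashSet<T>],
) -> Vec<&'a T> {
    let items: Vec<&T> = first.iter().collect();
    let mut alive = vec![true; items.len()];
    for set in rest {
        for (i, item) in items.iter().enumerate() {
            if alive[i] && !set.contains(*item) {
                alive[i] = false;
            }
        }
    }
    items
        .into_iter()
        .zip(alive)
        .filter_map(|(item, keep)| keep.then_some(item))
        .collect()
}

fn main() {
    let a: HashSet<i32> = [1, 2, 3].into_iter().collect();
    let b: HashSet<i32> = [2, 3, 4].into_iter().collect();
    let mut v: Vec<i32> = intersection_bits(&a, &[&b]).into_iter().copied().collect();
    v.sort();
    assert_eq!(v, vec![2, 3]);
}
```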

2 Likes

I had thought of that, but in the general case we can't guarantee that the Iterator contents didn't come from somewhere expensive :frowning:

Which is effectively the retain solution, where you clone the first set in the collection and then at least have something owned to work with. I may have to go with that.

If callbacks are acceptable, you can avoid heap allocations by using the stack as storage via recursion.

Playground which I made no attempt to optimize or clean up, e.g. linked list traversal should probably just be a loop. Still probably less performant than the other approaches regardless.

This still generates every iterator though, which is apparently the costly part. I guess this means you're often expecting to generate an empty set.

Edit: The below can't work as outlined because you'll be popping the HashSets as you unwind. That's what I get for thinking out loud :sweat_smile:.

I think it could be built upon like so, however:
  • Recurse on items in the first hashset instead; recursion returns a StackLink
  • At the bottom, build up a StackLink from the iterator, testing against the last item
    • Found a HashSet that doesn't contain the item? Return the StackLink
    • Iterator finished? Call f(item), then return StackLink
  • Handling returns / unwinding the recursion
    • Check item at this level against current StackLink
    • If it succeeds, do the check-iterator-and-maybe-extend-StackLink procedure
    • Else just return the StackLink

But that seems like an awful lot of work to avoid allocation.

3 Likes

I guess they always are in some sense given the rest of this topic:

fn intersection_set<'hs, T, I, F>(mut iter: I) -> HashSet<T>
where
    I: Iterator<Item = &'hs HashSet<T>>,
    T: 'hs + Eq + Hash + Clone,
{
    let mut hs = HashSet::new();
    intersection(iter, |item| { hs.insert(item.clone()); });
    hs
}

That is really clever, I must admit.

1 Like

Is there a missing F in the argument list?

For intersection_set? No, it's just left-over from a copy-paste and the F parameter should be removed.

A messy PoC for the recurse-items-delay-HashSet approach. I'm not sure there's a safe (as in no unsafe) way to actually winnow the linked list, as making next mutable introduces invariance in the stack lifetime. So aside from being a stack overflow hazard, a large initial set could be quite slow to work with.

You could keep a cursor to the first still-retained position in the list for some marginal improvement. (I didn't spend much time convincing myself I did this one right.)

1 Like

And finally, a version that uses the stack for the HashSets like before, but only creates them lazily (when encountering an item that was present in all previous sets). This is sort of like the idea I outlined that doesn't work, except we pass a current item down the stack and don't unwind until we're done.

Probably could be cleaned up a bit. Still not convinced it's better than allocation, but I guess that's use-case specific.

1 Like

You can likely speed it up by processing the HashSets in order of size. This should avoid a lot of comparisons if the first couple of HashSets are huge but the third is tiny. You just need to allocate a Vec<&HashSet>, which should be pretty small overhead, especially if the first HashSet is huge (since that's the one that gets cloned).

That latest exercise was to avoid collecting the HashSets / minimize calls to the HashSet producing iterator while avoiding clones / allocation, but yes, if you collect them that would be a worthy consideration.

All you really need is the smallest, given how retain works and assuming a non-horrible hasher. [1]


  1. The playground is untested. ↩ī¸Ž
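Since the playground is untested, here's a rough sketch of the smallest-only idea (function name invented): a single pass with min_by_key replaces the full sort, because retain only ever shrinks the candidates.

```rust
use std::collections::HashSet;
use std::hash::Hash;

// Sketch: clone only the smallest set, then retain against the rest.
// No full sort is needed.
fn intersection_smallest<'a, T, I>(sets: I) -> HashSet<T>
where
    T: Clone + Eq + Hash + 'a,
    I: IntoIterator<Item = &'a HashSet<T>>,
{
    let sets: Vec<_> = sets.into_iter().collect();
    let Some(smallest) = sets.iter().min_by_key(|s| s.len()) else {
        return HashSet::new();
    };
    let mut result: HashSet<T> = (*smallest).clone();
    for s in &sets {
        // Re-checking the smallest set against itself is a no-op.
        result.retain(|x| s.contains(x));
    }
    result
}

fn main() {
    let a: HashSet<i32> = (0..100).collect();
    let b: HashSet<i32> = [5, 6, 7].into_iter().collect();
    let mut v: Vec<i32> = intersection_smallest([&a, &b]).into_iter().collect();
    v.sort();
    assert_eq!(v, vec![5, 6, 7]);
}
```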

1 Like

Larger HashSets are more likely to have cache misses, so ideally we minimize the number of calls to contains against them. Mine may be worse if all the HashSets happily fit in the CPU cache.

1 Like

For my specific use-case I expect a low N for the number of HashSets in the Iterator, but the sets themselves could be quite large. So while I would never suggest this is the best general solution, for now I'm going to go with this, a fusion of @LegionMammal978's solution and the pre-sorting suggestion by @Cocalus:

use std::collections::HashSet;
use std::hash::Hash;

fn intersections<'a, I, T>(iter: I) -> HashSet<&'a T>
where
    I: Iterator<Item = &'a HashSet<&'a T>>,
    T: Eq + Hash + 'a,
{
    let mut sets: Vec<_> = iter.collect();
    // Sort largest-first, so that pop() yields the smallest set.
    sets.sort_by(|a, b| b.len().cmp(&a.len()));

    match sets.pop() {
        None => HashSet::new(),
        Some(smallest) => {
            let mut res = smallest.clone();
            sets.iter().for_each(|s| res.retain(|t| s.contains(t)));
            res
        }
    }
}

Yes, I realize I'm not breaking out early, but again, since I expect low N, that's not the end of the world. I'd prefer the cleaner code.

Another specific point about my use case: I know that the Ts themselves are also borrowed, so the clone shouldn't actually hurt. Note that T has no Clone constraint here, guaranteeing that we're cloning the references, not the underlying data.
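A small check of that claim (with a made-up type for illustration): cloning a HashSet<&T> only copies the references, so it works even when T itself is not Clone.

```rust
use std::collections::HashSet;

#[derive(PartialEq, Eq, Hash)]
struct NotClone(u32); // deliberately not Clone

fn main() {
    let items = [NotClone(1), NotClone(2)];
    let set: HashSet<&NotClone> = items.iter().collect();
    // Fine: &NotClone is Copy, so only references are duplicated.
    let cloned = set.clone();
    assert_eq!(cloned.len(), 2);
    assert!(cloned.contains(&&items[0]));
}
```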

Out of curiosity, how are these HashSets being generated? You seemed concerned about generation time at first [1], yet this isn't a lending iterator situation, and you're receiving references to existing-elsewhere HashSets.


  1. or I just misunderstood ↩ī¸Ž

Slightly generalized and simplified (Playground):

use std::collections::HashSet;
use std::hash::Hash;

fn intersections<'a, I, T>(iter: I) -> HashSet<&'a T>
where
    I: IntoIterator<Item = &'a HashSet<&'a T>>,
    T: Eq + Hash + 'a,
{
    let mut sets: Vec<_> = iter.into_iter().collect();
    // Ascending by size: the first element is the smallest set.
    sets.sort_by_key(|x| x.len());

    sets.split_first().map_or_else(HashSet::new, |(&smallest, rest)| {
        let mut result = smallest.clone();
        for s in rest {
            result.retain(|t| s.contains(t));
        }
        result
    })
}

4 Likes