Efficiently computing set intersection *and* difference

I'm very new to Rust so bear with me if this is a rookie/stupid question. Note also I come from a C++ background, so feel free to provide analogies with C++ if that helps.

I'm trying to implement a function which will take two (Hash)Sets A and B and do the following:

  • Modify A -> A n (A\B)
  • Return AnB (before A is modified)

Put another way, I want to remove everything in B from A and return what was removed (but don't assume that B is contained in A).

I'd like to know how to do this efficiently, i.e. I don't want to create lots of "copies" of objects (possibly I mean clones here in Rust language). In my mind it should be possible to basically split A into two parts and transfer ownership in such a way that no new allocations (copies/clones) are needed.

I tried this originally with .intersection and .difference and fell flat on my face. After being lost in the woods with references and lifetimes for a while I came to the conclusion that these methods wouldn't allow me to do what I want without needless copying.

The solution I came up with is below. I've simplified it a little as my original code had generics. I now just have HashSets of some custom struct SomeData.

My questions are:

  • Is this as efficient as it can be?

  • Am I right in my conclusion that .intersection and .difference wouldn't be able to help me here?

      #[derive(Eq, PartialEq, Hash, Debug, Clone)]
      struct SomeData {
          i: i32,
      }
    
      fn remove_values(all_values: &mut HashSet<SomeData>, remove_values: & HashSet<SomeData> ) -> HashSet<SomeData> {
          // Whatever way I do this, I am always going to have to create a new container for the return
          // values
          let mut removed_values: HashSet<SomeData> = HashSet::new();
    
          for remove_me in remove_values.iter() {
              // My understanding here is that `take` will not just remove the value from all_values (if 
              // it exists) it will also give me ownership of the value, i.e. C++ equivalent to move 
              // semantics? 
              if let Some(removed) = all_values.take(remove_me) {
                  removed_values.insert(removed);
              }
          }
          removed_values
      }
    

As an extra note, one thing I tried to do was prove this was as efficient as I thought. I took my naive C++ brain and figured I could just print out pointer values and see what was in my sets before and after calling this method:

let mut a: HashSet<_> = [SomeData{i:1}, SomeData{i:2}, SomeData{i:3}, SomeData{i:4}, SomeData{i:5}].iter().cloned().collect();
let b: HashSet<_> = [SomeData{i:4}, SomeData{i:5}, SomeData{i:6}].iter().cloned().collect();
for aval in a.iter() {
    println!("a before: {:?} {:p}", aval, aval);
}
for bval in b.iter() {
    println!("b before: {:?} {:p}", bval, bval);
}

let x = remove_values(& mut a, &b);

for aval in a.iter() {
    println!("a after: {:?} {:p}", aval, aval);
}

for xval in x.iter() {
    // Expect the addresses here to have been originally in a
    println!("removed: {:?} {:p}", xval, xval);
}

My expectation was that the "removed" pointers would have originally been in set a but they weren't. After some reflection, I believe this makes sense. My bet is that this is really equivalent to C++ move semantics. The values in the sets are actually on the stack. Whilst my code might be efficient, there's basically a bunch of places in the code which create new stack values and "move" (transfer ownership) of the data in the SomeData structs. So it's likely not a surprise that pointers are different.

If this is true then I'm not 100% sure how I could prove this any other way. My thought was to implement Clone myself and log out clones. I saw none, which I assume means I'm only getting Copyies, which I think mean I'm only getting moves in C++ language.

Sorry if this is a little unclear, I'm still getting my head around a few things.

Thansk

I'm not sure I understand the question. Is this correct?

  1. We start out with old_A, old_B

  2. We want to compute new_A = set_minus(old_A, old_B), and new_B = intersect(old_A, old_B)

  3. We want to do this with as little copying as possible, (and it's okay to consume old_A, old_B in the process).

Is the above a correct statement of the problem?

Correct on all cases :+1:

1 Like

I don't have a Rust IDE on this machine, and I can't write perfect Rust code without an IDE, so pseudocode is the best I can do:

let mut old_A: HashSet<...> = ...;
let mut old_B: HashSet<...> = ...;

let mut new_B = HashSet::new();
// no new_A as we're going to modify old_A in place

// it's important that we are using into_iter rather than iter, as we're going to consume old_B

for x in old_B.into_iter() {
  if old_A.contains(&x) {
    old_A.remove(&x);
    new_B.insert(x); // no copying here
  }
}

I think the gist of that should work.

Here you go:

/// When this function returns, set1 contains all elements only found in
/// set1, and set2 contains all elements found in both sets.
fn intersect_diff<T: Hash + Eq>(set1: &mut HashSet<T>, set2: &mut HashSet<T>) {
    set2.retain(|item| set1.remove(item));
}

playground

11 Likes

I'm not sure, but is that O(n log n)?
I can't really tell from the snippet.

No, it's linear time because hash sets have constant time removal. If it was a BTreeMap, it would have been O(n log n).

3 Likes