Removing Shared Vector Elements

Hello, beginner here. I'm writing a program that has two string vectors, and I need to remove any string that's in both lists from both lists, leaving just the unique values. The lists are sorted beforehand.

After doing some research, I came up with the following function, which seems to be working:

fn main() {
	let mut v1 = vec!["A1.2", "A2.2", "B2.1", "CR1.3", "D1.2", "U10.76"];
	let mut v2 = vec!["A1.2", "A2.4", "B2.1", "CR1.A", "D1.K", "U10.76"];
	// remove:         ****            ****                     ******
	
	remove_matches(&mut v1, &mut v2);
	
	assert_eq!(v1, vec!["A2.2", "CR1.3", "D1.2"]);
	assert_eq!(v2, vec!["A2.4", "CR1.A", "D1.K"]);
}

fn remove_matches(v1: &mut Vec<&str>, v2: &mut Vec<&str>) {
	let mut len2 = v2.len();
	let mut i = 0;
	
	while i < v1.len() {
		let word = v1[i].clone();
		
		// Drop every occurrence of `word` from v2.
		v2.retain(|s| s != &word);
		
		// If v2 shrank, `word` was shared, so drop it from v1 too
		// (and don't advance `i`, since index i now holds a new element).
		if v2.len() < len2 {
			v1.retain(|s| s != &word);
			len2 = v2.len();
		} else {
			i += 1;
		}
	}
}

Since I'm still getting used to chaining/closures, as well as the std methods available, I was wondering if there was a better way to implement this functionality.

Any thoughts?

To me this looks like an overdose of chaining/closures:

  • Each string in v1 gets cloned. Pointless overhead.
  • Detecting a match by checking whether the vector's length changed looks error-prone for more complex cases.

A classical approach doesn't have these issues:

fn remove_matches(v1: &mut Vec<&str>, v2: &mut Vec<&str>) {
  let mut i = 0;

  while i < v1.len() {
    let mut j = 0;

    while j < v2.len() {
      if v1[i] == v2[j] {
        v1.remove(i);
        v2.remove(j);
        break;
      }
      j += 1;
    }
    i += 1;
  }
}

Ah okay. I actually tried something similar; I just wasn't sure if it was "Rusty" enough. Looks like I was overthinking it.

Thank you very much!


This isn't quite right: it doesn't seem to handle duplicate items correctly. In particular, remove_matches(&mut vec!["";1024], &mut vec!["";1024]) leaves half of the items in both vectors instead of removing them all.

It also does O(n²) comparisons, which isn't great for large vectors.
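
A quick check of that duplicate case, using the nested-loop remove_matches from the reply above (a minimal sketch; the assertions just encode the "half remains" behavior described here):

fn main() {
    let mut v1 = vec![""; 1024];
    let mut v2 = vec![""; 1024];

    remove_matches(&mut v1, &mut v2);

    // Each outer iteration removes one pair but still advances `i`, so the
    // loop stops once `i` catches up with the shrinking length: 512 removals.
    assert_eq!(v1.len(), 512);
    assert_eq!(v2.len(), 512);
}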


I'd probably start with something like this to take advantage of the lists being presorted. There's still room for optimization here, but it's easy to understand and the asymptotic performance is at least reasonable:

fn remove_matches(v1: &mut Vec<&str>, v2: &mut Vec<&str>) {
    // Take ownership of both vectors (leaving them empty) and walk the two
    // sorted sequences in lockstep, pushing back only the non-shared items.
    let mut v1_iter = std::mem::take(v1).into_iter().peekable();
    let mut v2_iter = std::mem::take(v2).into_iter().peekable();
    
    loop {
        match (v1_iter.peek(), v2_iter.peek()) {
            (None,    None   ) => return,
            (Some(_), None   ) => v1.extend(&mut v1_iter),
            (None,    Some(_)) => v2.extend(&mut v2_iter),
            (Some(a), Some(b)) => {
                use std::cmp::Ordering::*;
                match a.cmp(b) {
                    Less    => v1.push(v1_iter.next().unwrap()),
                    Greater => v2.push(v2_iter.next().unwrap()),
                    // Equal: the item is shared, so discard one copy from each side.
                    Equal   => {
                        let _ = v1_iter.next();
                        let _ = v2_iter.next();
                    }
                }
            }
        }
    }
}

It depends on whether you allow duplicate items and expect the lists to be sorted. Which you do, but the test case, the original code, and my code don't.

Also, your code is great in theory, but in a naive benchmark it appears to be slower in practice.
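
A minimal sketch of such a harness, assuming large pre-sorted inputs with partial overlap (the key format, input sizes, and std::time::Instant timing are illustrative assumptions, not the original setup):

use std::time::Instant;

// Build two large, already-sorted lists with partially overlapping keys.
fn make_input(n: usize) -> (Vec<String>, Vec<String>) {
    let a: Vec<String> = (0..n).map(|i| format!("K{:08}", i * 2)).collect();
    let b: Vec<String> = (0..n).map(|i| format!("K{:08}", i * 3)).collect();
    (a, b)
}

fn main() {
    let (s1, s2) = make_input(10_000);
    // The remove_matches variants above take Vec<&str>, so borrow the owned Strings.
    let mut v1: Vec<&str> = s1.iter().map(|s| s.as_str()).collect();
    let mut v2: Vec<&str> = s2.iter().map(|s| s.as_str()).collect();

    let start = Instant::now();
    remove_matches(&mut v1, &mut v2);
    println!("remove_matches() {} ms.", start.elapsed().as_millis());
}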

While absolute results vary, I typically get something like

remove_matches_v1() 153 ms.
remove_matches_v2() 85 ms.
remove_matches_v3() 140 ms.

(v1 = original code, v2 = my code, v3 = your code)

This kind of confirms my general rule that simple code is usually faster. All this messing with Some(), peek(), unwrap() and closures costs time.

Use the right data structure: in this case, BTreeSet could work nicely, and it has an intersection method.
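
A minimal sketch of that idea, assuming it's fine to drop every copy of a shared item:

use std::collections::BTreeSet;

fn remove_matches(v1: &mut Vec<&str>, v2: &mut Vec<&str>) {
    // Collect both lists into sets, compute the shared items once,
    // then keep only the entries that aren't shared.
    let s1: BTreeSet<&str> = v1.iter().copied().collect();
    let s2: BTreeSet<&str> = v2.iter().copied().collect();
    let shared: BTreeSet<&str> = s1.intersection(&s2).copied().collect();

    v1.retain(|s| !shared.contains(s));
    v2.retain(|s| !shared.contains(s));
}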


The test case does use sorted lists, and so does the problem statement:

    "The lists are sorted beforehand."

Also, the problem statement doesn't state that the lists have been deduplicated, so that's not really something a correct solution can rely on.


I'm not surprised by those results at all, as the test lists are quite short and I didn't spend any time doing things like avoiding allocations. If you increase the list lengths a bit, the difference becomes more obvious:

remove_matches_v1() 14411 ms.
remove_matches_v2() 11260 ms.
remove_matches_v3() 7 ms.

You are neglecting asymptotic complexity. @2e71828's code is linear-time, while the other two are quadratic.


I wrote code that does the job and is fast for the job as given 🙂

For completeness, here's a test case where this code fails to remove a duplicate, because it only handles equal numbers of duplicates:

  let v1 = vec!["A1.2", "A2.2", "B2.1", "B2.1", "CR1.3", "D1.2", "U10.76"];
  let v2 = vec!["A1.2", "A2.4", "B2.1", "CR1.A", "D1.K", "U10.76"];
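
For example, the merge-based remove_matches from earlier in the thread leaves the extra "B2.1" behind with this input:

fn main() {
    let mut v1 = vec!["A1.2", "A2.2", "B2.1", "B2.1", "CR1.3", "D1.2", "U10.76"];
    let mut v2 = vec!["A1.2", "A2.4", "B2.1", "CR1.A", "D1.K", "U10.76"];

    remove_matches(&mut v1, &mut v2);

    // Only one matching "B2.1" pair is removed, so one copy stays in v1.
    assert_eq!(v1, vec!["A2.2", "B2.1", "CR1.3", "D1.2"]);
    assert_eq!(v2, vec!["A2.4", "CR1.A", "D1.K"]);
}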

Also, none of these three versions removes or deduplicates duplicates that appear in only one vector, like:

  let v1 = vec!["A1.2", "A2.2", "A2.2"];
  let v2 = vec!["A1.2", "A2.4"];

That said, I agree that the "complex" code works quite well for big arrays. Taking the sort order into account in the "simple" code, I can make it faster for small arrays and just as fast, but not faster, for big arrays.
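
A sketch of what such a "simple" sorted version could look like (illustrative only, not the code behind the timings above):

fn remove_matches(v1: &mut Vec<&str>, v2: &mut Vec<&str>) {
    // Walk both sorted vectors with plain indices and delete matching pairs.
    let (mut i, mut j) = (0, 0);
    while i < v1.len() && j < v2.len() {
        match v1[i].cmp(v2[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                // `remove` shifts the tail, which is only cheap for small inputs.
                v1.remove(i);
                v2.remove(j);
            }
        }
    }
}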

Well that's nifty.

use std::collections::BTreeSet;

fn main() {
	let b1 = BTreeSet::from(["A1.2", "A2.2", "B2.1", "CR1.3", "D1.2", "U10.76", "Q1.2"]);
	let b2 = BTreeSet::from(["A1.2", "A2.4", "B2.1", "CR1.A", "D1.K", "U10.76"]);
	// remove:                ****            ****                     ******
	
	let v1: Vec<_> = b1.difference(&b2).collect();
	let v2: Vec<_> = b2.difference(&b1).collect();
	
	assert_eq!(v1, vec![&"A2.2", &"CR1.3", &"D1.2", &"Q1.2"]);
	assert_eq!(v2, vec![&"A2.4", &"CR1.A", &"D1.K"]);
}

Recently finished my first read-through of the book and wasn't sure where to go next; looks like I'll be taking a dive into collections.

Thank you!
