Idiomatic way to get difference between two Vecs?

I got two vectors, and I want to get a third one containing all the elements of the second vector that the first doesn't have. Example :
vector 1 : [1, 2, 3, 4, 5, 6]
vector 2 : [1, 2, 3, 4]
vector 3 : [5, 6]
I could make my own algorithm for that, but I was wondering if it wasn't already there.

Usually this is done by grouping your data into HashSets and doing .symmetric_difference(&other) on them.

6 Likes

So there isn't something without hashes ?

I came up with this

let mut difference = vec![];
for i in new_items {
    if !previous_new_items.contains(&i) {
        difference.push(i);
    }
}
```
1 Like

The issue with that is the speed of it, since if they're grouped into a HashSet, then you're only comparing hashes. Additionally, .contains does a linear search, instead of the (probably) binary search used by the HashSet.

2 Likes

Hash is for optimization. For example your code takes O(n^2) to execute, while the one with hash would takes O(n). But keep in mind that there's not silver bullet on performance, your code may run faster on small inputs like vectors with length under 100.

1 Like

My inputs are small, so I think it will run faster... Also, doesn't hashing take a lot of processing power ?

Depends on your data, if you're hashing a few numbers, then it's just a few XOr instructions usually, however for things like Strings, it may require you hash each byte (or group of bytes) together. It really depends on the hasher you've used for the HashSet or HashMap. Stdlib uses an algorithm to avoid repetition attacks, while there are other algorithms implemented in separate crates which focus on more speed.

2 Likes

They are strings

If you’re worried about the hash algorithm speed, you can always use BTreeSet instead. It has slightly worse performance on large sets, but is based on ordinary comparisons. As with anything performance related, reasoning will only get you so far— try it a few ways and measure the performance directly.

1 Like
  • If you are handling only vectors with small number of elements (say, less than 100), your code is fine. But I would write it using iterators:
let difference: Vec<_> = new_items.into_iter().filter(|item| !previous_items.contains(item)).collect();

which is more preferred (idiomatic Rust) IMO because mut binding of difference is not needed.

  • Yes, the code has quadratic time complexity. If, for example, the code is used in a web API and a user can provide the input, then a malicious user can cause DoS (denial of service).
  • Using HashSet is preferred for larger input. Using HashSet for smaller input can potentially be slower (due to overheads) / generates a larger code, but if the code is not used in a performance-critical part then don't bother; you may use whatever code you like.
let item_set: HashSet<_> = previous_items.iter().collect();
let difference: Vec<_> = new_items.into_iter().filter(|item| !item_set.contains(item)).collect();
8 Likes

Thanks for the tip with iterators, going to use that !

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.