Deduplicate vector in place while preserving order

Hi. I'm a Rust beginner and would like to know if this is the right way to deduplicate a vector in place while preserving its order, so that if an element is repeated anywhere in the vector, only its first occurrence is kept. I would mainly like to know if my solution is idiomatic or if there is a better way to do it. Thank you :slight_smile:

use std::collections::HashSet;

fn dedup(v: &mut Vec<i32>) {
    let mut set = HashSet::new();
    let mut indices = Vec::new();

    // First pass: record the index of every repeated element.
    for i in 0..v.len() {
        if !set.contains(&v[i]) {
            set.insert(v[i]);
        } else {
            indices.push(i);
        }
    }

    // Second pass: remove them, adjusting for the elements already removed.
    for (pos, e) in indices.iter().enumerate() {
        v.remove(*e - pos);
    }
}

Welcome!

You can simplify this slightly (and reduce the number of HashSet lookups) by using the return value from HashSet::insert:

fn dedup(v: &mut Vec<i32>) {
    let mut set = HashSet::new();
    let mut indices = Vec::new();

    for i in 0..v.len() {
        if !set.insert(v[i]) {
            indices.push(i);
        }
    }

    for (pos, e) in indices.iter().enumerate() {
        v.remove(*e - pos);
    }
}

Also, calling Vec::remove repeatedly is fairly inefficient: each call shifts every later element down by one, so the whole function can become quadratic. You can probably improve speed by using Vec::retain, or by imitating what it does internally.
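
(For illustration, a rough sketch of the "imitate what retain does internally" idea, not taken from anyone's post: keep a write cursor, swap retained elements forward, and truncate once at the end.)

use std::collections::HashSet;

fn dedup(v: &mut Vec<i32>) {
    let mut set = HashSet::new();
    let mut write = 0; // next slot for an element we keep

    for read in 0..v.len() {
        if set.insert(v[read]) {
            v.swap(write, read); // move the first occurrence forward
            write += 1;
        }
    }

    v.truncate(write); // drop the duplicates left at the tail
}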


Combining your approaches results in the following extremely short (and elegant) function:

fn dedup(v: &mut Vec<i32>) {
    let mut set = HashSet::new();
    
    v.retain(|x| set.insert(*x));
}
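
For instance, a quick check of the behaviour (example values are arbitrary):

fn main() {
    let mut v = vec![3, 1, 3, 2, 1];
    dedup(&mut v);
    assert_eq!(v, [3, 1, 2]); // first occurrences kept, original order preserved
}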

Thank you so much @mbrubeck and @OptimisticPeach. I've never seen retain before, but the documentation makes me think it is something like JavaScript's filter, but in place(?). This does look infinitely more elegant compared to mine :slight_smile: . Could you please tell me what those vertical pipes around x do?

Yup, quoting the docs:

This method operates in place, visiting each element exactly once in the original order, and preserves the order of the retained elements.

That's a closure. retain takes a predicate, which is just something that implements the FnMut trait (here, FnMut(&i32) -> bool).

The full syntax for closures is the following:

let closure = |x: &i32| -> bool { set.insert(*x) };

Although it can be shortened, because the compiler can infer the input type (&i32) and the output type (bool) from context (unlike with a function).

Closures are necessary here to allow the predicate to refer to a local variable (the set), something a plain function could not do.
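
(A minimal sketch of that point, with arbitrary values: the closure captures the local seen set from the surrounding scope, which a free fn keep(x: &i32) -> bool could not reach.)

use std::collections::HashSet;

fn main() {
    let mut seen = HashSet::new();
    let mut v = vec![1, 2, 2, 3, 1];

    // The closure borrows `seen` mutably from the enclosing scope.
    v.retain(|x| seen.insert(*x));

    assert_eq!(v, [1, 2, 3]);
}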


Checking the definition of retain, it's pretty cool; it avoids all the defects of the previous solutions ^^

Makes sense. Thanks again for all the help!


If this is about just solving the problem, there is also a dedup method in the itertools crate.

That's not in place

You're right, I missed that part in the original question
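
(For reference, a rough sketch of that not-in-place route, assuming Itertools::unique is the method meant here, since it keeps the first occurrence of each element, whereas itertools' dedup only drops consecutive duplicates.)

use itertools::Itertools;

fn dedup_owned(v: Vec<i32>) -> Vec<i32> {
    // Builds a new Vec rather than modifying the original in place.
    v.into_iter().unique().collect()
}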

Performance-wise, I'd say this is the better way of doing that:

fn dedup(v: &mut Vec<i32>) {
    let len = v.len();
    
    if len < 2 {
        return;
    }
    
    for i in (1..len).rev() {
        if v[0..i].contains(&v[i]) {
            v.remove(i);
        }
    }
}

Well, I came up with some benchmarks (two Rust Playground links, the original and a "version 2: refactored"). Copy these into a project locally and run with cargo +nightly bench.

I get the results as follows:

running 4 tests
test dedup_find_number     ... bench:      59,244 ns/iter (+/- 10,058)
test dedup_find_strings    ... bench:     143,258 ns/iter (+/- 12,633)
test dedup_hashset_number  ... bench:      18,250 ns/iter (+/- 1,401)
test dedup_hashset_strings ... bench:     197,225 ns/iter (+/- 17,380)

running 4 tests
test dedup_find_number     ... bench:      48,040 ns/iter (+/- 7,151)
test dedup_find_strings    ... bench:     155,147 ns/iter (+/- 31,426)
test dedup_hashset_number  ... bench:      18,163 ns/iter (+/- 1,278)
test dedup_hashset_strings ... bench:     198,285 ns/iter (+/- 41,140)

From which I can say that @mbrubeck's combined approach is better for numbers, whereas yours is better for strings.


I can improve my solution slightly:

fn dedup<T: PartialEq>(v: &mut Vec<T>) {
    let mut i = v.len();

    if i < 2 {
        return;
    }

    while i > 1 {
        i -= 1;
        if v[0..i].contains(&v[i]) {
            v.remove(i);
        }
    }
}

However, using your benchmark framework, @mbrubeck's hashset approach is still faster for numbers.

I would expect the hashing solution to be better for numbers:

A hash set lookup is expected to take near O(1) time for most types, whereas your linear search is O(n) per element, so over the whole vector the hashing approach is roughly O(n) against O(n²).


Upon examining the issue more closely, it occurred to me why linear search beats the hashset in the case of strings: I only have 100 unique strings, so obviously hashing hundreds of thousands of characters is more expensive than testing each element for equality against roughly 100 strings (100 is only a rough figure). If I set the parameters to the following:

  • Number of number elements: 80 000
  • Number of string elements: 80 000
  • Maximum number of characters per string: 170
  • Number of unique strings in my dataset: 5000

running 4 tests
test dedup_find_number     ... bench:   3,974,130 ns/iter (+/- 539,529)
test dedup_find_strings    ... bench: 190,176,750 ns/iter (+/- 13,634,351)
test dedup_hashset_number  ... bench:   1,228,892 ns/iter (+/- 172,055)
test dedup_hashset_strings ... bench:  24,453,160 ns/iter (+/- 3,741,355)

The results are quite different.

In other words, the test data you use is quite important in determining the results, especially for such an algorithm.


I've changed the Vec<String> into Vec<Rc<str>> to minimize cloning cost, and this is the result on my machine.
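
(The conversion itself is a one-liner; a hypothetical sketch, with the function name invented here.)

use std::rc::Rc;

fn to_rc_strs(v: Vec<String>) -> Vec<Rc<str>> {
    // Each String is converted into an Rc<str> once; later clones of the
    // vector only bump reference counts instead of copying string data.
    v.into_iter().map(Rc::from).collect()
}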

running 4 tests
test dedup_find_number     ... bench:      50,873 ns/iter (+/- 2,136)
test dedup_find_strings    ... bench:      80,526 ns/iter (+/- 6,963)
test dedup_hashset_number  ... bench:      12,220 ns/iter (+/- 1,506)
test dedup_hashset_strings ... bench:      40,947 ns/iter (+/- 2,845)

You don't need Rc (nor unsafe) to avoid cloning:

use hashbrown::raw::RawTable;
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

fn dedup<T: Hash + Eq>(v: &mut Vec<T>) {
    let mut set = RawTable::new();
    let random_state = RandomState::new();
    let make_hash = |s: &T| {
        let mut hasher = random_state.build_hasher();
        s.hash(&mut hasher);
        hasher.finish()
    };

    let len = v.len();
    let mut del = 0; // number of duplicates seen so far
    for i in 0..len {
        let current = &v[i];
        let hash = make_hash(current);
        // The table stores indices into the deduplicated prefix of `v`,
        // but hashes and compares the elements those indices point to.
        let present = set.find(hash, |&j| &v[j] == current).is_some();

        if present {
            del += 1;
        } else if del > 0 {
            // Shift the kept element back over the duplicates found so far.
            v.swap(i - del, i);
        }

        if !present {
            set.insert(hash, i - del, |&j| make_hash(&v[j]));
        }
    }
    if del > 0 {
        v.truncate(len - del);
    }
}

This is basically Vec::retain inlined, plus a hash set that stores indices into the first part of the Vec (the part that holds the first occurrences of the unique items) but hashes and compares the corresponding items (that's why I used RawTable).

However, the benches are quite different from yours @Hyeonu, could you post your code? Edit: I think that may be because the bench also includes cloning the original vec, which is much faster if it contains Rcs instead of Strings, so you aren't actually comparing the same thing.

with small dataset:

test dedup_find_number      ... bench:      30,381 ns/iter (+/- 249)
test dedup_find_strings     ... bench:     166,714 ns/iter (+/- 977)
test dedup_rawtable_number  ... bench:      14,975 ns/iter (+/- 227)
test dedup_rawtable_strings ... bench:     124,579 ns/iter (+/- 2,828)

with bigger dataset:

test dedup_find_number      ... bench:   2,605,550 ns/iter (+/- 164,995)
test dedup_find_strings     ... bench: 154,721,810 ns/iter (+/- 2,262,849)
test dedup_rawtable_number  ... bench:     981,043 ns/iter (+/- 20,523)
test dedup_rawtable_strings ... bench:  15,622,630 ns/iter (+/- 1,227,135)

This is the code I've used. It's nearly identical to @OptimisticPeach's code except for the string type. I don't see why it's not comparing the same things, though; it does bench the vector cloning in both cases.

I also found that Rc<T> has an optimized PartialEq impl with an Rc::ptr_eq() shortcut, which may affect the result. I wasn't expecting it, since &T doesn't have such an optimization, but the rationale explained in the source code seems reasonable.

So I've made a newtype over Rc<str> with a PartialEq impl that actually compares the strs. Nothing meaningfully changed, though.
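
(A minimal sketch of such a newtype, with the name invented here: it hashes the underlying str but forces a content comparison instead of the Rc::ptr_eq shortcut.)

use std::hash::{Hash, Hasher};
use std::rc::Rc;

#[derive(Clone)]
struct Str(Rc<str>);

impl PartialEq for Str {
    fn eq(&self, other: &Self) -> bool {
        *self.0 == *other.0 // always compare string contents, never pointers
    }
}

impl Eq for Str {}

impl Hash for Str {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.0.hash(state) // hash the underlying str
    }
}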

running 4 tests
test dedup_find_number     ... bench:      52,983 ns/iter (+/- 11,594)
test dedup_find_strings    ... bench:      80,483 ns/iter (+/- 9,978)
test dedup_hashset_number  ... bench:      12,244 ns/iter (+/- 698)
test dedup_hashset_strings ... bench:      42,265 ns/iter (+/- 5,417)

Cloning a vector also involves cloning its elements, which is expensive when their type is String but cheap when it's Rc<str>.
This makes comparing dedup_find_strings and dedup_hashset_strings with other benchmarks meaningless since they have different constant costs.

However, I guess the difference dedup_find_strings - dedup_hashset_strings could still be comparable, since the constant costs of the two benches cancel out. In fact, in both my version and your version this difference is about 40000 ns/iter, while it originally was about -40000 ns/iter (notice the sign), so it checks out.
