Deduplicate vector in place while preserving order

Hi. I'm a Rust beginner and would like to know if this is the right way to deduplicate a vector in place while preserving its order, so that if an element is repeated anywhere in the vector, only its first occurrence is kept. I would mainly like to know if my solution is idiomatic or if there is a better way to do it. Thank you :slight_smile:

use std::collections::HashSet;

fn dedup(v: &mut Vec<i32>) {
    let mut set = HashSet::new();
    let mut indices = Vec::new();

    // First pass: record the index of every repeated element.
    for i in 0..v.len() {
        if !set.contains(&v[i]) {
            set.insert(v[i]);
        } else {
            indices.push(i);
        }
    }

    // Second pass: remove them, adjusting for the elements already removed.
    for (pos, e) in indices.iter().enumerate() {
        v.remove(*e - pos);
    }
}

Welcome!

You can simplify this slightly (and reduce the number of HashSet lookups) by using the return value from HashSet::insert:

fn dedup(v: &mut Vec<i32>) {
    let mut set = HashSet::new();
    let mut indices = Vec::new();

    for i in 0..v.len() {
        if !set.insert(v[i]) {
            indices.push(i);
        }
    }

    for (pos, e) in indices.iter().enumerate() {
        v.remove(*e - pos);
    }
}

Also, calling Vec::remove repeatedly is fairly inefficient: each call shifts every later element down by one, so the whole function can become quadratic. You can probably improve speed by using Vec::retain, or by imitating what it does internally.
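
(For illustration, a rough sketch of the "imitate what retain does internally" idea, not taken from anyone's post: keep a write cursor, swap retained elements forward, and truncate once at the end.)

use std::collections::HashSet;

fn dedup(v: &mut Vec<i32>) {
    let mut set = HashSet::new();
    let mut write = 0; // next slot for an element we keep

    for read in 0..v.len() {
        if set.insert(v[read]) {
            v.swap(write, read); // move the first occurrence forward
            write += 1;
        }
    }

    v.truncate(write); // drop the duplicates left at the tail
}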


Combining your approaches results in the following extremely short (and elegant) function:

fn dedup(v: &mut Vec<i32>) {
    let mut set = HashSet::new();
    
    v.retain(|x| set.insert(*x));
}
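
For instance, a quick check of the behaviour (example values are arbitrary):

fn main() {
    let mut v = vec![3, 1, 3, 2, 1];
    dedup(&mut v);
    assert_eq!(v, [3, 1, 2]); // first occurrences kept, original order preserved
}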

Thank you so much @mbrubeck and @OptimisticPeach. I've never seen retain before, but the documentation makes me think it is something like JavaScript's filter, but in place(?). This does look infinitely more elegant compared to mine :slight_smile: . Could you please tell me what those vertical pipes around x do?

Yup, quoting the docs:

This method operates in place, visiting each element exactly once in the original order, and preserves the order of the retained elements.

That's a closure. retain takes a predicate, which is just something that implements the FnMut trait (here, FnMut(&i32) -> bool).

The full syntax for closures is the following:

let closure = |x: &i32| -> bool { set.insert(*x) };

Although it can be shortened, because the compiler can infer the input type (&i32) and the output type (bool) from context (unlike with a function).

Closures are necessary here to allow the predicate to refer to a local variable (the set), something a plain function could not do.
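
(A minimal sketch of that point, with arbitrary values: the closure captures the local seen set from the surrounding scope, which a free fn keep(x: &i32) -> bool could not reach.)

use std::collections::HashSet;

fn main() {
    let mut seen = HashSet::new();
    let mut v = vec![1, 2, 2, 3, 1];

    // The closure borrows `seen` mutably from the enclosing scope.
    v.retain(|x| seen.insert(*x));

    assert_eq!(v, [1, 2, 3]);
}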


Checking the definition of retain, it's pretty cool; it avoids all the defects of the previous solutions ^^

Makes sense. Thanks again for all the help!


If this is about just solving the problem, there is also a dedup method in the itertools crate.

That's not in place

You're right, I missed that part in the original question
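
(For reference, a rough sketch of that not-in-place route, assuming Itertools::unique is the method meant here, since it keeps the first occurrence of each element, whereas itertools' dedup only drops consecutive duplicates.)

use itertools::Itertools;

fn dedup_owned(v: Vec<i32>) -> Vec<i32> {
    // Builds a new Vec rather than modifying the original in place.
    v.into_iter().unique().collect()
}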

Performance-wise, I'd say this is the better way of doing that:

fn dedup(v: &mut Vec<i32>) {
    let len = v.len();
    
    if len < 2 {
        return;
    }
    
    for i in (1..len).rev() {
        if v[0..i].contains(&v[i]) {
            v.remove(i);
        }
    }
}

Well, I came up with some benchmarks (two Rust Playground links, the original and a "version 2: refactored"). Copy these into a project locally and run with cargo +nightly bench.

I get the results as follows:

running 4 tests
test dedup_find_number     ... bench:      59,244 ns/iter (+/- 10,058)
test dedup_find_strings    ... bench:     143,258 ns/iter (+/- 12,633)
test dedup_hashset_number  ... bench:      18,250 ns/iter (+/- 1,401)
test dedup_hashset_strings ... bench:     197,225 ns/iter (+/- 17,380)

running 4 tests
test dedup_find_number     ... bench:      48,040 ns/iter (+/- 7,151)
test dedup_find_strings    ... bench:     155,147 ns/iter (+/- 31,426)
test dedup_hashset_number  ... bench:      18,163 ns/iter (+/- 1,278)
test dedup_hashset_strings ... bench:     198,285 ns/iter (+/- 41,140)

From which I can say that @mbrubeck's combined approach is better for numbers, whereas yours is better for strings.


I can improve my solution slightly:

fn dedup<T: PartialEq>(v: &mut Vec<T>) {
    let mut i = v.len();

    if i < 2 {
        return;
    }

    while i > 1 {
        i -= 1;
        if v[0..i].contains(&v[i]) {
            v.remove(i);
        }
    }
}

However, using your benchmark framework, @mbrubeck's hashset approach is still faster for numbers.

I would expect the hashing solution to be better for numbers:

A hash set lookup is expected to take near O(1) time for most types, whereas your linear search is O(n) per element, so over the whole vector the hashing approach is roughly O(n) against O(n²).


Upon examining the issue more closely, it occurred to me why linear search beats the hashset in the case of strings: I only have 100 unique strings, so obviously hashing hundreds of thousands of characters is more expensive than testing each element for equality against roughly 100 strings (100 is only a rough figure). If I set the parameters to the following:

  • Number of number elements: 80 000
  • Number of string elements: 80 000
  • Maximum number of characters per string: 170
  • Number of unique strings in my dataset: 5000

running 4 tests
test dedup_find_number     ... bench:   3,974,130 ns/iter (+/- 539,529)
test dedup_find_strings    ... bench: 190,176,750 ns/iter (+/- 13,634,351)
test dedup_hashset_number  ... bench:   1,228,892 ns/iter (+/- 172,055)
test dedup_hashset_strings ... bench:  24,453,160 ns/iter (+/- 3,741,355)

The results are quite different.

In other words, the test data you use is quite important in determining the results, especially for such an algorithm.


I've changed the Vec<String> into Vec<Rc<str>> to minimize cloning cost, and this is the result on my machine.
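
(The conversion itself is a one-liner; a hypothetical sketch, with the function name invented here.)

use std::rc::Rc;

fn to_rc_strs(v: Vec<String>) -> Vec<Rc<str>> {
    // Each String is converted into an Rc<str> once; later clones of the
    // vector only bump reference counts instead of copying string data.
    v.into_iter().map(Rc::from).collect()
}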

running 4 tests
test dedup_find_number     ... bench:      50,873 ns/iter (+/- 2,136)
test dedup_find_strings    ... bench:      80,526 ns/iter (+/- 6,963)
test dedup_hashset_number  ... bench:      12,220 ns/iter (+/- 1,506)
test dedup_hashset_strings ... bench:      40,947 ns/iter (+/- 2,845)

You don't need Rc (nor unsafe) to avoid cloning:

use hashbrown::raw::RawTable;
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

fn dedup<T: Hash + Eq>(v: &mut Vec<T>) {
    let mut set = RawTable::new();
    let random_state = RandomState::new();
    let make_hash = |s: &T| {
        let mut hasher = random_state.build_hasher();
        s.hash(&mut hasher);
        hasher.finish()
    };

    let len = v.len();
    let mut del = 0; // number of duplicates seen so far
    for i in 0..len {
        let current = &v[i];
        let hash = make_hash(current);
        // The table stores indices into the deduplicated prefix of `v`,
        // but hashes and compares the elements those indices point to.
        let present = set.find(hash, |&j| &v[j] == current).is_some();

        if present {
            del += 1;
        } else if del > 0 {
            // Shift the kept element back over the duplicates found so far.
            v.swap(i - del, i);
        }

        if !present {
            set.insert(hash, i - del, |&j| make_hash(&v[j]));
        }
    }
    if del > 0 {
        v.truncate(len - del);
    }
}

This is basically Vec::retain inlined, plus a hash set that stores indices into the first part of the Vec (the part that holds the first occurrences of the unique items) but hashes and compares the corresponding items (that's why I used RawTable).

However, the benches are quite different from yours @Hyeonu, could you post your code? Edit: I think that may be because the bench also includes cloning the original vec, which is much faster if it contains Rcs instead of Strings, so you aren't actually comparing the same thing.

with small dataset:

test dedup_find_number      ... bench:      30,381 ns/iter (+/- 249)
test dedup_find_strings     ... bench:     166,714 ns/iter (+/- 977)
test dedup_rawtable_number  ... bench:      14,975 ns/iter (+/- 227)
test dedup_rawtable_strings ... bench:     124,579 ns/iter (+/- 2,828)

with bigger dataset:

test dedup_find_number      ... bench:   2,605,550 ns/iter (+/- 164,995)
test dedup_find_strings     ... bench: 154,721,810 ns/iter (+/- 2,262,849)
test dedup_rawtable_number  ... bench:     981,043 ns/iter (+/- 20,523)
test dedup_rawtable_strings ... bench:  15,622,630 ns/iter (+/- 1,227,135)

This is the code I've used. It's nearly identical to @OptimisticPeach's code except for the string type. I don't see why it's not comparing the same things, though; it does bench the vector cloning in both cases.

I also found that Rc<T> has an optimized PartialEq impl with an Rc::ptr_eq() shortcut, which may affect the result. I wasn't expecting it, since &T doesn't have such an optimization, but the rationale explained in the source code seems reasonable.

So I've made a newtype over Rc<str> with a PartialEq impl that actually compares the strs. Nothing meaningfully changed, though.
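
(A minimal sketch of such a newtype, with the name invented here: it hashes the underlying str but forces a content comparison instead of the Rc::ptr_eq shortcut.)

use std::hash::{Hash, Hasher};
use std::rc::Rc;

#[derive(Clone)]
struct Str(Rc<str>);

impl PartialEq for Str {
    fn eq(&self, other: &Self) -> bool {
        *self.0 == *other.0 // always compare string contents, never pointers
    }
}

impl Eq for Str {}

impl Hash for Str {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.0.hash(state) // hash the underlying str
    }
}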

running 4 tests
test dedup_find_number     ... bench:      52,983 ns/iter (+/- 11,594)
test dedup_find_strings    ... bench:      80,483 ns/iter (+/- 9,978)
test dedup_hashset_number  ... bench:      12,244 ns/iter (+/- 698)
test dedup_hashset_strings ... bench:      42,265 ns/iter (+/- 5,417)

Cloning a vector also involves cloning its elements, which is expensive when their type is String but cheap when it's Rc<str>.
This makes comparing dedup_find_strings and dedup_hashset_strings with other benchmarks meaningless since they have different constant costs.

However, I guess the difference dedup_find_strings - dedup_hashset_strings could still be comparable, since the constant costs of the two benches cancel out. In fact, in both my version and your version this difference is about 40000 ns/iter, while it originally was about -40000 ns/iter (notice the sign), so it checks out.
