Copy Rc<Vec<T>> into a VecDeque<T>

We have:

pub fn fast_copy<T: Clone>(data: Rc<Vec<T>>) -> VecDeque<T> {
  let mut ans = VecDeque::new();
  for x in data.as_ref().iter() {
    ans.push_back(x.clone());
  }
  ans

Is there a faster way to do this instead of doing a VecDeque::new() followed by lots of VecDeque::push_back ?

Probably by cloning the vector and then using the From implementation: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=70647d9b365f621d37189321290fb31e

3 Likes

You can use

pub fn fast_copy<T: Clone>(data: Rc<Vec<T>>) -> VecDeque<T> {
    FromIterator::from_iter(data.iter().cloned())
}

or

pub fn fast_copy<T: Clone>(data: Rc<Vec<T>>) -> VecDeque<T> {
    (*data).clone().into()
}

but you’d have to test the performance.

3 Likes

My curiosity got the better of me. Here is the benchmark results.

running 3 tests
test fast_copy1 ... bench:       2,566 ns/iter (+/- 180)                                                                                                                      
test fast_copy2 ... bench:       2,356 ns/iter (+/- 150)                                                                                                                      
test fast_copy3 ... bench:         120 ns/iter (+/- 10) 

where T is i32. The third example above has a tight loop copy, while, other solutions seem to break the tight loop with its iteration logic. So as always tight loops wins on x86, or for that matter any stored program architectures.

Ref: https://github.com/prataprc/rplay/blob/master/benches/src/lib.rs

4 Likes

<VecDeque<T> as From<Vec<T>>>::from currently reuses the cloned vec (RawVec) allocation (it just reallocates so that its size reaches a power of 2, so potentially triggering just a single memcpy of [T; len] bytes). And given that Vec::clone is quite cheap when T is Copy and small (such as with integer types), the fact that fast_copy3 may be the fastest is quite plausible.

5 Likes

The magnitude of difference here makes me suspicious of a benchmarking quirk.

Compiling the example, I see

warning: unused variable: `out`
  --> src/lib.rs:36:13
   |
36 |         let out: VecDeque<i32> = (*data).clone().into();
   |             ^^^ help: consider prefixing with an underscore: `_out`

That means the compiler might be smart enough to just take the code out, since it’s not black-boxed. That bit of code should be

    b.iter(|| {
        let out: VecDeque<i32> = (*data).clone().into();
        out
    });

so that #[bench] ensures that the result isn’t optimized out. (The other ones need similar changes.)

As another minor option to try, since the data is Copy:

#[bench]
fn fast_copy4(b: &mut Bencher) {
    let mut arr = Vec::new();
    arr.resize(1000, 0);
    let data = Rc::new(arr);
    b.iter(|| {
        data.iter().copied().collect::<VecDeque<i32>>()
    });
}

Did not think about compiler “optimizing out” un-used code. Good point.
I have amended the code with the above recommendation along with fast_copy4 version.

running 4 tests
test fast_copy1 ... bench:       2,519 ns/iter (+/- 254)
test fast_copy2 ... bench:       2,415 ns/iter (+/- 228)
test fast_copy3 ... bench:         119 ns/iter (+/- 13)
test fast_copy4 ... bench:       2,501 ns/iter (+/- 311)

I too feel fast_copy3() is insanely fast compared to other logic.

If I reduce the resize() from 1000 to 100.

test fast_copy1 ... bench:         392 ns/iter (+/- 10)
test fast_copy2 ... bench:         253 ns/iter (+/- 19)
test fast_copy3 ... bench:          81 ns/iter (+/- 2)
test fast_copy4 ... bench:         257 ns/iter (+/- 21)

May be it is the way modern x86 is built, with all its cache-line fetches, cache-prefetches, cold/warm/hot paths etc… ?

Thanks for trying that out.

It looks like it’s all down to specialization inside the library in one of the scenarios but not the others:

The others ought to be better, since they have perfect length information and the copies can’t fail, but apparently that doesn’t happen.

1 Like