Copy Rc<Vec<T>> into a VecDeque<T>

We have:

pub fn fast_copy<T: Clone>(data: Rc<Vec<T>>) -> VecDeque<T> {
    let mut ans = VecDeque::new();
    for x in data.as_ref().iter() {
        ans.push_back(x.clone());
    }
    ans
}

Is there a faster way to do this instead of doing a VecDeque::new() followed by lots of VecDeque::push_back ?
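At a minimum, pre-sizing the deque avoids the repeated reallocation inside the loop. A sketch (the function name `fast_copy_reserved` is mine, not from the thread):

```rust
use std::collections::VecDeque;
use std::rc::Rc;

// Same push_back loop as in the question, but with the final
// capacity reserved up front so no reallocation happens mid-loop.
pub fn fast_copy_reserved<T: Clone>(data: Rc<Vec<T>>) -> VecDeque<T> {
    let mut ans = VecDeque::with_capacity(data.len());
    for x in data.as_ref().iter() {
        ans.push_back(x.clone());
    }
    ans
}

fn main() {
    let d = fast_copy_reserved(Rc::new(vec![10, 20, 30]));
    println!("{:?}", d); // prints [10, 20, 30]
}
```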

Probably by cloning the vector and then using the From implementation:

    let ans: VecDeque<T> = (*data).clone().into();

You can use

    pub fn fast_copy<T: Clone>(data: Rc<Vec<T>>) -> VecDeque<T> {
        data.iter().cloned().collect()
    }

but you’d have to test the performance.
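Whichever option you pick, a quick sanity check that the two approaches produce the same contents (the helper names `copy_collect` and `copy_from` are mine, not from the thread):

```rust
use std::collections::VecDeque;
use std::rc::Rc;

// Collect each cloned element through the iterator pipeline.
fn copy_collect<T: Clone>(data: &Rc<Vec<T>>) -> VecDeque<T> {
    data.iter().cloned().collect()
}

// Clone the whole Vec, then convert via From<Vec<T>>.
fn copy_from<T: Clone>(data: &Rc<Vec<T>>) -> VecDeque<T> {
    (**data).clone().into()
}

fn main() {
    let data = Rc::new(vec![1, 2, 3]);
    assert_eq!(copy_collect(&data), copy_from(&data));
    println!("both produce {:?}", copy_collect(&data));
}
```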


My curiosity got the better of me. Here are the benchmark results.

running 3 tests
test fast_copy1 ... bench:       2,566 ns/iter (+/- 180)
test fast_copy2 ... bench:       2,356 ns/iter (+/- 150)
test fast_copy3 ... bench:         120 ns/iter (+/- 10)

where T is i32. The third version comes down to a tight copy loop, while the other solutions break up that tight loop with their iteration logic. So, as always, tight loops win on x86, or for that matter on any stored-program architecture.



<VecDeque<T> as From<Vec<T>>>::from currently reuses the cloned Vec's (RawVec) allocation (it only reallocates so that the capacity reaches a power of 2, so it potentially triggers just a single memcpy of [T; len] bytes). And given that Vec::clone is quite cheap when T is Copy and small (such as with integer types), it is quite plausible that fast_copy3 is the fastest.
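You can observe the buffer reuse directly by comparing pointers before and after the conversion. A sketch (whether the pointer comparison comes out true depends on the std version, as described above, so it is printed rather than asserted):

```rust
use std::collections::VecDeque;

// Converts a Vec into a VecDeque and reports whether the original
// heap buffer was reused (pointer-equal) rather than reallocated.
fn convert_and_check(v: Vec<i32>) -> (VecDeque<i32>, bool) {
    let buf = v.as_ptr();
    let deque: VecDeque<i32> = v.into();
    // The front slice of a deque built from a Vec covers all elements.
    let reused = deque.as_slices().0.as_ptr() == buf;
    (deque, reused)
}

fn main() {
    let (deque, reused) = convert_and_check((0..1000).collect());
    // On recent std the Vec's buffer is moved into the deque wholesale;
    // older versions could still reallocate to reach a suitable capacity.
    println!("len = {}, buffer reused = {}", deque.len(), reused);
}
```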


The magnitude of the difference here makes me suspicious of a benchmarking quirk.

Compiling the example, I see

warning: unused variable: `out`
  --> src/
36 |         let out: VecDeque<i32> = (*data).clone().into();
   |             ^^^ help: consider prefixing with an underscore: `_out`

That means the compiler might be smart enough to remove the code entirely, since it’s not black-boxed. That bit of code should be

    b.iter(|| {
        let out: VecDeque<i32> = (*data).clone().into();
        out
    });

so that #[bench] ensures that the result isn’t optimized out. (The other ones need similar changes.)
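Outside the `#[bench]` harness, the same effect can be had with `std::hint::black_box` (stable since Rust 1.66; the nightly `test::black_box` works the same way). A minimal sketch, where `convert` is a stand-in name for the benchmarked body:

```rust
use std::collections::VecDeque;
use std::hint::black_box;
use std::rc::Rc;

// The conversion under test: clone the Vec, convert via From<Vec<T>>.
fn convert(data: &Rc<Vec<i32>>) -> VecDeque<i32> {
    (**data).clone().into()
}

fn main() {
    let data = Rc::new((0..1000).collect::<Vec<i32>>());
    // black_box makes the result opaque to the optimizer, so the
    // conversion cannot be proven dead and deleted.
    let out = black_box(convert(&data));
    println!("{}", out.len()); // prints 1000
}
```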

As another minor option to try, since the data is Copy:

#[bench]
fn fast_copy4(b: &mut Bencher) {
    let mut arr = Vec::new();
    arr.resize(1000, 0);
    let data = Rc::new(arr);
    b.iter(|| {
        // T = i32 is Copy, so copy the elements instead of cloning them.
        let out: VecDeque<i32> = data.iter().copied().collect();
        out
    });
}

I did not think about the compiler optimizing out unused code. Good point.
I have amended the code with the above recommendation, along with the fast_copy4 version.

running 4 tests
test fast_copy1 ... bench:       2,519 ns/iter (+/- 254)
test fast_copy2 ... bench:       2,415 ns/iter (+/- 228)
test fast_copy3 ... bench:         119 ns/iter (+/- 13)
test fast_copy4 ... bench:       2,501 ns/iter (+/- 311)

I too feel fast_copy3() is insanely fast compared to the other versions.

If I reduce the resize() from 1000 to 100:

test fast_copy1 ... bench:         392 ns/iter (+/- 10)
test fast_copy2 ... bench:         253 ns/iter (+/- 19)
test fast_copy3 ... bench:          81 ns/iter (+/- 2)
test fast_copy4 ... bench:         257 ns/iter (+/- 21)

Maybe it is the way modern x86 is built, with all its cache-line fetches, cache prefetching, cold/warm/hot paths, etc.?

Thanks for trying that out.

It looks like it’s all down to specialization inside the library in one of the scenarios but not the others.

The others ought to be better, since they have perfect length information and the copies can’t fail, but apparently that doesn’t happen.
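The "perfect length information" approach could be made explicit by pre-sizing and extending; whether it beats clone + From still depends on which internal specializations fire. A sketch (the name `copy_with_capacity` is mine, not from the thread):

```rust
use std::collections::VecDeque;
use std::rc::Rc;

// Pre-size the buffer from the known length so extend never reallocates;
// the per-element copy may still lose to From<Vec<T>>'s bulk buffer reuse.
fn copy_with_capacity<T: Clone>(data: &Rc<Vec<T>>) -> VecDeque<T> {
    let mut out = VecDeque::with_capacity(data.len());
    out.extend(data.iter().cloned());
    out
}

fn main() {
    let data = Rc::new(vec![1, 2, 3]);
    let out = copy_with_capacity(&data);
    println!("{:?}", out); // prints [1, 2, 3]
}
```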
