Rust system allocator
dynamic : 1.27s
pre-allocated: 0.585s
Some initial conclusions
Preallocation is the way to go, whatever the language. Even with dynamic arrays, though, Rust comfortably outperforms C++ and gets within 15% of the C/kvec library.
The system allocator is possibly good for one-off allocations where there's not a lot of reallocation going on.
The Rust version is a trivial loop, optimised and benchmarked with #[bench]:
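#![feature(test)] // #[bench] needs the unstable `test` crate (nightly only)
extern crate test;
use test::Bencher;

// Reserve the full capacity up front, then push 20M elements, 10 times over.
#[bench]
fn preallocated(b: &mut Bencher) {
    b.iter(|| {
        for _j in 0..10 {
            let mut v = Vec::with_capacity(20000000);
            for i in 0..20000000 {
                v.push(i);
            }
        }
    });
}

// Same loop, but the vector starts empty and grows on demand.
#[bench]
fn dynamic(b: &mut Bencher) {
    b.iter(|| {
        for _j in 0..10 {
            let mut v = Vec::new();
            for i in 0..20000000 {
                v.push(i);
            }
        }
    });
}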
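For anyone without a nightly toolchain, here is a rough sketch of the same comparison on stable Rust using std::time::Instant. The function names fill_preallocated and fill_dynamic are mine, not from the benchmark above. It times a single pass rather than ten, and it uses u32 elements, so treat the numbers as indicative only and build with --release.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Push `n` integers into a vector whose capacity is reserved up front.
fn fill_preallocated(n: u32) -> Vec<u32> {
    let mut v = Vec::with_capacity(n as usize);
    for i in 0..n {
        v.push(i);
    }
    v
}

/// Push `n` integers into a vector that starts empty and grows on demand.
fn fill_dynamic(n: u32) -> Vec<u32> {
    let mut v = Vec::new();
    for i in 0..n {
        v.push(i);
    }
    v
}

fn main() {
    let n = 20_000_000; // same element count as the benchmarks above

    // black_box keeps the optimiser from discarding the unused vectors.
    let start = Instant::now();
    black_box(fill_preallocated(black_box(n)));
    println!("pre-allocated: {:?}", start.elapsed());

    let start = Instant::now();
    black_box(fill_dynamic(black_box(n)));
    println!("dynamic:       {:?}", start.elapsed());
}
```

A single timed pass is much noisier than what the bench harness reports, so it is worth running this a few times before reading anything into the gap.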
I had a quick look at the assembly on godbolt. Rust doesn't unroll the loops (-C opt-level=3 -C target-cpu=native), whereas MSVC unrolls 4x. Would be curious to know if there's an option to unroll the loop to match?
I have to admit that was quite cool. I wonder why in this case it uses the larger ymm registers and unrolls, but does neither in the simpler original case. Is this a possible area of optimisation in the compiler?
It likely uses the vector registers because the loop is unrolled. Like inlining for functions, unrolling is the "gateway" optimization for loops.
extend() has a specialization for TrustedLen iterators: it ensures the vec has just enough capacity to store the entire iterator, and then runs a simple raw-pointer loop that copies the elements into it.
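To make the comparison concrete, here is a minimal sketch of the two ways of filling the vector. The function names are made up for illustration. As far as I know, an integer range such as 0..n is one of the iterators that qualifies for the TrustedLen path, but that is a detail of the standard library's implementation rather than a documented guarantee.

```rust
/// Fill a vector one element at a time; every push checks the capacity.
fn fill_with_push(n: u32) -> Vec<u32> {
    let mut v = Vec::new();
    for i in 0..n {
        v.push(i);
    }
    v
}

/// Fill a vector from a range; the range knows its exact length,
/// so extend can reserve the space once before copying.
fn fill_with_extend(n: u32) -> Vec<u32> {
    let mut v = Vec::new();
    v.extend(0..n);
    v
}

fn main() {
    let n = 1_000_000;
    let pushed = fill_with_push(n);
    let extended = fill_with_extend(n);
    // Both approaches produce the same contents; only the generated code differs.
    assert_eq!(pushed, extended);
    println!("both vectors hold {} elements", extended.len());
}
```

Pasting both functions into godbolt side by side is probably the easiest way to see the difference in the loop bodies.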
The push-based loop, on the other hand, ends up with a lot more code in the loop body, including a call to at least one guaranteed-opaque function: RawVec::double, which is explicitly marked #[inline(never)]. Compilers will typically not optimize around such calls because they can no longer reason about their effects.
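A quick way to see how often the growth path is actually taken is to watch the capacity while pushing. This is only a sketch: count_reallocations is a hypothetical helper, and a change in capacity is a proxy for a call into the growth code, not a direct measurement of it. The exact growth strategy is unspecified, so the count for the dynamic case may differ between Rust versions.

```rust
/// Push `n` elements into `v` and count how many times its capacity changes.
fn count_reallocations(mut v: Vec<u32>, n: u32) -> usize {
    let mut reallocations = 0;
    let mut capacity = v.capacity();
    for i in 0..n {
        v.push(i);
        if v.capacity() != capacity {
            // The vector had to grow to fit this element.
            reallocations += 1;
            capacity = v.capacity();
        }
    }
    reallocations
}

fn main() {
    let n = 1_000_000;
    let dynamic = count_reallocations(Vec::new(), n);
    let preallocated = count_reallocations(Vec::with_capacity(n as usize), n);
    println!("dynamic:       {} capacity changes", dynamic);
    println!("pre-allocated: {} capacity changes", preallocated);
}
```

The pre-allocated vector never grows, since with_capacity reserves room for at least n elements. Even so, the push loop still carries the capacity check and the cold growth call in its body, which fits the explanation above for why it does not optimise as well as extend.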