Performance comparison between kvec and C

I had a go at testing some raw allocation performance on an i5-7300 against the kvec library https://github.com/attractivechaos/klib

Got some interesting results

C array, preallocated: 0.400 sec
C array, dynamic: 1.073 sec
C vector, dynamic (kv_push): 1.077 sec
C vector, dynamic (kv_a): 0.505 sec
C++ vector, preallocated: 0.402 sec
C++ vector, dynamic: 2.070 sec

Rust, jemalloc
dynamic: 1.23 sec
preallocated: 0.631 sec

Rust, system allocator
dynamic: 1.27 sec
preallocated: 0.585 sec

Some initial conclusions:
Preallocation is the way to go, no matter what the language; but even with dynamic arrays, Rust comfortably outperforms C++ and gets within about 15% of C and the kvec library.
The system allocator is possibly a good fit for one-off allocations where there isn't a lot of reallocation going on.
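For anyone who wants to reproduce the comparison without the bench harness, here's a minimal standalone sketch (the helper names are mine, not from the original benchmark) that times the same two loops with std::time::Instant:

```rust
use std::time::Instant;

// Hypothetical helpers mirroring the two benchmark bodies.
fn fill_preallocated(n: usize) -> Vec<usize> {
    let mut v = Vec::with_capacity(n); // one allocation up front
    for i in 0..n {
        v.push(i);
    }
    v
}

fn fill_dynamic(n: usize) -> Vec<usize> {
    let mut v = Vec::new(); // grows (and reallocates) as it fills
    for i in 0..n {
        v.push(i);
    }
    v
}

fn main() {
    let n = 20_000_000;

    let t = Instant::now();
    let a = fill_preallocated(n);
    println!("preallocated: {:?}", t.elapsed());

    let t = Instant::now();
    let b = fill_dynamic(n);
    println!("dynamic: {:?}", t.elapsed());

    assert_eq!(a, b); // same contents; only allocation behaviour differs
}
```

Build with optimisations on (e.g. cargo build --release), or the comparison is meaningless.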


This might be a daft question, but did you run the Rust, C and C++ code with any optimisations enabled?


Can you post the test code? It's surprising that Rust is slower than klib here.

Sure. The C/C++ code is the one from the author's blog post https://attractivechaos.wordpress.com/2008/09/19/c-array-vs-c-vector, built in VS2015 release mode.

The Rust version is a trivial loop, optimised and benchmarked:

#[bench]
fn preallocated(b: &mut Bencher) {
    b.iter(|| {
        for _j in 0..10 {
            let mut v = Vec::with_capacity(20000000);
            for i in 0..20000000 {
                v.push(i);
            }
        }
    });
}

#[bench]
fn dynamic(b: &mut Bencher) {
    b.iter(|| {
        for _j in 0..10 {
            let mut v = Vec::new();
            for i in 0..20000000 {
                v.push(i);
            }
        }
    });
}

I had a quick look at the assembly on godbolt. Rust doesn't unroll the loops (-O3 -target-cpu=native), whereas MSVC does a 4x unroll. I'd be curious to know if there's an option to unroll the loop to match.

Unrolling annotations are still missing. In the meantime you can add a compilation switch like:

-C llvm-args='-unroll-threshold=1000'


Oddly, this didn't make a difference to the generated code on 1.24. Maybe someone else gets a different result on godbolt?

I know this isn't exactly what you're testing, but if you want to see some cool assembly, try this instead:

#[bench]
fn dynamic(b: &mut Bencher) {
    b.iter(|| {
        for _j in 0..10 {
            let mut v = Vec::new();
            v.extend(0..20000000);
        }
    });
}

I have to admit that was quite cool. I wonder why in this case it utilises the larger ymm registers, and unrolls, but in the simpler original case it doesn't do either. Is this a possible area of optimisation in the compiler?

It likely uses the vector registers because the loop is unrolled. Like inlining for functions, unrolling is the "gateway" optimization for loops.

extend() has a specialization for a TrustedLen iterator - it ensures the vec has just enough capacity to store the entire iterator, and then has a simple raw ptr based loop that copies elements into it.
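A quick way to convince yourself the two paths agree on the result (the difference is purely in how the allocation happens) is a sketch like this, with names of my own choosing:

```rust
// The push loop and extend over a Range produce identical contents;
// Range<u64> is TrustedLen, so extend can reserve once up front and
// then copy elements in a tight raw-pointer loop.
fn via_push(n: u64) -> Vec<u64> {
    let mut v = Vec::new();
    for i in 0..n {
        v.push(i); // capacity check + possible reallocation on every iteration
    }
    v
}

fn via_extend(n: u64) -> Vec<u64> {
    let mut v = Vec::new();
    v.extend(0..n); // TrustedLen specialization kicks in here
    v
}

fn main() {
    assert_eq!(via_push(1_000), via_extend(1_000));
    println!("contents match");
}
```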

The push based loop, on the other hand, ends up with a lot more code in the loop body, including at least one guaranteed-opaque function: RawVec::double. This function is explicitly marked #[inline(never)]. Compilers will typically not optimize around these because they can no longer reason about their effects.
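You can also watch the reallocations the push loop pays for by logging capacity changes. The exact growth factor is an implementation detail, so this sketch only checks that the capacity jumps several times:

```rust
// Record each distinct capacity a Vec passes through while pushing n elements.
fn capacity_steps(n: usize) -> Vec<usize> {
    let mut v: Vec<u32> = Vec::new();
    let mut steps = vec![v.capacity()]; // starts at 0: no allocation yet
    for i in 0..n {
        v.push(i as u32);
        if v.capacity() != *steps.last().unwrap() {
            steps.push(v.capacity()); // a reallocation happened here
        }
    }
    steps
}

fn main() {
    let steps = capacity_steps(100);
    println!("capacity steps: {:?}", steps);
    // A Vec::with_capacity(100) would go through exactly one allocation instead.
    assert!(steps.len() > 2);
}
```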
