Performance comparison between kvec and C


#1

I had a go at testing some raw allocation performance on an i5-7300 against the kvec library (https://github.com/attractivechaos/klib).

Got some interesting results:

C array, preallocated: 0.400 sec
C array, dynamic: 1.073 sec
C vector, dynamic (kv_push): 1.077 sec
C vector, dynamic (kv_a): 0.505 sec
C++ vector, preallocated: 0.402 sec
C++ vector, dynamic: 2.070 sec

Rust, jemalloc:
dynamic: 1.23 sec
pre-allocated: 0.631 sec

Rust, system allocator:
dynamic: 1.27 sec
pre-allocated: 0.585 sec

Some initial conclusions:
Preallocation is the way to go, no matter what the language; but even with dynamic arrays, Rust comfortably outperforms C++ and gets within about 15% of C with the kvec library.
The system allocator seems fine for one-off allocations where there isn't much reallocation going on.


#2

This might be a daft question, but did you run the Rust, C and C++ code with any optimisations enabled?


#3

Can you post the test code? It’s surprising that Rust is slower than klib
here.


#4

Sure. The C/C++ code is the one from the author's blog (https://attractivechaos.wordpress.com/2008/09/19/c-array-vs-c-vector), built in VS2015 release mode.

The Rust version is a trivial loop, optimised and benchmarked:

use test::Bencher; // needs #![feature(test)] and `extern crate test` at the crate root

#[bench]
fn preallocated(b: &mut Bencher) {
    b.iter(|| {
        for _j in 0..10 {
            let mut v = Vec::with_capacity(20000000);
            for i in 0..20000000 {
                v.push(i);
            }
        }
    });
}

#[bench]
fn dynamic(b: &mut Bencher) {
    b.iter(|| {
        for _j in 0..10 {
            let mut v = Vec::new();
            for i in 0..20000000 {
                v.push(i);
            }
        }
    });
}

I had a quick look on godbolt at the assembly. Rust doesn't unroll the loops (-O3 -target-cpu=native), whereas MSVC does a 4x unroll. Would be curious to know if there's an option to unroll the loop to match?


#5

Unrolling annotations are still missing. In the meantime you can add a compilation switch like:

-C llvm-args='-unroll-threshold=1000'


#6

Oddly this didn't make a difference to the generated code on 1.24. Maybe someone else gets a different result on godbolt?


#7

I know this isn’t exactly what you’re testing, but if you want to see some cool assembly, try this instead:

#[bench]
fn dynamic(b: &mut Bencher) {
    b.iter(|| {
        for _j in 0..10 {
            let mut v = Vec::new();
            v.extend(0..20000000);
        }
    });
}

#8

I have to admit that was quite cool. I wonder why in this case it utilises the larger ymm registers and unrolls, but in the simpler original case it does neither. Is this a possible area of optimisation in the compiler?


#9

It likely uses the vector registers because the loop is unrolled. Like inlining for functions, unrolling is the “gateway” optimization for loops.

extend() has a specialization for a TrustedLen iterator - it ensures the vec has just enough capacity to store the entire iterator, and then has a simple raw ptr based loop that copies elements into it.

The push based loop, on the other hand, ends up with a lot more code in the loop body, including at least one guaranteed-opaque function: RawVec::double. This function is explicitly marked #[inline(never)]. Compilers will typically not optimize around these because they can no longer reason about their effects.

