Rust autovectorization issues

Hi, everyone!

Before you close the tab: yes, I did build the Rust code with the --release option :slight_smile:

What I am trying to do here is get the Rust code within at least the same order of magnitude as the C++ version in this simple benchmark.

Here are the details: the idea of this benchmark is to sum all elements in a vector that are greater than 0.5. Here is the reference code using NumPy:

In [1]: import numpy as np
In [2]: v = np.random.rand(100000000)
In [3]: %timeit np.sum(v[v>0.5])
907 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

If we use jit from numba, the time can be significantly reduced:

In [4]: from numba import jit
In [5]: @jit
   ...: def sum_selected(v: np.ndarray):
   ...:     res_sum = np.float64(0.0)
   ...:     for i in v:
   ...:         if i > 0.5:
   ...:             res_sum += i
   ...:     return res_sum
   ...: 

In [6]: %timeit sum_selected(v)
161 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Now, here is the Rust version using iterators:

use rand::prelude::*;
use std::time::Instant;

fn main() {
    let mut rng = rand::rng();

    let arr: Vec<f64> = Vec::from_iter((0..100000000).map(|_| rng.random::<f64>()));

    let t1 = Instant::now();
    let sum: f64 = arr
        .iter()
        .filter(|&&x| x > 0.5f64)
        .sum();

    let t2 = t1.elapsed();

    println!("{:8.4},{}", sum, t2.as_secs_f64());
}

The mean elapsed time is 0.22727 s ± 0.00361 s. As you can see, it is even slower than the Python version with JIT.

I also tried looping over the vector, but got almost no difference compared to the iterator version: 0.22353 s ± 0.00308 s.

use rand::prelude::*;
use std::time::Instant;

fn main() {
    let mut rng = rand::rng();

    let arr: Vec<f64> = Vec::from_iter((0..100000000).map(|_| rng.random::<f64>()));

    let mut sum: f64 = 0.0;
    let t1 = Instant::now();
    for el in arr.iter() {
        if *el > 0.5f64 {
            sum += *el;
        }
    }
    let t2 = t1.elapsed();

    println!("{:8.4},{}", sum, t2.as_secs_f64());
}

Both examples were built with the following options:

RUSTFLAGS="-C target-cpu=native -C target-feature=+avx2" cargo run --release

In the emitted assembly I can see that the code was vectorized, but it is still slower than Python with JIT.

This C++ code is faster than the Rust version by an order of magnitude: 0.07285 s ± 0.00139 s. (I also benchmarked Fortran with results similar to C++, but I will omit the Fortran code.)

#include <chrono>
#include <climits>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <stdfloat>
#include <vector>

#define N 100000000

int main()
{
	std::srand(std::time(NULL));

	std::vector<std::float64_t> arr;

	for (size_t i = 0; i < N; ++i) {
		arr.push_back(static_cast<std::float64_t>(std::rand())/RAND_MAX);
	}

	std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();

	std::float64_t sum_res = 0.0f64;
	for (auto el: arr) {
		if (el > 0.5f64)
			sum_res += el;
	}

	std::chrono::steady_clock::time_point t2 = std::chrono::steady_clock::now();
	auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count();

	std::cout << sum_res << "," << static_cast<std::float64_t>(elapsed)/1000000 << std::endl;
}

Here is how I compiled the C++ code:

g++ -Wall -Wextra -Ofast -march=native -mtune=native -ffast-math -std=c++23 select_elements.cpp

Why is the Rust variant slower than C++? Do you have any ideas how I can speed up the code? I am not an expert in Rust, so maybe I am doing something wrong or missing important compiler options. I am aware of the simd and intrinsics modules, but they require nightly Rust and, frankly speaking, they ask me to do the vectorization manually, in contrast to C++, where the work is delegated to the compiler.

Here are the versions of the software I used:

gcc version 14.2.1 20250207
rust 1.85.0
python-numba 0.61.0
python-numpy 2.2.3

This may be why. The -ffast-math flag prioritizes performance over adherence to IEEE 754, so it can produce wildly inaccurate results or even introduce miscompilations into a UB-free program.

Rust has decided that our floats are always strictly IEEE 754 floats to avoid the problems that have historically been associated with -ffast-math.
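
As a concrete illustration of why reassociation is forbidden under strict IEEE 754 semantics, here is a tiny standalone demo (not taken from your benchmark):

fn main() {
    // f64 addition is not associative, so the compiler may not
    // reassociate a sum unless fast-math-style flags allow it.
    let (a, b, c) = (0.1f64, 0.2, 0.3);
    println!("{}", (a + b) + c); // 0.6000000000000001
    println!("{}", a + (b + c)); // 0.6
}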


Thanks for the suggestion. TIL that -Ofast enables the -ffast-math flag. I removed -ffast-math and set -O3:

g++ -Wall -Wextra -O3 -march=native -mtune=native -std=c++23 select_elements.cpp

The elapsed time is 0.58802 s ± 0.01281 s, so it is slower than the Rust version now. Interesting.

But I still wonder why numba produces such fast code even when I explicitly turn off fast math:

In [3]: @jit(fastmath=False)
   ...: def sum_selected(v: np.ndarray):
   ...:     res_sum = np.float64(0.0)
   ...:     for i in v:
   ...:         if i > 0.5:
   ...:             res_sum += i
   ...:     return res_sum
   ...: 
In [4]: v = np.random.rand(100000000)

In [5]: %timeit sum_selected(v)
163 ms ± 888 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Try changing this to

println!("{:8.4},{}", {sum}, t2.as_secs_f64());

Or extract the loop out into a separate function.

Taking the address of something -- like println does -- makes it harder for LLVM to optimize.
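
For example, a rough sketch of the "separate function" variant (sum_selected is just a placeholder name; #[inline(never)] and std::hint::black_box are extra precautions to keep the optimizer from folding the benchmarked work into its surroundings):

use std::hint::black_box;

#[inline(never)]
fn sum_selected(arr: &[f64]) -> f64 {
    let mut sum = 0.0f64;
    for &el in arr {
        if el > 0.5 {
            sum += el;
        }
    }
    sum
}

// In main, time only the call:
// let t1 = Instant::now();
// let sum = sum_selected(black_box(&arr));
// let t2 = t1.elapsed();
// println!("{:8.4},{}", black_box(sum), t2.as_secs_f64());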

(I don't know if that's the problem here, but we've seen it be a problem before. Benchmarking properly is hard, and I'm always skeptical of anyone doing it by hand with Instant instead of something using statistical methods.)


Hi! I tried both of your suggestions, but they didn't improve anything at all. Anyway, thank you for your help. I agree that -ffast-math is dangerous to use, and without it the C++ version loses its competitive edge over Rust.

The numba asm partially unrolls the loop. I am unsure why LLVM does that. Maybe numba emits some IR for loops that does something like chunks_exact?

.LBB0_5:
        vmovsd  (%rdx), %xmm3
        vcmpltsd        %xmm3, %xmm1, %xmm4
        vblendvpd       %xmm4, %xmm3, %xmm2, %xmm3
        vaddsd  %xmm3, %xmm0, %xmm0
        vmovsd  (%r8,%rdx), %xmm3
        vcmpltsd        %xmm3, %xmm1, %xmm4
        vblendvpd       %xmm4, %xmm3, %xmm2, %xmm3
        vaddsd  %xmm3, %xmm0, %xmm0
        vmovsd  (%rdx,%r8,2), %xmm3
        vcmpltsd        %xmm3, %xmm1, %xmm4
        vblendvpd       %xmm4, %xmm3, %xmm2, %xmm3
        vaddsd  %xmm3, %xmm0, %xmm0
        vmovsd  (%rbx,%rdx), %xmm3
        addq    $4, %rsi
        vcmpltsd        %xmm3, %xmm1, %xmm4
        vblendvpd       %xmm4, %xmm3, %xmm2, %xmm3
        vaddsd  %xmm3, %xmm0, %xmm0
        addq    %r14, %rdx
        cmpq    %rsi, %r11
        jne     .LBB0_5

while the Rust version does not.

Here is the LLVM IR from numba:

; Function Attrs: nofree norecurse nosync nounwind
define i32 @_ZN8__main__12sum_selectedB2v2B38c8tJTIcFKzyF2ILShI4CrgQElQb6HczSBAA_3dE5ArrayIdLi1E1C7mutable7alignedE(double* noalias nocapture writeonly %retptr, { i8*, i32, i8*, i8*, i32 }** noalias nocapture readnone %excinfo, i8* nocapture readnone %arg.v.0, i8* nocapture readnone %arg.v.1, i64 %arg.v.2, i64 %arg.v.3, double* %arg.v.4, i64 %arg.v.5.0, i64 %arg.v.6.0) local_unnamed_addr #0 {
entry:
  %.715 = icmp sgt i64 %arg.v.5.0, 0
  br i1 %.715, label %B62.lr.ph, label %B88

B62.lr.ph:                                        ; preds = %entry
  %0 = add i64 %arg.v.5.0, -1
  %xtraiter = and i64 %arg.v.5.0, 3
  %1 = icmp ult i64 %0, 3
  br i1 %1, label %B88.loopexit.unr-lcssa, label %B62.lr.ph.new

B62.lr.ph.new:                                    ; preds = %B62.lr.ph
  %2 = ptrtoint double* %arg.v.4 to i64
  %unroll_iter = and i64 %arg.v.5.0, -4
  %3 = shl i64 %arg.v.6.0, 2
  %4 = mul i64 %arg.v.6.0, 3
  br label %B62

B62:                                              ; preds = %B62, %B62.lr.ph.new
  %lsr.iv11 = phi i64 [ %lsr.iv.next12, %B62 ], [ %2, %B62.lr.ph.new ]
  %res_sum.2.07 = phi double [ 0.000000e+00, %B62.lr.ph.new ], [ %res_sum.3.1.3, %B62 ]
  %.22.06 = phi i64 [ 0, %B62.lr.ph.new ], [ %.92.3, %B62 ]
  %.88 = inttoptr i64 %lsr.iv11 to double*
  %.89 = load double, double* %.88, align 8
  %.125 = fcmp ogt double %.89, 5.000000e-01
  %.129 = select i1 %.125, double %.89, double -0.000000e+00
  %res_sum.3.1 = fadd double %res_sum.2.07, %.129
  %5 = add i64 %arg.v.6.0, %lsr.iv11
  %.88.1 = inttoptr i64 %5 to double*
  %.89.1 = load double, double* %.88.1, align 8
  %.125.1 = fcmp ogt double %.89.1, 5.000000e-01
  %.129.1 = select i1 %.125.1, double %.89.1, double -0.000000e+00
  %res_sum.3.1.1 = fadd double %res_sum.3.1, %.129.1
  %sunkaddr = inttoptr i64 %lsr.iv11 to double*
  %sunkaddr13 = mul i64 %arg.v.6.0, 2
  %6 = bitcast double* %sunkaddr to i8*
  %sunkaddr14 = getelementptr i8, i8* %6, i64 %sunkaddr13
  %7 = bitcast i8* %sunkaddr14 to double*
  %.89.2 = load double, double* %7, align 8
  %.125.2 = fcmp ogt double %.89.2, 5.000000e-01
  %.129.2 = select i1 %.125.2, double %.89.2, double -0.000000e+00
  %res_sum.3.1.2 = fadd double %res_sum.3.1.1, %.129.2
  %8 = add i64 %4, %lsr.iv11
  %.88.3 = inttoptr i64 %8 to double*
  %.89.3 = load double, double* %.88.3, align 8
  %.92.3 = add nuw i64 %.22.06, 4
  %.125.3 = fcmp ogt double %.89.3, 5.000000e-01
  %.129.3 = select i1 %.125.3, double %.89.3, double -0.000000e+00
  %res_sum.3.1.3 = fadd double %res_sum.3.1.2, %.129.3
  %lsr.iv.next12 = add i64 %lsr.iv11, %3
  %niter.ncmp.3 = icmp eq i64 %unroll_iter, %.92.3
  br i1 %niter.ncmp.3, label %B88.loopexit.unr-lcssa, label %B62

B88.loopexit.unr-lcssa:                           ; preds = %B62, %B62.lr.ph
  %res_sum.3.1.lcssa.ph = phi double [ undef, %B62.lr.ph ], [ %res_sum.3.1.3, %B62 ]
  %res_sum.2.07.unr = phi double [ 0.000000e+00, %B62.lr.ph ], [ %res_sum.3.1.3, %B62 ]
  %.22.06.unr = phi i64 [ 0, %B62.lr.ph ], [ %.92.3, %B62 ]
  %lcmp.mod.not = icmp eq i64 %xtraiter, 0
  br i1 %lcmp.mod.not, label %B88, label %B62.epil.preheader

B62.epil.preheader:                               ; preds = %B88.loopexit.unr-lcssa
  %9 = ptrtoint double* %arg.v.4 to i64
  %10 = mul i64 %.22.06.unr, %arg.v.6.0
  %11 = add i64 %9, %10
  br label %B62.epil

B62.epil:                                         ; preds = %B62.epil.preheader, %B62.epil
  %lsr.iv9 = phi i64 [ %xtraiter, %B62.epil.preheader ], [ %lsr.iv.next10, %B62.epil ]
  %lsr.iv = phi i64 [ %11, %B62.epil.preheader ], [ %lsr.iv.next, %B62.epil ]
  %res_sum.2.07.epil = phi double [ %res_sum.3.1.epil, %B62.epil ], [ %res_sum.2.07.unr, %B62.epil.preheader ]
  %.88.epil = inttoptr i64 %lsr.iv to double*
  %.89.epil = load double, double* %.88.epil, align 8
  %.125.epil = fcmp ogt double %.89.epil, 5.000000e-01
  %.129.epil = select i1 %.125.epil, double %.89.epil, double -0.000000e+00
  %res_sum.3.1.epil = fadd double %res_sum.2.07.epil, %.129.epil
  %lsr.iv.next = add i64 %lsr.iv, %arg.v.6.0
  %lsr.iv.next10 = add nsw i64 %lsr.iv9, -1
  %epil.iter.cmp.not = icmp eq i64 %lsr.iv.next10, 0
  br i1 %epil.iter.cmp.not, label %B88, label %B62.epil, !llvm.loop !0

B88:                                              ; preds = %B62.epil, %B88.loopexit.unr-lcssa, %entry
  %res_sum.2.0.lcssa = phi double [ 0.000000e+00, %entry ], [ %res_sum.3.1.lcssa.ph, %B88.loopexit.unr-lcssa ], [ %res_sum.3.1.epil, %B62.epil ]
  store double %res_sum.2.0.lcssa, double* %retptr, align 8
  ret i32 0
}

I do not fully understand LLVM IR, but unroll_iter smells a bit like manual unrolling.

Yeah, that looks like the core of the difference.

If you isolate just the summing loop, the optimized LLVM IR is one-double-at-a-time on the default target: https://rust.godbolt.org/z/E6j64acnz

You can get much shorter ASM if you target AVX2, but it's still one-at-a-time: https://rust.godbolt.org/z/5Yzbazo8n. I don't know if that's materially faster.

Of course, on nightly you can also manually vectorize it, if you're ok with getting a different answer: https://rust.godbolt.org/z/7oWMeMPeG
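
For reference, here is a rough sketch of what that manual vectorization can look like with the nightly portable-SIMD API (my approximation, not necessarily the exact code behind the link; the per-lane partial sums are why the answer can differ from the sequential one):

#![feature(portable_simd)]
use std::simd::prelude::*;

pub fn sum_selected_simd(arr: &[f64]) -> f64 {
    let zero = f64x4::splat(0.0);
    let half = f64x4::splat(0.5);
    let mut acc = zero;

    // Four partial sums advance in parallel, one per SIMD lane.
    let mut chunks = arr.chunks_exact(4);
    for chunk in &mut chunks {
        let v = f64x4::from_slice(chunk);
        acc += v.simd_gt(half).select(v, zero);
    }

    // Fold the lanes together and handle the leftover elements.
    acc.reduce_sum()
        + chunks
            .remainder()
            .iter()
            .copied()
            .filter(|&x| x > 0.5)
            .sum::<f64>()
}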

Floating-point addition is not associative, but -Ofast/-ffast-math ignores this, which allows the compiler to vectorize the code by computing 4 partial sums in parallel and then adding them together at the end. Rust respects the lack of associativity and updates the sum one addition at a time.

In Rust you can get similar behavior by manually computing multiple partial sums at once:

use rand::prelude::*;
use std::time::Instant;

fn main() {
    let mut rng = rand::rng();

    let arr: Vec<f64> = Vec::from_iter((0..100000000).map(|_| rng.random::<f64>()));

    let t1 = Instant::now();

    let sum_a: f64 = arr.iter().filter(|&&x| x > 0.5f64).sum();

    let t2 = t1.elapsed();
    let t3 = Instant::now();

    const CHUNK: usize = 4;
    let mut sums = [0.0f64; CHUNK];
    let mut chunks = arr.chunks_exact(CHUNK);
    for chunk in &mut chunks {
        for i in 0..CHUNK {
            let x = chunk[i];
            sums[i] += if x > 0.5f64 { x } else { 0. };
        }
    }
    let sum_b: f64 = chunks
        .remainder()
        .into_iter()
        .filter(|&&x| x > 0.5f64)
        .sum::<f64>()
        + sums.into_iter().sum::<f64>();

    let t4 = t3.elapsed();

    println!(
        "{:8.4},{} -- {:8.4},{}",
        sum_a,
        t2.as_secs_f64(),
        sum_b,
        t4.as_secs_f64()
    );
}

Wow, thanks a lot. Your code is now on par with the C++ example; in my case the mean time is 0.07355 s ± 0.00054 s.