SIMD linear search slower than while loop?

Hello folks,

I am trying to optimize my program by rewriting the linear-search part with SIMD. However, it turns out that the SIMD implementation is 2x slower than the naive while loop.

That really puzzles me, because I tested the same SIMD approach in C++ before, and there it was 4x faster...

This doesn't seem to make sense, and I'm just unable to figure out the reason.
BTW: I checked the assembly generated by rustc, and it looks fine to me.

Code for rust implementation

Code for cpp implementation

// Experiment results for rust version
linear-lowerbound       time:   [35.641 ns 35.762 ns 35.890 ns]                               
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

linear-upperbound       time:   [31.487 ns 31.576 ns 31.680 ns]                               
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) low mild
  1 (1.00%) high severe

     Running target/release/deps/simd_benchmark-a95df7180faacb94
simd-linear-lowerbound  time:   [79.060 ns 79.301 ns 79.587 ns]                                   

simd-linear-upperbound  time:   [80.726 ns 82.009 ns 83.609 ns]                                   
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe

// Experiment results for cpp version
Simd Search:    3.585000 ns
Linear Search:  16.232000 ns

One difference is that, in your Rust version, you allocate and initialize the vector within the timed section, which may be dwarfing the time it takes to search the vector. Try doing initialization before timing begins, like the C++ version.

There's also a potential bounds check when computing the address, but making it unchecked didn't change anything on my machine.

I looked at the assembly and saw this:

callq	core::core_arch::x86::avx2::_mm256_cmpgt_epi32
movq	%r14, %rdi
callq	core::core_arch::x86::avx2::_mm256_movemask_epi8

Now, I don't know a lot about assembly, but aren't these supposed to be inline instructions rather than function calls?

I agree, but that doesn't explain why the SIMD version is slower...

In the generated assembly, the instructions are wrapped in functions:

// core::core_arch::x86::avx2::_mm256_movemask_epi8
	.section	.text._ZN4core9core_arch3x864avx220_mm256_movemask_epi817hac4b55d1f7e84c00E,"ax",@progbits
	.p2align	4, 0x90
	.type	_ZN4core9core_arch3x864avx220_mm256_movemask_epi817hac4b55d1f7e84c00E,@function
	vmovdqa	(%rdi), %ymm0
	vpmovmskb	%ymm0, %eax

How are you building the Rust code? AVX2 isn't enabled by default since not all CPUs support it. One way to enable it is with
RUSTFLAGS='-C target-cpu=native' cargo build --release

The rustflags can also be set for the top-level crate in a `.cargo/config` file; see the Cargo configuration documentation.

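For example, a `.cargo/config.toml` at the crate root could contain:

```toml
[build]
rustflags = ["-C", "target-cpu=native"]
```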
Yup, a quick godbolt experiment confirms that a missing target-cpu flag is the problem here.

@Response777 If you want to detect this problem more easily in the future, you may want to follow the suggestion of the core::arch module documentation and conditionally compile your SIMD functions only on CPU architectures where they are supported. This way, you'll get a clean compiler error when you forget to enable avx2 in your build, instead of this kind of weird behavior.

This is one way to do it:

#[cfg(all(
        any(target_arch = "x86", target_arch = "x86_64"),
        target_feature = "avx2"
))]
pub fn lower_bound_avx2(vec: Vec<i32>, value: i32) -> usize { /* ... */ }

You also want to use

let addr = vec.get_unchecked(i) as *const i32 as *const __m256i;

in your first loop. That will remove the bounds check that couldn't be optimized away automatically.

Thanks a lot!

Thanks! I had actually added the cfgs before, but the compiler just kept saying "cannot find function xxx", so I ended up dropping them...

That's actually why you want the cfgs -- that was the compiler telling you that the target-cpu was wrong. :slight_smile: (Though admittedly the compiler does not say this very clearly.)