SIMD linear search slower than while loop?

Response777 · January 7, 2020, 7:23am

Hello folks,

I am trying to optimize my program by rewriting the linear search part with SIMD. However, it turns out that the SIMD implementation is 2x slower than naive while loop.

That really puzzled me, because I have also tested in C++ before, and the same SIMD version was 4x faster...

Apparently it doesn't make sense, but I am just unable to figure out the reason.
BTW: I checked the assembly generated by rustc, and it looks fine to me

Code for rust implementation

Code for cpp implementation

// Experiment results for rust version
linear-lowerbound       time:   [35.641 ns 35.762 ns 35.890 ns]                               
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

linear-upperbound       time:   [31.487 ns 31.576 ns 31.680 ns]                               
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) low mild
  1 (1.00%) high severe

     Running target/release/deps/simd_benchmark-a95df7180faacb94
simd-linear-lowerbound  time:   [79.060 ns 79.301 ns 79.587 ns]                                   

simd-linear-upperbound  time:   [80.726 ns 82.009 ns 83.609 ns]                                   
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe


// Experiment results for cpp version
Simd Search:    3.585000 ns
Linear Search:  16.232000 ns

jessa0 · January 7, 2020, 7:51am

One difference is that, in your Rust version, you allocate and initialize the vector within the timed section, which may be dwarfing the time it takes to search the vector. Try doing initialization before timing begins, like the C++ version.

alice · January 7, 2020, 8:10am

There's also a potential bounds check when getting the addr, but it didn't change anything on my machine when I made it unchecked.

I looked at the assembly and saw this:

callq	core::core_arch::x86::avx2::_mm256_cmpgt_epi32
movq	%r14, %rdi
callq	core::core_arch::x86::avx2::_mm256_movemask_epi8

Now, I don't know a lot about assembly, but are these not supposed to be instructions as opposed to function calls?

Response777 · January 7, 2020, 8:30am

I agree, but that doesn't explain why simd version is slower...

Response777 · January 7, 2020, 8:37am

In the assembly generated, instructions are wrapped with functions:

// core::core_arch::x86::avx2::_mm256_movemask_epi8
	.section	.text._ZN4core9core_arch3x864avx220_mm256_movemask_epi817hac4b55d1f7e84c00E,"ax",@progbits
	.p2align	4, 0x90
	.type	_ZN4core9core_arch3x864avx220_mm256_movemask_epi817hac4b55d1f7e84c00E,@function
_ZN4core9core_arch3x864avx220_mm256_movemask_epi817hac4b55d1f7e84c00E:
	.cfi_startproc
	vmovdqa	(%rdi), %ymm0
	vpmovmskb	%ymm0, %eax
	vzeroupper
	retq

Cocalus · January 7, 2020, 9:30am

How are you building the Rust code? AVX2 isn't enabled by default since not all CPUs support it. One way to enable it is with
RUSTFLAGS='-C target-cpu=native' cargo build --release

The rust flags can be set in the top level crate with .cargo/config see Configuration - The Cargo Book

HadrienG · January 7, 2020, 9:41am

Yup, a quick godbolt experiment proves that bad target-cpu flags are the problem here.

@Response777 If you want to detect this problem more easily in the future, you may want to follow the suggestion of the core::arch module documentation to conditionally compile your SIMD functions only on CPU architectures where they are supported. This way, you'll get a clean compiler error when you forget to enable avx in your build instead of this kind of weird behavior.

This is one way to do it:

#[cfg(
    all(
        any(target_arch = "x86", target_arch = "x86_64"),
        target_feature = "avx2"
    )
)]
pub fn lower_bound_avx2(vec: Vec<i32>, value: i32) -> usize { /* ... */ }

Phlopsi · January 7, 2020, 9:56am

You also want to use

let addr = &vec.get_unchecked(i) as *const i32 as *const __m256i;

in your first loop. That will remove the bounds check that couldn't be optimized away automatically.

Response777 · January 7, 2020, 10:50am

Thanks a lot!

Response777 · January 7, 2020, 10:54am

Thanks, I added cfgs before. But the compiler just kept saying "cannot find function xxx", so I ended up dropping these flags...

cbiffle · January 7, 2020, 5:48pm

That's actually why you want the cfgs -- that was the compiler telling you that the target-cpu was wrong. (Though admittedly the compiler does not say this very clearly.)

system · April 6, 2020, 5:48pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
SIMD help with code slower in Release mode help	7	1251	November 27, 2019
SIMD version of function runs slower than normal version help	16	1525	May 24, 2021
Looking for help understanding Rust's performance vs C++ community	28	8445	November 1, 2019
Why is Rust n-body benchmark slower than C++?	12	5662	January 12, 2023
We need to do better in the benchmarks game	6	1682	January 12, 2023

SIMD linear search slower than while loop?

Related topics