Why is SIMD _mm_rol_epi32 so slow?

Ezio · July 6, 2022, 8:24am

I tried to use SIMD to optimize SM3. In the algorithm, I tried to use _mm_rol_epi32 to implement P1(X) = X ⊕ (X <<< 15) ⊕ (X <<< 23). At first I used rotate_left for X <<< 15, but it took too much time. Then I replaced rotate_left with _mm_rol_epi32 so that four u32 values are computed at the same time.

However, this takes longer than before.

So I'm really confused, and I hope someone can give me some advice.
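For reference, the scalar form of the permutation described above can be sketched as follows. This is a minimal illustration, not the poster's code: the function names p1 and p1_x4 are made up here, and p1_x4 only shows the four-words-at-a-time shape that a 128-bit register would cover.

```rust
/// SM3 permutation P1(X) = X ^ (X <<< 15) ^ (X <<< 23), scalar version.
/// (Name `p1` is hypothetical; it does not come from the thread.)
fn p1(x: u32) -> u32 {
    x ^ x.rotate_left(15) ^ x.rotate_left(23)
}

/// Applies `p1` to four words, the same amount of work a single
/// 128-bit SIMD register would handle in one pass.
fn p1_x4(words: [u32; 4]) -> [u32; 4] {
    words.map(p1)
}
```

For example, p1(1) sets bits 0, 15 and 23, giving 0x00808001. A scalar version like this is also a handy baseline to compare any SIMD variant against, both for correctness and for timing.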

bjorn3 · July 6, 2022, 10:30am

Can you show the actual code? That way we don't have to guess what you did. Also did you enable the avx512f and avx512vl target features? If not these functions won't be inlined (LLVM only inlines if the callee doesn't have any target features enabled that the caller doesn't).

kornel · July 6, 2022, 4:34pm

You may need to set -C target-cpu=<cpu> in RUSTFLAGS to enable SIMD instructions.

Ezio · July 7, 2022, 1:19am

Here is my code:

use std::arch::x86_64::*;
pub fn expand_four_w(w_origin: &mut [u32; 20], w0: usize, w7: usize, w13: usize, w3: usize, w10: usize) {
    // Safety: each index must satisfy index + 4 <= 20, because every load and
    // store below touches four consecutive u32 words.
    unsafe {
        // The result is written back into the buffer, so take a mutable pointer.
        let w_origin_ptr = w_origin.as_mut_ptr();
        let w_first_quarter = _mm_loadu_si128(w_origin_ptr.add(w0) as *const __m128i);
        let w_second_quarter = _mm_loadu_si128(w_origin_ptr.add(w7) as *const __m128i);
        let w_third_quarter = _mm_loadu_si128(w_origin_ptr.add(w13) as *const __m128i);

        let w_third_quarter = _mm_rol_epi32::<15>(w_third_quarter);

        // Input to P1: w[w0..] ^ w[w7..] ^ (w[w13..] <<< 15)
        let p1_tmp = _mm_xor_si128(w_first_quarter, w_second_quarter);
        let p1 = _mm_xor_si128(p1_tmp, w_third_quarter);

        // P1(X) = X ^ (X <<< 15) ^ (X <<< 23)
        let p1_second = _mm_rol_epi32::<15>(p1);
        let p1_third = _mm_rol_epi32::<23>(p1);

        let p1_res_tmp = _mm_xor_si128(p1, p1_second);

        let p1_res = _mm_xor_si128(p1_res_tmp, p1_third);
        let w_fourth_quarter = _mm_loadu_si128(w_origin_ptr.add(w3) as *const __m128i);

        let w_fourth_quarter = _mm_rol_epi32::<7>(w_fourth_quarter);

        let w_result_tmp = _mm_xor_si128(p1_res, w_fourth_quarter);
        let w_fifth_quarter = _mm_loadu_si128(w_origin_ptr.add(w10) as *const __m128i);

        let w_result = _mm_xor_si128(w_result_tmp, w_fifth_quarter);

        // Overwrite the four words starting at w0 with the expanded words.
        _mm_storeu_si128(w_origin_ptr.add(w0) as *mut __m128i, w_result);

        // When the window starts at 0, mirror the result into words 16..20.
        if w0 == 0 {
            _mm_storeu_si128(w_origin_ptr.add(16) as *mut __m128i, w_result);
        }
    }
}

And I run it with RUSTFLAGS=-Ctarget-feature=+avx2 cargo test --release

Ezio · July 7, 2022, 1:21am

My code is shown above. I enabled the +avx2 target feature: RUSTFLAGS=-Ctarget-feature=+avx2 cargo test

scottmcm · July 7, 2022, 4:36am

Well according to https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_rol_epi32&ig_expand=5845, just avx2 isn't going to help you. It needs "AVX512F + AVX512VL".
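Since _mm_rol_epi32 is only available with AVX-512, one alternative that does not depend on it is to emulate each rotate with two shifts and an OR, which needs nothing beyond SSE2 (part of the x86_64 baseline). The sketch below is an illustration under that assumption, not something proposed in the thread; the names p1_scalar and p1_x4 are made up, and no claim is made about how it performs relative to the scalar code.

```rust
/// Scalar reference for P1(X) = X ^ (X <<< 15) ^ (X <<< 23).
fn p1_scalar(x: u32) -> u32 {
    x ^ x.rotate_left(15) ^ x.rotate_left(23)
}

/// P1 on four lanes at once. On x86_64 this uses SSE2 only, emulating each
/// rotate as (x << n) | (x >> (32 - n)) instead of calling _mm_rol_epi32.
#[cfg(target_arch = "x86_64")]
fn p1_x4(words: [u32; 4]) -> [u32; 4] {
    use std::arch::x86_64::*;
    // SAFETY: SSE2 is always available on x86_64, and the unaligned load and
    // store each touch exactly the 16 bytes of a [u32; 4].
    unsafe {
        let x = _mm_loadu_si128(words.as_ptr() as *const __m128i);
        let rot15 = _mm_or_si128(_mm_slli_epi32::<15>(x), _mm_srli_epi32::<17>(x));
        let rot23 = _mm_or_si128(_mm_slli_epi32::<23>(x), _mm_srli_epi32::<9>(x));
        let result = _mm_xor_si128(_mm_xor_si128(x, rot15), rot23);
        let mut out = [0u32; 4];
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, result);
        out
    }
}

/// Fallback for other architectures: plain scalar code.
#[cfg(not(target_arch = "x86_64"))]
fn p1_x4(words: [u32; 4]) -> [u32; 4] {
    words.map(p1_scalar)
}
```

Comparing p1_x4 against p1_scalar lane by lane is a quick way to confirm the emulated rotate is correct before measuring anything.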

Ezio · July 7, 2022, 6:52am

I tried to use RUSTFLAGS=-Ctarget-feature=AVX512F + AVX512VL cargo test --release.

The terminal returned: +: command not found

Do you mean I need to use

RUSTFLAGS=-Ctarget-feature=AVX512VL cargo test --release sm3_time_test

or

RUSTFLAGS=-Ctarget-feature=AVX512F cargo test --release ?

I have tried both of these flags, but the performance did not improve.

bjorn3 · July 7, 2022, 8:26am

It needs to be RUSTFLAGS=-Ctarget-feature=+avx512f,+avx512vl. Also note that your CPU must support AVX512 for the resulting program to run. If you don't have a server CPU, there is a large chance that it doesn't support AVX512.
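The two conditions mentioned here, the build enabling the features and the CPU actually having them, can be checked separately. The following is a small sketch with made-up function names: cfg! reports what the compiler was told to enable, and is_x86_feature_detected! asks the CPU at run time. Depending on the toolchain version, the compile-time check may report false for AVX-512 even when the flags are set, so treat it as a hint rather than proof.

```rust
/// True if this binary was *compiled* with both features enabled,
/// e.g. via RUSTFLAGS=-Ctarget-feature=+avx512f,+avx512vl.
fn compiled_with_avx512_rotate() -> bool {
    cfg!(all(target_feature = "avx512f", target_feature = "avx512vl"))
}

/// True if the CPU running this binary reports both features.
#[cfg(target_arch = "x86_64")]
fn cpu_has_avx512_rotate() -> bool {
    is_x86_feature_detected!("avx512f") && is_x86_feature_detected!("avx512vl")
}

/// Non-x86_64 targets never have these features.
#[cfg(not(target_arch = "x86_64"))]
fn cpu_has_avx512_rotate() -> bool {
    false
}
```

Printing both values at the start of a benchmark run makes it obvious whether the build and the machine agree. A build with the features enabled should only ever be run where cpu_has_avx512_rotate() is true.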

Ezio · July 7, 2022, 9:14am

Fortunately, my server supports AVX512. It is indeed faster than before.
Time:
before 2400ms
after 2100ms

However, if I don't use _mm_rol_epi32, it only takes 1700ms. There is still a huge gap.

system · October 5, 2022, 9:14am

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.
