Why is SIMD _mm_rol_epi32 so slow?

I tried to use SIMD to optimize SM3. The algorithm needs P1(X) = X ⊕ (X <<< 15) ⊕ (X <<< 23). At first I used rotate_left for X <<< 15, but it took too much time. Then I replaced rotate_left with _mm_rol_epi32 so that four u32 values are processed at the same time.
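
For reference, the scalar version of this step can be sketched roughly as below. It assumes the standard SM3 message-expansion formula W[j] = P1(W[j-16] ⊕ W[j-9] ⊕ (W[j-3] <<< 15)) ⊕ (W[j-13] <<< 7) ⊕ W[j-6]. The names p1 and expand_word are made up for illustration and do not come from the code in this thread.

```rust
/// SM3 permutation P1(X) = X ^ (X <<< 15) ^ (X <<< 23).
#[inline]
pub fn p1(x: u32) -> u32 {
    x ^ x.rotate_left(15) ^ x.rotate_left(23)
}

/// Computes W[j] from the words before it (hypothetical helper).
/// `w` must already hold at least `j` words, and `j` must be at least 16.
pub fn expand_word(w: &[u32], j: usize) -> u32 {
    assert!(j >= 16 && j <= w.len());
    let inner = w[j - 16] ^ w[j - 9] ^ w[j - 3].rotate_left(15);
    p1(inner) ^ w[j - 13].rotate_left(7) ^ w[j - 6]
}
```

A scalar version like this is also handy as a correctness check for any SIMD variant: feed both the same words and compare the results.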

However, this version takes longer than before.

So I'm really confused, and I hope someone can give me some advice.

Can you show the actual code? That way we don't have to guess what you did. Also, did you enable the avx512f and avx512vl target features? If not, these functions won't be inlined (LLVM only inlines if the callee doesn't have any target features enabled that the caller doesn't).

You may need to set -C target-cpu=<cpu> in RUSTFLAGS to enable SIMD instructions.

Here is my code:

use std::arch::x86_64::*;
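
// Computes four words at once and stores them at w_origin[w0..w0 + 4]:
//   P1(w[w0..] ^ w[w7..] ^ (w[w13..] <<< 15)) ^ (w[w3..] <<< 7) ^ w[w10..]
// where P1(X) = X ^ (X <<< 15) ^ (X <<< 23).
// _mm_rol_epi32 requires AVX512F + AVX512VL.
pub fn expand_four_w(w_origin: &mut [u32; 20], w0: usize, w7: usize, w13: usize, w3: usize, w10: usize) {
    // Every index starts a 4-word (128-bit) access, so it must leave room for 4 words.
    debug_assert!(w0 <= 16 && w7 <= 16 && w13 <= 16 && w3 <= 16 && w10 <= 16);
    unsafe {
        // Use a mutable pointer: writing through a pointer derived from a
        // shared reference (&[u32; 20]) is undefined behavior.
        let w_origin_ptr = w_origin.as_mut_ptr();
        let w_first_quarter = _mm_loadu_si128(w_origin_ptr.add(w0) as *const __m128i);
        let w_second_quarter = _mm_loadu_si128(w_origin_ptr.add(w7) as *const __m128i);
        let w_third_quarter = _mm_loadu_si128(w_origin_ptr.add(w13) as *const __m128i);

        let w_third_quarter = _mm_rol_epi32::<15>(w_third_quarter);

        // Argument of P1.
        let p1_tmp = _mm_xor_si128(w_first_quarter, w_second_quarter);
        let p1 = _mm_xor_si128(p1_tmp, w_third_quarter);

        // P1(X) = X ^ (X <<< 15) ^ (X <<< 23).
        let p1_second = _mm_rol_epi32::<15>(p1);
        let p1_third = _mm_rol_epi32::<23>(p1);
        let p1_res_tmp = _mm_xor_si128(p1, p1_second);
        let p1_res = _mm_xor_si128(p1_res_tmp, p1_third);

        let w_fourth_quarter = _mm_loadu_si128(w_origin_ptr.add(w3) as *const __m128i);
        let w_fourth_quarter = _mm_rol_epi32::<7>(w_fourth_quarter);

        let w_result_tmp = _mm_xor_si128(p1_res, w_fourth_quarter);
        let w_fifth_quarter = _mm_loadu_si128(w_origin_ptr.add(w10) as *const __m128i);
        let w_result = _mm_xor_si128(w_result_tmp, w_fifth_quarter);

        _mm_storeu_si128(w_origin_ptr.add(w0) as *mut __m128i, w_result);

        // Mirror the first four words into the last four slots of the buffer.
        if w0 == 0 {
            _mm_storeu_si128(w_origin_ptr.add(16) as *mut __m128i, w_result);
        }
    }
}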

I run the tests with the +avx2 target feature enabled:

RUSTFLAGS=-Ctarget-feature=+avx2 cargo test --release

Well, according to https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_rol_epi32&ig_expand=5845, AVX2 alone isn't going to help you. _mm_rol_epi32 needs "AVX512F + AVX512VL".

I tried to use RUSTFLAGS=-Ctarget-feature=AVX512F + AVX512VL cargo test --release.

The terminal returned: +: command not found

Do you mean I need to use

RUSTFLAGS=-Ctarget-feature=AVX512VL cargo test --release sm3_time_test

or

RUSTFLAGS=-Ctarget-feature=AVX512F cargo test --release ?

I have tried both of these flags, but the performance did not improve.

It needs to be RUSTFLAGS=-Ctarget-feature=+avx512f,+avx512vl. Also note that your CPU must support AVX512 for the resulting program to run. If you don't have a server CPU, there is a good chance that it doesn't support AVX512.
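
If the same binary might end up on a machine without AVX512, one option is to check for the features at run time before picking a code path. Here is a minimal sketch using the standard is_x86_feature_detected! macro. The function names and the "avx512" / "portable" labels are hypothetical and only there for illustration.

```rust
/// True when the running CPU reports both features that _mm_rol_epi32 needs.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn has_avx512_rotate() -> bool {
    is_x86_feature_detected!("avx512f") && is_x86_feature_detected!("avx512vl")
}

/// On other architectures the AVX512 path can never be used.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn has_avx512_rotate() -> bool {
    false
}

/// Picks a label for the code path that is safe to run on this CPU
/// (hypothetical labels, not part of any library).
pub fn pick_rotate_backend() -> &'static str {
    if has_avx512_rotate() {
        "avx512"
    } else {
        "portable"
    }
}
```

Note that this only tells you what the CPU supports. The intrinsics still need the target features enabled at compile time (globally through RUSTFLAGS as above, or per function) to be inlined.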

Fortunately, my server supports AVX512. It is indeed faster than before.
Time:
before 2400ms
after 2100ms

However, if I don't use _mm_rol_epi32, it only takes 1700ms. There is still a huge gap.
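
For comparison, a lane-wise rotate can also be built from plain SSE2 shifts and an OR, which needs no AVX512 at all. The sketch below applies P1 to four words that way. The macro rotl_epi32 and the function p1_x4 are hypothetical names, and whether this is faster than either version above is something only a benchmark on the actual hash loop can tell.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Rotates every 32-bit lane of `$v` left by the literal `$n` (1..=31),
/// as (v << n) | (v >> (32 - n)). Hypothetical helper macro.
#[cfg(target_arch = "x86_64")]
macro_rules! rotl_epi32 {
    ($v:expr, $n:literal) => {{
        let v = $v;
        _mm_or_si128(_mm_slli_epi32::<$n>(v), _mm_srli_epi32::<{ 32 - $n }>(v))
    }};
}

/// P1(X) = X ^ (X <<< 15) ^ (X <<< 23) on four words, using SSE2 only.
#[cfg(target_arch = "x86_64")]
pub fn p1_x4(words: [u32; 4]) -> [u32; 4] {
    // SSE2 is part of the x86_64 baseline, so no extra target feature is needed.
    unsafe {
        let x = _mm_loadu_si128(words.as_ptr() as *const __m128i);
        let r15 = rotl_epi32!(x, 15);
        let r23 = rotl_epi32!(x, 23);
        let res = _mm_xor_si128(x, _mm_xor_si128(r15, r23));
        let mut out = [0u32; 4];
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, res);
        out
    }
}

/// Portable fallback so the sketch also builds on other architectures.
#[cfg(not(target_arch = "x86_64"))]
pub fn p1_x4(words: [u32; 4]) -> [u32; 4] {
    words.map(|x| x ^ x.rotate_left(15) ^ x.rotate_left(23))
}
```

Comparing p1_x4 against the scalar rotate_left formula on a few inputs is a cheap way to confirm that the lanes are rotated as intended before timing anything.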