I'm trying to use SIMD to optimize SM3. The algorithm needs P1(X) = X ⊕ (X <<< 15) ⊕ (X <<< 23). At first I used rotate_left to compute X <<< 15, but it took too much time. Then I replaced rotate_left with _mm_rol_epi32 so that four u32 values are calculated at the same time.
However, this takes even longer than before.
So I'm really confused, and I hope someone can give me some advice.
Can you show the actual code? That way we don't have to guess what you did. Also did you enable the avx512f and avx512vl target features? If not these functions won't be inlined (LLVM only inlines if the callee doesn't have any target features enabled that the caller doesn't).
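As an alternative to setting the features globally, a feature can be enabled on a single function and selected at runtime. Below is a minimal sketch of that pattern, not the poster's code: it uses AVX2 as a stand-in (AVX2 has no rotate instruction, so the rotate is emulated with two shifts and an OR), and the names p1_scalar, p1_x8_avx2 and p1_x8 are made up for illustration. The same attribute with "avx512f,avx512vl" should apply to a function that calls _mm_rol_epi32, assuming a toolchain and CPU that support it.

```rust
// Scalar reference for SM3's P1(x) = x ^ (x <<< 15) ^ (x <<< 23).
pub fn p1_scalar(x: u32) -> u32 {
    x ^ x.rotate_left(15) ^ x.rotate_left(23)
}

#[cfg(target_arch = "x86_64")]
mod simd {
    use std::arch::x86_64::*;

    // The attribute enables AVX2 for this function only, so the intrinsics
    // can be inlined here even if the rest of the crate is built without it.
    #[target_feature(enable = "avx2")]
    #[allow(unused_unsafe)]
    pub unsafe fn p1_x8_avx2(words: &mut [u32; 8]) {
        unsafe {
            let x = _mm256_loadu_si256(words.as_ptr() as *const __m256i);
            // Emulated rotate left: (x << n) | (x >> (32 - n)).
            let r15 = _mm256_or_si256(_mm256_slli_epi32::<15>(x), _mm256_srli_epi32::<17>(x));
            let r23 = _mm256_or_si256(_mm256_slli_epi32::<23>(x), _mm256_srli_epi32::<9>(x));
            let out = _mm256_xor_si256(_mm256_xor_si256(x, r15), r23);
            _mm256_storeu_si256(words.as_mut_ptr() as *mut __m256i, out);
        }
    }
}

// Applies P1 to eight words, using AVX2 when the CPU reports it.
pub fn p1_x8(words: &mut [u32; 8]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was checked at runtime just above.
            unsafe { simd::p1_x8_avx2(words) };
            return;
        }
    }
    // Portable fallback.
    for w in words.iter_mut() {
        *w = p1_scalar(*w);
    }
}
```

Calling p1_x8(&mut block) on a [u32; 8] should give the same result as mapping p1_scalar over it, whichever path is taken.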
use std::arch::x86_64::*;
// Computes four words of the SM3 message expansion at once:
// W[j] = P1(W[j-16] ^ W[j-9] ^ (W[j-3] <<< 15)) ^ (W[j-13] <<< 7) ^ W[j-6]
// The array is taken as &mut because the result is stored back into it.
pub fn expand_four_w(w_origin: &mut [u32; 20], w0: usize, w7: usize, w13: usize, w3: usize, w10: usize) {
    unsafe {
        let w_origin_ptr = w_origin.as_mut_ptr();
        let w_first_quarter = _mm_loadu_si128(w_origin_ptr.add(w0) as *const __m128i);
        let w_second_quarter = _mm_loadu_si128(w_origin_ptr.add(w7) as *const __m128i);
        let w_third_quarter = _mm_loadu_si128(w_origin_ptr.add(w13) as *const __m128i);
        let w_third_quarter = _mm_rol_epi32::<15>(w_third_quarter);
        // Argument of P1
        let p1_tmp = _mm_xor_si128(w_first_quarter, w_second_quarter);
        let p1 = _mm_xor_si128(p1_tmp, w_third_quarter);
        // P1(X) = X ^ (X <<< 15) ^ (X <<< 23)
        let p1_second = _mm_rol_epi32::<15>(p1);
        let p1_third = _mm_rol_epi32::<23>(p1);
        let p1_res_tmp = _mm_xor_si128(p1, p1_second);
        let p1_res = _mm_xor_si128(p1_res_tmp, p1_third);
        let w_fourth_quarter = _mm_loadu_si128(w_origin_ptr.add(w3) as *const __m128i);
        let w_fourth_quarter = _mm_rol_epi32::<7>(w_fourth_quarter);
        let w_result_tmp = _mm_xor_si128(p1_res, w_fourth_quarter);
        let w_fifth_quarter = _mm_loadu_si128(w_origin_ptr.add(w10) as *const __m128i);
        let w_result = _mm_xor_si128(w_result_tmp, w_fifth_quarter);
        _mm_storeu_si128(w_origin_ptr.add(w0) as *mut __m128i, w_result);
        if w0 == 0 {
            // Also store a copy of the result at index 16
            _mm_storeu_si128(w_origin_ptr.add(16) as *mut __m128i, w_result);
        }
    }
}
And I run it with: RUSTFLAGS=-Ctarget-feature=+avx2 cargo test --release
It needs to be RUSTFLAGS=-Ctarget-feature=+avx512f,+avx512vl. Also note that your CPU must support AVX512 for the resulting program to run. If you don't have a server CPU, there is a good chance that it doesn't support AVX512.
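To find out whether a given machine supports these two features, the standard library's runtime detection macro can be used. The following is a rough sketch; the function names has_avx512_rol and rotate_backend are hypothetical. Note that a check like this is only useful when the AVX-512 code sits behind a per-function #[target_feature] attribute: a binary built with the features enabled globally through RUSTFLAGS may use AVX-512 instructions anywhere, including before the check runs.

```rust
// True when the CPU reports both features that _mm_rol_epi32 requires.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn has_avx512_rol() -> bool {
    is_x86_feature_detected!("avx512f") && is_x86_feature_detected!("avx512vl")
}

// On other architectures the intrinsic does not exist at all.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn has_avx512_rol() -> bool {
    false
}

// Names the code path a dispatcher would pick on this machine.
pub fn rotate_backend() -> &'static str {
    if has_avx512_rol() {
        "avx512"
    } else {
        "fallback"
    }
}
```

Printing rotate_backend() from a small test binary on the target machine shows which path would be taken there.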