Why did my code become slower after applying SIMD instructions?

Ezio · July 14, 2022, 7:48am

Before I tried to optimize it, the code was:

fn round_one_phase1(ra: &mut u32, rb: &mut u32, rc: &mut u32, rd: &mut u32,
                    re: u32, rf: &mut u32, rg: &mut u32, rh: &mut u32,
                    j: usize, w1: &mut u32, w2: &mut u32){
    let mut tt2 = ra.rotate_left(12);
    let mut tt1 = tt2.wrapping_add(re).wrapping_add(T[j]);

    tt1 = tt1.rotate_left(7);
    tt2 = tt2 ^ tt1;

    *rd = ff1(*ra, *rb, *rc)
        .wrapping_add(*rd)
        .wrapping_add(tt2)
        .wrapping_add(*w2);
    *rh = gg1(re, *rf, *rg)
        .wrapping_add(*rh)
        .wrapping_add(tt1)
        .wrapping_add(*w1);
    *rb = rb.rotate_left(9);
    *rf = rf.rotate_left(19);
    *rh = p0(*rh);
}

I wanted to reduce the number of additions, so I used SIMD instructions. Here is my code:

use std::arch::x86_64::*; // SSE intrinsics used below (x86_64 only)

fn round_one_phase1(ra: &mut u32, rb: &mut u32, rc: &mut u32, rd: &mut u32,
                    re: u32, rf: &mut u32, rg: &mut u32, rh: &mut u32,
                    j: usize, w1: &mut u32, w2: &mut u32){
    let mut tt2 = ra.rotate_left(12);
    let mut tt1 = tt2.wrapping_add(re).wrapping_add(T[j]);

    tt1 = tt1.rotate_left(7);
    tt2 = tt2 ^ tt1;

    unsafe {
        let rd_rh_part1= _mm_setr_epi32(ff1(*ra, *rb, *rc) as i32, gg1(re, *rf, *rg) as i32, tt2 as i32, tt1 as i32);
        let rd_rh_part2 = _mm_setr_epi32(*rd as i32, *rh as i32, *w2 as i32, *w1 as i32);
        
        let rd_rh_res_part1 = _mm_add_epi32(rd_rh_part1, rd_rh_part2);
        
        let rd_rh_res_add1 = _mm_setr_epi32(_mm_extract_epi32::<0>(rd_rh_res_part1), _mm_extract_epi32::<1>(rd_rh_res_part1), 0, 0);
        let rd_rh_res_add2 = _mm_setr_epi32(_mm_extract_epi32::<2>(rd_rh_res_part1), _mm_extract_epi32::<3>(rd_rh_res_part1), 0, 0);
        
        let rd_rh_res =_mm_add_epi32(rd_rh_res_add1, rd_rh_res_add2);
        
        *rd = _mm_extract_epi32::<0>(rd_rh_res) as u32;
        *rh = _mm_extract_epi32::<1>(rd_rh_res) as u32;
    }
    *rb = rb.rotate_left(9);
    *rf = rf.rotate_left(19);
    *rh = p0(*rh);
}

Then I ran the code with the command below.

RUSTFLAGS=-Ctarget-feature=+avx2,+sse2,+sse4.1 cargo test --release sm3_time_test -p ylong_sm3 -- --nocapture

Before I changed the code, the time was 1543.606.
After I changed the code, the time was 1860.

Why does the SIMD version take longer? I'm very confused.

Can anyone give me some advice?
Thanks!!!

Michael-F-Bryan · July 14, 2022, 8:14am

Out of curiosity, does LLVM automatically generate SIMD instructions for the first version? The compiler might have actually applied these optimisations already and your hand-rolled SIMD isn't as efficient.

Also, if you can create a version of the code that runs on the playground or godbolt then people will be able to investigate for themselves. Your snippet doesn't include T and ff1, so all I can do is look at the code and make random guesses about performance.
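
A self-contained version suitable for the playground or godbolt might look roughly like the sketch below. The definitions of T, ff1, gg1 and p0 are hypothetical stand-ins (the question does not show them), loosely modelled on the SM3 round functions, and run_rounds is an invented driver that exists only to give the optimizer something to chew on. Treat the numbers it prints as a rough indication, not a proper benchmark.

```rust
use std::hint::black_box;
use std::time::Instant;

// ASSUMPTION: hypothetical stand-ins; the original post does not show
// `T`, `ff1`, `gg1`, or `p0`, so these are guesses for illustration only.
const T: [u32; 16] = [0x79cc_4519; 16];

#[inline]
fn ff1(x: u32, y: u32, z: u32) -> u32 {
    x ^ y ^ z
}

#[inline]
fn gg1(x: u32, y: u32, z: u32) -> u32 {
    x ^ y ^ z
}

#[inline]
fn p0(x: u32) -> u32 {
    x ^ x.rotate_left(9) ^ x.rotate_left(17)
}

// Scalar round, same shape as the first version in the question.
fn round_one_phase1(ra: &mut u32, rb: &mut u32, rc: &mut u32, rd: &mut u32,
                    re: u32, rf: &mut u32, rg: &mut u32, rh: &mut u32,
                    j: usize, w1: &mut u32, w2: &mut u32) {
    let mut tt2 = ra.rotate_left(12);
    let mut tt1 = tt2.wrapping_add(re).wrapping_add(T[j]);
    tt1 = tt1.rotate_left(7);
    tt2 ^= tt1;

    *rd = ff1(*ra, *rb, *rc).wrapping_add(*rd).wrapping_add(tt2).wrapping_add(*w2);
    *rh = gg1(re, *rf, *rg).wrapping_add(*rh).wrapping_add(tt1).wrapping_add(*w1);
    *rb = rb.rotate_left(9);
    *rf = rf.rotate_left(19);
    *rh = p0(*rh);
}

// Hypothetical driver: runs the round repeatedly and returns a checksum
// so the work cannot be optimized away.
fn run_rounds(iterations: usize) -> u32 {
    let (mut a, mut b, mut c, mut d) = (1u32, 2u32, 3u32, 4u32);
    let (mut e, mut f, mut g, mut h) = (5u32, 6u32, 7u32, 8u32);
    let (mut w1, mut w2) = (9u32, 10u32);
    for i in 0..iterations {
        round_one_phase1(&mut a, &mut b, &mut c, &mut d, e,
                         &mut f, &mut g, &mut h, i % 16, &mut w1, &mut w2);
        // Feed the outputs back in so each iteration depends on the last.
        a = a.wrapping_add(d);
        e = e.wrapping_add(h);
    }
    a ^ b ^ c ^ d ^ e ^ f ^ g ^ h
}

fn main() {
    let start = Instant::now();
    let checksum = black_box(run_rounds(black_box(10_000_000)));
    println!("checksum = {checksum:#010x}, elapsed = {:?}", start.elapsed());
}
```

Pasting something like this into godbolt with optimizations enabled lets anyone read the generated assembly and see whether the scalar round already compiles down to vector instructions.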

SkiFire13 · July 14, 2022, 10:04am

You're actually doing very few things in parallel here (only two adds) while paying the cost of loading the values into the SIMD registers. The tradeoff is simply not worth it.
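
For contrast, here is a minimal sketch of the kind of workload where explicit SIMD tends to pay for itself: the data already sits in contiguous memory, so each vector is filled by a single load rather than assembled from four scalars, and the same operation is applied to many elements. The function names add_scalar and add_simd are made up for this illustration and are not part of the code in the question. The sketch assumes x86_64, where SSE2 is part of the baseline, and falls back to the scalar loop elsewhere.

```rust
// Element-wise wrapping addition of two equally sized slices (scalar reference).
fn add_scalar(a: &[u32], b: &[u32], out: &mut [u32]) {
    assert!(a.len() == b.len() && a.len() == out.len());
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x.wrapping_add(*y);
    }
}

// Same operation, four lanes at a time with SSE2.
#[cfg(target_arch = "x86_64")]
fn add_simd(a: &[u32], b: &[u32], out: &mut [u32]) {
    use std::arch::x86_64::{__m128i, _mm_add_epi32, _mm_loadu_si128, _mm_storeu_si128};

    assert!(a.len() == b.len() && a.len() == out.len());
    let chunks = a.len() / 4;
    for i in 0..chunks {
        // SAFETY: SSE2 is always available on x86_64, and every unaligned
        // load/store below touches exactly four in-bounds u32 values.
        unsafe {
            let va = _mm_loadu_si128(a.as_ptr().add(i * 4) as *const __m128i);
            let vb = _mm_loadu_si128(b.as_ptr().add(i * 4) as *const __m128i);
            let sum = _mm_add_epi32(va, vb); // lane-wise add, wraps on overflow
            _mm_storeu_si128(out.as_mut_ptr().add(i * 4) as *mut __m128i, sum);
        }
    }
    // Handle the leftover elements (fewer than four) with the scalar loop.
    let tail = chunks * 4;
    add_scalar(&a[tail..], &b[tail..], &mut out[tail..]);
}

// Fallback for other architectures so the sketch still compiles.
#[cfg(not(target_arch = "x86_64"))]
fn add_simd(a: &[u32], b: &[u32], out: &mut [u32]) {
    add_scalar(a, b, out);
}

fn main() {
    let a: Vec<u32> = (0..10).collect();
    let b = vec![100u32; 10];
    let mut out = vec![0u32; 10];
    add_simd(&a, &b, &mut out);
    println!("{out:?}");
}
```

Even here the hand-written version may not win: in release builds the compiler may well auto-vectorize the scalar loop on its own, so it is worth measuring both before keeping the unsafe code.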

