Can the compiler infer SSE instructions?

fpgaminer · May 19, 2021, 12:22am

Will Rust/LLVM ever emit integer SSE instructions automatically?

Suppose I have the following function:

pub fn salsa20_8(state: &mut [u32; 16]) {
    let mut x = *state;

    for _ in 0..4 {
        // column rounds
        quarter_round(0, 4, 8, 12, &mut x);
        quarter_round(5, 9, 13, 1, &mut x);
        quarter_round(10, 14, 2, 6, &mut x);
        quarter_round(15, 3, 7, 11, &mut x);

        // diagonal rounds
        quarter_round(0, 1, 2, 3, &mut x);
        quarter_round(5, 6, 7, 4, &mut x);
        quarter_round(10, 11, 8, 9, &mut x);
        quarter_round(15, 12, 13, 14, &mut x);
    }

    for (s1, s0) in state.iter_mut().zip(&x) {
        *s1 = s1.wrapping_add(*s0);
    }
}


#[inline]
fn quarter_round(
    a: usize,
    b: usize,
    c: usize,
    d: usize,
    state: &mut [u32; 16],
) {
    let mut t: u32;

    t = state[a].wrapping_add(state[d]);
    state[b] ^= t.rotate_left(7) as u32;

    t = state[b].wrapping_add(state[a]);
    state[c] ^= t.rotate_left(9) as u32;

    t = state[c].wrapping_add(state[b]);
    state[d] ^= t.rotate_left(13) as u32;

    t = state[d].wrapping_add(state[c]);
    state[a] ^= t.rotate_left(18) as u32;
}

This function runs reasonably fast (~46ns/iter on my machine), but it can go faster:

pub fn salsa20_8_sse(state: &mut [__m128i; 4]) {
    let mut x0 = state[0];
    let mut x1 = state[1];
    let mut x2 = state[2];
    let mut x3 = state[3];
    let mut t: __m128i;

    unsafe {
        for _ in 0..4 {
            // Operate on "columns"
            t = _mm_add_epi32(x0, x3);
            x1 = _mm_xor_si128(x1, _mm_slli_epi32(t, 7));
            x1 = _mm_xor_si128(x1, _mm_srli_epi32(t, 25));
            t = _mm_add_epi32(x1, x0);
            x2 = _mm_xor_si128(x2, _mm_slli_epi32(t, 9));
            x2 = _mm_xor_si128(x2, _mm_srli_epi32(t, 23));
            t = _mm_add_epi32(x2, x1);
            x3 = _mm_xor_si128(x3, _mm_slli_epi32(t, 13));
            x3 = _mm_xor_si128(x3, _mm_srli_epi32(t, 19));
            t = _mm_add_epi32(x3, x2);
            x0 = _mm_xor_si128(x0, _mm_slli_epi32(t, 18));
            x0 = _mm_xor_si128(x0, _mm_srli_epi32(t, 14));

            // Rearrange data.
            x1 = _mm_shuffle_epi32(x1, 0x93);  // 10_01_00_11
            x2 = _mm_shuffle_epi32(x2, 0x4E);  // 01_00_11_10
            x3 = _mm_shuffle_epi32(x3, 0x39);  // 00_11_10_01

            // Operate on "rows".
            t = _mm_add_epi32(x0, x1);
            x3 = _mm_xor_si128(x3, _mm_slli_epi32(t, 7));
            x3 = _mm_xor_si128(x3, _mm_srli_epi32(t, 25));
            t = _mm_add_epi32(x3, x0);
            x2 = _mm_xor_si128(x2, _mm_slli_epi32(t, 9));
            x2 = _mm_xor_si128(x2, _mm_srli_epi32(t, 23));
            t = _mm_add_epi32(x2, x3);
            x1 = _mm_xor_si128(x1, _mm_slli_epi32(t, 13));
            x1 = _mm_xor_si128(x1, _mm_srli_epi32(t, 19));
            t = _mm_add_epi32(x1, x2);
            x0 = _mm_xor_si128(x0, _mm_slli_epi32(t, 18));
            x0 = _mm_xor_si128(x0, _mm_srli_epi32(t, 14));

            // Rearrange data.
            x1 = _mm_shuffle_epi32(x1, 0x39);  // 00_11_10_01
            x2 = _mm_shuffle_epi32(x2, 0x4E);  // 01_00_11_10
            x3 = _mm_shuffle_epi32(x3, 0x93);  // 10_01_00_11
        }

        state[0] = _mm_add_epi32(state[0], x0);
        state[1] = _mm_add_epi32(state[1], x1);
        state[2] = _mm_add_epi32(state[2], x2);
        state[3] = _mm_add_epi32(state[3], x3);
    }
}

Using the SSE version gets the time down to 33ns/iter on my machine.

Now, I'd prefer not to use the SSE version directly. I'd rather write the code straightforwardly and expect the compiler to do its magic in optimizing it the best way possible. That way the code is both readable and future proof.

I know the compiler will vectorize some floating point operations automatically, allowing one to take advantage of SIMD instructions without mucking with them directly.

But I haven't seen the compiler emit many integer SIMD instructions (besides a few for XOR here and there). Does Rust/LLVM just not do that, or is there some way to contort the code to get it to recognize that SSE instructions can be used?

Details:

The trick of the SSE version is that you can perform the group of four
quarter_round calls in parallel since there's no data dependency between them. i.e. it performs four t = state[a].wrapping_add(state[d]); at once, then four t.rotate_left(7) at once, then four state[b] ^= t at once. Rinse and repeat. To accomplish this with SSE it does expect the input to be permuted relative to the reference implementation, so that the needed values end up next to each other and can be packed into __m128i easily. That's not an issue in practice.

I've tried re-writing the reference implementation such that it expects the permuted input and structures its operations like the SSE version, with a set of four additions, followed by four rotates, etc. But I didn't see Rust output any SIMD instructions.

I test on godbolt with rust 1.52.0 and -C opt-level=3 -C target-cpu=native

fpgaminer · May 19, 2021, 3:28am

Update:

I've made some progress, but still running into a road block. Here's the current work:

pub fn salsa20_8_permuted3(state: &mut [u32; 16]) {
    fn thingie(dst: &mut [u32], a: &[u32], b: &[u32], n: u32) {
        for i in 0..4 {
            dst[i] ^= a[i].wrapping_add(b[i]).rotate_left(n);
        }
    }

    let mut x0 = [state[0], state[1], state[2], state[3]];
    let mut x1 = [state[4], state[5], state[6], state[7]];
    let mut x2 = [state[8], state[9], state[10], state[11]];
    let mut x3 = [state[12], state[13], state[14], state[15]];

    thingie(&mut x1, &x0, &x3, 7);
    thingie(&mut x2, &x1, &x0, 9);
    thingie(&mut x3, &x2, &x1, 13);
    thingie(&mut x0, &x3, &x2, 18);

    x1 = [x1[3], x1[0], x1[1], x1[2]];
    x2 = [x2[2], x2[3], x2[0], x2[1]];
    x3 = [x3[1], x3[2], x3[3], x3[0]];

    thingie(&mut x3, &x0, &x1, 7);
    thingie(&mut x2, &x3, &x0, 9);
    thingie(&mut x1, &x2, &x3, 13);
    //thingie(&mut x0, &x1, &x2, 18);

    x1 = [x1[1], x1[2], x1[3], x1[0]];
    x2 = [x2[2], x2[3], x2[0], x2[1]];
    x3 = [x3[3], x3[0], x3[1], x3[2]];

    state[0] = state[0].wrapping_add(x0[0]);
    state[1] = state[1].wrapping_add(x0[1]);
    state[2] = state[2].wrapping_add(x0[2]);
    state[3] = state[3].wrapping_add(x0[3]);
    state[4] = state[4].wrapping_add(x1[0]);
    state[5] = state[5].wrapping_add(x1[1]);
    state[6] = state[6].wrapping_add(x1[2]);
    state[7] = state[7].wrapping_add(x1[3]);
    state[8] = state[8].wrapping_add(x2[0]);
    state[9] = state[9].wrapping_add(x2[1]);
    state[10] = state[10].wrapping_add(x2[2]);
    state[11] = state[11].wrapping_add(x2[3]);
    state[12] = state[12].wrapping_add(x3[0]);
    state[13] = state[13].wrapping_add(x3[1]);
    state[14] = state[14].wrapping_add(x3[2]);
    state[15] = state[15].wrapping_add(x3[3]);
}

Play with it in godbolt:

With that one line commented it emits AVX instructions in a nice, tight function. Perfect. But if that line is uncommented, it suddenly explodes into the usual mass of scalar instructions. Very strange.

Anyone have ideas as to what's going on here? Is the compiler running into some kind of "if the code is bigger than LIMIT then don't use AVX"?

kornel · May 19, 2021, 9:20am

It looks like some limitation of LLVM optimizer. Splitting the code into two smaller never-inline functions optimizes each function fine.

fpgaminer · May 20, 2021, 6:14am

Thank you. Yeah, seems like some optimizer is giving up too early.

I took your idea and coded up a complete example to plug into my benchmark. Here's the complete code for that variation on godbolt: Compiler Explorer

Unfortunately it runs slower than the straightforward implementation by a few ns.

At this point should I try raising this code up to the Rust compiler team to see if they're interested in exploring where LLVM is tripping up? Or is this unlikely to be interesting?

kornel · May 20, 2021, 9:00am

It's most likely an LLVM bug/limitation, not a Rust issue. It would be great if you could reproduce it using C code or LLVM IR, so that it can be reported to the LLVM project.

fpgaminer · May 20, 2021, 10:39pm

Good call on the C code suggestion. I was going to make a bug report and re-wrote in C, only to discover something useful. The bug reproduces under x86-64 clang 12.0.0, but not on x86-64 clang (trunk). So seems like it's already fixed.

Now just gotta wait for that to make its way into an LLVM release, and then wait for Rust to upgrade. Probably gonna be a hot minute, but *shrug*.

For reference, here's the C code on godbolt:

system · August 18, 2021, 10:39pm

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.

Topic		Replies	Views
Allow for instruction pipelining in Rust help	2	275	March 6, 2024
Question about SSE/RVd/RVf/RVq... and related processor extensions regarding floating-point conversions help	3	450	September 5, 2021
Is there any similar SIMD function like wrapping_add in Rust? help	3	449	October 11, 2022
Where can I find the complete implementation for f32::exp? help	8	810	May 25, 2020
XORing 128-bit Blocks - Function with nearly identical asm takes 5x as long help	3	692	January 14, 2023

Can the compiler infer SSE instructions?

Related topics