Will Rust/LLVM ever emit integer SSE instructions automatically?
Suppose I have the following function:
pub fn salsa20_8(state: &mut [u32; 16]) {
let mut x = *state;
for _ in 0..4 {
// column rounds
quarter_round(0, 4, 8, 12, &mut x);
quarter_round(5, 9, 13, 1, &mut x);
quarter_round(10, 14, 2, 6, &mut x);
quarter_round(15, 3, 7, 11, &mut x);
// diagonal rounds
quarter_round(0, 1, 2, 3, &mut x);
quarter_round(5, 6, 7, 4, &mut x);
quarter_round(10, 11, 8, 9, &mut x);
quarter_round(15, 12, 13, 14, &mut x);
}
for (s1, s0) in state.iter_mut().zip(&x) {
*s1 = s1.wrapping_add(*s0);
}
}
#[inline]
fn quarter_round(
a: usize,
b: usize,
c: usize,
d: usize,
state: &mut [u32; 16],
) {
let mut t: u32;
t = state[a].wrapping_add(state[d]);
state[b] ^= t.rotate_left(7) as u32;
t = state[b].wrapping_add(state[a]);
state[c] ^= t.rotate_left(9) as u32;
t = state[c].wrapping_add(state[b]);
state[d] ^= t.rotate_left(13) as u32;
t = state[d].wrapping_add(state[c]);
state[a] ^= t.rotate_left(18) as u32;
}
This function runs reasonably fast (~46ns/iter on my machine), but it can go faster:
pub fn salsa20_8_sse(state: &mut [__m128i; 4]) {
let mut x0 = state[0];
let mut x1 = state[1];
let mut x2 = state[2];
let mut x3 = state[3];
let mut t: __m128i;
unsafe {
for _ in 0..4 {
// Operate on "columns"
t = _mm_add_epi32(x0, x3);
x1 = _mm_xor_si128(x1, _mm_slli_epi32(t, 7));
x1 = _mm_xor_si128(x1, _mm_srli_epi32(t, 25));
t = _mm_add_epi32(x1, x0);
x2 = _mm_xor_si128(x2, _mm_slli_epi32(t, 9));
x2 = _mm_xor_si128(x2, _mm_srli_epi32(t, 23));
t = _mm_add_epi32(x2, x1);
x3 = _mm_xor_si128(x3, _mm_slli_epi32(t, 13));
x3 = _mm_xor_si128(x3, _mm_srli_epi32(t, 19));
t = _mm_add_epi32(x3, x2);
x0 = _mm_xor_si128(x0, _mm_slli_epi32(t, 18));
x0 = _mm_xor_si128(x0, _mm_srli_epi32(t, 14));
// Rearrange data.
x1 = _mm_shuffle_epi32(x1, 0x93); // 10_01_00_11
x2 = _mm_shuffle_epi32(x2, 0x4E); // 01_00_11_10
x3 = _mm_shuffle_epi32(x3, 0x39); // 00_11_10_01
// Operate on "rows".
t = _mm_add_epi32(x0, x1);
x3 = _mm_xor_si128(x3, _mm_slli_epi32(t, 7));
x3 = _mm_xor_si128(x3, _mm_srli_epi32(t, 25));
t = _mm_add_epi32(x3, x0);
x2 = _mm_xor_si128(x2, _mm_slli_epi32(t, 9));
x2 = _mm_xor_si128(x2, _mm_srli_epi32(t, 23));
t = _mm_add_epi32(x2, x3);
x1 = _mm_xor_si128(x1, _mm_slli_epi32(t, 13));
x1 = _mm_xor_si128(x1, _mm_srli_epi32(t, 19));
t = _mm_add_epi32(x1, x2);
x0 = _mm_xor_si128(x0, _mm_slli_epi32(t, 18));
x0 = _mm_xor_si128(x0, _mm_srli_epi32(t, 14));
// Rearrange data.
x1 = _mm_shuffle_epi32(x1, 0x39); // 00_11_10_01
x2 = _mm_shuffle_epi32(x2, 0x4E); // 01_00_11_10
x3 = _mm_shuffle_epi32(x3, 0x93); // 10_01_00_11
}
state[0] = _mm_add_epi32(state[0], x0);
state[1] = _mm_add_epi32(state[1], x1);
state[2] = _mm_add_epi32(state[2], x2);
state[3] = _mm_add_epi32(state[3], x3);
}
}
Using the SSE version gets the time down to 33ns/iter on my machine.
Now, I'd prefer not to use the SSE version directly. I'd rather write the code straightforwardly and expect the compiler to do its magic in optimizing it the best way possible. That way the code is both readable and future proof.
I know the compiler will vectorize some floating point operations automatically, allowing one to take advantage of SIMD instructions without mucking with them directly.
But I haven't seen the compiler emit many integer SIMD instructions (besides a few for XOR here and there). Does Rust/LLVM just not do that, or is there some way to contort the code to get it to recognize that SSE instructions can be used?
Details:
The trick of the SSE version is that you can perform the group of four
quarter_round
calls in parallel since there's no data dependency between them. i.e. it performs four t = state[a].wrapping_add(state[d]);
at once, then four t.rotate_left(7)
at once, then four state[b] ^= t
at once. Rinse and repeat. To accomplish this with SSE it does expect the input to be permuted relative to the reference implementation, so that the needed values end up next to each other and can be packed into __m128i easily. That's not an issue in practice.
I've tried re-writing the reference implementation such that it expects the permuted input and structures its operations like the SSE version, with a set of four additions, followed by four rotates, etc. But I didn't see Rust output any SIMD instructions.
I test on godbolt with rust 1.52.0
and -C opt-level=3 -C target-cpu=native