Well, all these one-byte-at-a-time moves nerd-sniped me.
As usual, LLVM is really bad at merging unaligned loads and stores, so even though this is all fully-unrolled and such, I think we can convince it to do better.
Playground test that all the things I mention here do the same thing: https://play.rust-lang.org/?version=nightly&mode=release&edition=2021&gist=89af0bfed2ad4af72871e16e662c7401
Godbolt demo of the same things, for looking at assembly: https://rust.godbolt.org/z/da7jeqfv4
First, I was curious whether just making it as obvious as possible would make it easier on LLVM:
```rust
pub fn convert_raw(bgra: &[u8; 16]) -> [u8; 12] {
    let [
        b0, g0, r0, _a0,
        b1, g1, r1, _a1,
        b2, g2, r2, _a2,
        b3, g3, r3, _a3,
    ] = *bgra;
    [
        r0, g0, b0,
        r1, g1, b1,
        r2, g2, b2,
        r3, g3, b3,
    ]
}
```
It didn't -- it produced the same assembly as quinedot's zip-chunks-exact version -- but I like the clarity of this one, and it made a great pinning test against which to check all the rest.
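Checking the later versions against it is just something like this (a sketch of what the playground link does, with an arbitrary input of mine):

```rust
#[test]
fn all_versions_agree() {
    // Any input works; this just needs distinct bytes to catch reorderings.
    let bgra = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 11, 10, 13, 12, 15, 14];
    let expected = convert_raw(&bgra);
    assert_eq!(convert_via_shifting(&bgra), expected);
    assert_eq!(convert_via_overwriting(&bgra), expected);
}
```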
With `i8x16` in `std::simd` available in nightly now, I figured I'd try that too, since that should definitely load up the whole thing at once. My first `from_array`+`scatter` attempt was horrifying, even with a newer target-cpu: https://rust.godbolt.org/z/8397zvT4z. The swizzle version is nice and tidy if you target something with SSSE3, but on the default x64 target it's still super-ugly: https://rust.godbolt.org/z/esbjxzzzY
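For reference, the swizzle one is roughly this shape (my reconstruction, not copied from the godbolt link, so details may differ; the last four indices are don't-cares):

```rust
#![feature(portable_simd)]
use std::simd::{simd_swizzle, u8x16};

pub fn convert_via_swizzle(bgra: &[u8; 16]) -> [u8; 12] {
    let v = u8x16::from_array(*bgra);
    // Reverse each BGR triple and pack the triples to the front; the last
    // four lanes are never read, so their indices don't matter.
    let shuffled = simd_swizzle!(v, [2, 1, 0, 6, 5, 4, 10, 9, 8, 14, 13, 12, 3, 7, 11, 15]);
    shuffled.to_array().as_chunks().0[0]
}
```

It tidies up on SSSE3 presumably because `pshufb` can do that whole byte shuffle in a single instruction.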
In looking at that, I noticed that at the LLVM level, rustc is (currently) using `i96` as the ABI return type for the `[u8; 12]`. So that inspired me to phrase this as integer operations:
```rust
pub fn convert_via_shifting(bgra: &[u8; 16]) -> [u8; 12] {
    let bgra = bgra.as_chunks().0;
    let mut buffer = 0_u128;
    for bgra in bgra.iter().cloned().rev() {
        buffer <<= 24;
        // A BE load of [b, g, r, a] gives 0xBBGGRRAA; dropping the low
        // byte leaves 0x00BBGGRR, with `r` as the least-significant byte.
        buffer |= u128::from(u32::from_be_bytes(bgra) >> 8);
    }
    // LE byte order then emits pixel 0's `r` first.
    buffer.to_le_bytes().as_chunks().0[0]
}
```
That comes out surprisingly well, including things like LLVM using `shld` for the double-precision shift in the one place that actually needs it after the unrolling.
But as nice as that is, the shifts and ors still felt a bit unnecessary to me. Here's the version that produces the fewest assembly instructions of all of them, for the default x64 target:
```rust
pub fn convert_via_overwriting(bgra: &[u8; 16]) -> [u8; 12] {
    let bgra = bgra.as_chunks().0;
    // One spare byte at the front for the final alpha to land in.
    let mut padded_rgb = [0; 13];
    for i in (0..4).rev() {
        let pixel = u32::from_be_bytes(bgra[i]);
        // Each store writes [a, r, g, b] (LE bytes of 0xBBGGRRAA); going in
        // reverse order, each store's final `b` lands on top of the `a`
        // left behind by the previous iteration.
        padded_rgb[i*3..][..4].copy_from_slice(&pixel.to_le_bytes());
    }
    padded_rgb[1..].as_chunks().0[0]
}
```
Basically, this one copies the alpha around too -- since it's actually easier to copy 4 bytes than 3 -- in just the right order so that the alpha never makes it into the output.
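Tracing the writes shows how that works out (indices into `padded_rgb`; each row is one iteration of the reversed loop):

```
index:        0   1   2   3   4   5   6   7   8   9  10  11  12
after i = 3:  .   .   .   .   .   .   .   .   .  a3  r3  g3  b3
after i = 2:  .   .   .   .   .   .  a2  r2  g2  b2  r3  g3  b3
after i = 1:  .   .   .  a1  r1  g1  b1  r2  g2  b2  r3  g3  b3
after i = 0: a0  r0  g0  b0  r1  g1  b1  r2  g2  b2  r3  g3  b3
```

Only `a0` survives, and `padded_rgb[1..]` drops it. The resulting assembly: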
```asm
example::convert_via_overwriting:
        sub     rsp, 16
        mov     eax, dword ptr [rdi + 12]
        bswap   eax
        mov     dword ptr [rsp + 9], eax
        mov     eax, dword ptr [rdi + 8]
        bswap   eax
        mov     dword ptr [rsp + 6], eax
        mov     eax, dword ptr [rdi + 4]
        bswap   eax
        mov     dword ptr [rsp + 3], eax
        mov     eax, dword ptr [rdi]
        bswap   eax
        mov     dword ptr [rsp], eax
        mov     edx, dword ptr [rsp + 9]
        mov     rax, qword ptr [rsp + 1]
        add     rsp, 16
        ret
```
Left as an exercise for the reader: benchmark all of this with Criterion.rs to see whether there's any speed difference at all between these.
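If someone wants a starting point, a minimal Criterion skeleton would be something like this (assuming the converters above are in scope; the bench names are mine):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_conversions(c: &mut Criterion) {
    // black_box keeps the input opaque so LLVM can't const-fold the calls.
    let bgra = black_box([1u8, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]);
    c.bench_function("raw", |b| b.iter(|| convert_raw(&bgra)));
    c.bench_function("shifting", |b| b.iter(|| convert_via_shifting(&bgra)));
    c.bench_function("overwriting", |b| b.iter(|| convert_via_overwriting(&bgra)));
}

criterion_group!(benches, bench_conversions);
criterion_main!(benches);
```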