While it's not directly applicable -- you don't have the 4-bytes down to 3-bytes problem -- you might be interested in all the various approaches discussed in Converting a BGRA &[u8] to RGB [u8;N] (for images)? - #12 by scottmcm
Also, I wonder if you might consider doing it in-place, to avoid the allocation.
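For reference, an out-of-place version that allocates a fresh buffer might look something like this (a hypothetical sketch of what's being avoided; `demo_alloc` is my name, not something from the thread):

```rust
// Allocates a new Vec for the output instead of reusing the input buffer.
pub fn demo_alloc(pixels: &[[u8; 4]]) -> Vec<[u8; 4]> {
    pixels.iter().map(|&[b, g, r, a]| [r, g, b, a]).collect()
}
```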
Sadly, the straightforward in-place version doesn't do great, as the loop doesn't get unrolled at all:
```rust
pub fn demo(pixels: &mut [[u8; 4]]) {
    for p in pixels {
        // Destructure the BGRA pixel and rebuild it as RGBA.
        let [b, g, r, a] = *p;
        *p = [r, g, b, a];
    }
}
```
But, because optimization is non-intuitive, this works great:
```rust
pub fn demo2(pixels: &mut [[u8; 4]]) {
    for p in pixels {
        // Reinterpret the 4 bytes as a big-endian u32: 0xBBGGRRAA.
        let bgra = u32::from_be_bytes(*p);
        // Reverse the bytes: 0xAARRGGBB.
        let argb = bgra.swap_bytes();
        // Rotate left by one byte: 0xRRGGBBAA.
        let rgba = argb.rotate_left(8);
        *p = rgba.to_be_bytes();
    }
}
```
With AVX2 enabled, the core loop compiles down to a bunch of smart 256-bit-wide shuffles (https://rust.godbolt.org/z/f7s1rKxEY):
```asm
.LBB1_8:
        vmovdqu ymm1, ymmword ptr [rdi + 4*r8]
        vmovdqu ymm2, ymmword ptr [rdi + 4*r8 + 32]
        vmovdqu ymm3, ymmword ptr [rdi + 4*r8 + 64]
        vmovdqu ymm4, ymmword ptr [rdi + 4*r8 + 96]
        vpshufb ymm1, ymm1, ymm0
        vpshufb ymm2, ymm2, ymm0
        vpshufb ymm3, ymm3, ymm0
        vpshufb ymm4, ymm4, ymm0
        vmovdqu ymmword ptr [rdi + 4*r8], ymm1
        vmovdqu ymmword ptr [rdi + 4*r8 + 32], ymm2
        vmovdqu ymmword ptr [rdi + 4*r8 + 64], ymm3
        vmovdqu ymmword ptr [rdi + 4*r8 + 96], ymm4
        add r8, 32
        cmp rdx, r8
        jne .LBB1_8
```
Each iteration of that loop handles 128 bytes (32 pixels): four 32-byte loads, four `vpshufb` byte shuffles against the constant mask in `ymm0`, and four stores. Ironically, that's even better than manually using SIMD, since this
```rust
// Requires nightly: #![feature(portable_simd)] at the crate root.
#[no_mangle]
pub fn demo3(pixels: &mut [[u8; 4]]) {
    use std::simd::*;
    for p in pixels {
        let bgra = u8x4::from_array(*p);
        // Shuffle lanes [B, G, R, A] -> [R, G, B, A].
        let rgba = simd_swizzle!(bgra, [2, 1, 0, 3]);
        *p = rgba.to_array();
    }
}
```
doesn't unroll: it uses a SIMD shuffle, but only on 4 bytes at a time, wasting most of the vector register.
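One could widen the manual version by hand so the whole 256-bit register does useful work. Here's a sketch, not from the original post (`demo4` is my name): it assumes nightly's `std::simd` plus the `as_flattened_mut` / `as_chunks_mut` slice APIs, and shuffles 8 pixels (32 bytes) per iteration, repeating `demo3`'s `[2, 1, 0, 3]` lane pattern across the register:

```rust
pub fn demo4(pixels: &mut [[u8; 4]]) {
    use std::simd::*;
    // Flatten &mut [[u8; 4]] into &mut [u8], then split into 32-byte chunks.
    let bytes = pixels.as_flattened_mut();
    let (chunks, rest) = bytes.as_chunks_mut::<32>();
    for chunk in chunks {
        let bgra = u8x32::from_array(*chunk);
        // The [2, 1, 0, 3] pattern from demo3, repeated for all 8 pixels.
        let rgba = simd_swizzle!(bgra, [
             2,  1,  0,  3,   6,  5,  4,  7,
            10,  9,  8, 11,  14, 13, 12, 15,
            18, 17, 16, 19,  22, 21, 20, 23,
            26, 25, 24, 27,  30, 29, 28, 31,
        ]);
        *chunk = rgba.to_array();
    }
    // Any leftover bytes are still whole pixels; swap them one at a time.
    for p in rest.as_chunks_mut::<4>().0 {
        let [b, g, r, a] = *p;
        *p = [r, g, b, a];
    }
}
```

I haven't checked whether this actually beats the auto-vectorized `demo2` on godbolt, but at least the shuffle has a full register to chew on.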