The fastest way to copy a buffer (BGRA to RGBA)

While it's not directly applicable (you don't have the 4-bytes-down-to-3-bytes problem), you might be interested in the various approaches discussed in "Converting a BGRA &[u8] to RGB [u8;N] (for images)?" (post #12 by scottmcm).

Also, I wonder if you might consider doing it in-place, to avoid the allocation.

Sadly, the obvious version doesn't do great, as the loop doesn't get unrolled at all:

pub fn demo(pixels: &mut[[u8; 4]]) {
    for p in pixels {
        let [b, g, r, a] = *p;
        *p = [r, g, b, a];
    }
}

But, because optimization is non-intuitive, this works great:

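pub fn demo2(pixels: &mut[[u8; 4]]) {
    for p in pixels {
        // p is [b, g, r, a], so read big-endian it is 0xBBGGRRAA
        let bgra = u32::from_be_bytes(*p);
        // reversing the bytes gives 0xAARRGGBB
        let argb = bgra.swap_bytes();
        // rotating the top byte around to the bottom gives 0xRRGGBBAA
        let rgba = argb.rotate_left(8);
        *p = rgba.to_be_bytes();
    }
}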

with a core loop on AVX2 that's a bunch of smart 256-bit-wide shuffles, handling 32 pixels per iteration (https://rust.godbolt.org/z/f7s1rKxEY):

.LBB1_8:
        vmovdqu ymm1, ymmword ptr [rdi + 4*r8]
        vmovdqu ymm2, ymmword ptr [rdi + 4*r8 + 32]
        vmovdqu ymm3, ymmword ptr [rdi + 4*r8 + 64]
        vmovdqu ymm4, ymmword ptr [rdi + 4*r8 + 96]
        vpshufb ymm1, ymm1, ymm0
        vpshufb ymm2, ymm2, ymm0
        vpshufb ymm3, ymm3, ymm0
        vpshufb ymm4, ymm4, ymm0
        vmovdqu ymmword ptr [rdi + 4*r8], ymm1
        vmovdqu ymmword ptr [rdi + 4*r8 + 32], ymm2
        vmovdqu ymmword ptr [rdi + 4*r8 + 64], ymm3
        vmovdqu ymmword ptr [rdi + 4*r8 + 96], ymm4
        add     r8, 32
        cmp     rdx, r8
        jne     .LBB1_8

Ironically, the auto-vectorized demo2 is even better than manually using SIMD, since this

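// Needs a nightly compiler with #![feature(portable_simd)] at the crate root.
#[no_mangle]
pub fn demo3(pixels: &mut[[u8; 4]]) {
    use std::simd::*;
    for p in pixels {
        let bgra = u8x4::from_array(*p);
        // pick lanes 2, 1, 0, 3: [b, g, r, a] -> [r, g, b, a]
        let rgba = simd_swizzle!(bgra, [2, 1, 0, 3]);
        *p = rgba.to_array();
    }
}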

doesn't unroll, so while it uses a SIMD shuffle, it just ends up wasting most of the vector register.
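
If what you actually have is a flat `&mut [u8]` rather than a slice of `[u8; 4]`, the same byte-swap-and-rotate trick can be applied per 4-byte chunk. This is only a minimal sketch: `bgra_to_rgba_in_place` is a name I made up for illustration, it assumes leftover bytes that don't form a whole pixel should be left alone, and I haven't checked whether this variant gets the same nice codegen as `demo2`, so look at the assembly before relying on it.

```rust
/// Hypothetical helper (not from the code above): converts a flat BGRA
/// byte buffer to RGBA in place, without allocating.
/// Trailing bytes that don't make up a full 4-byte pixel are left untouched.
pub fn bgra_to_rgba_in_place(buf: &mut [u8]) {
    for px in buf.chunks_exact_mut(4) {
        // px is [b, g, r, a]; read big-endian that is 0xBBGGRRAA
        let bgra = u32::from_be_bytes([px[0], px[1], px[2], px[3]]);
        // swap_bytes -> 0xAARRGGBB, then rotate_left(8) -> 0xRRGGBBAA
        let rgba = bgra.swap_bytes().rotate_left(8);
        // write back as [r, g, b, a]
        px.copy_from_slice(&rgba.to_be_bytes());
    }
}
```

Calling it on `[1, 2, 3, 4, 5, 6, 7, 8, 9]` swaps the first and third byte of each complete pixel and leaves the stray ninth byte as it was.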
