The fastest way to copy a buffer (BGRA to RGBA)

Hi, I have a BGRA buffer of size = width * height * 4 and type &[u8]. I need to get a copy of this buffer but the channels should be RGBA.

This code runs in ~10ms

let width = 1920;
let height = 1080;
let pixel_count = width * height;
let mut rgba_buffer = vec![0u8; pixel_count * 4];

for i in 0..pixel_count {
    let index = i * 4;
    rgba_buffer[index] = buffer[index + 2];
    rgba_buffer[index + 1] = buffer[index + 1];
    rgba_buffer[index + 2] = buffer[index];
    rgba_buffer[index + 3] = buffer[index + 3];
}

For comparison, buffer.to_vec() runs in ~1ms.

How can I make this code faster?

I don't know if this is the fastest method. I suspect copying larger chunks is generally better. Copying u64s is probably ideal, but you can do u32s easily by swapping endianness:

for (src, dst) in buffer.chunks_exact(4).zip(rgba_buffer.chunks_exact_mut(4)) {
    let rgba = u32::from_be_bytes(src.try_into().unwrap());
    dst.copy_from_slice(&rgba.to_le_bytes());
}
2 Likes

Excuse me if I am missing something, but I think that swaps all four bytes (0, 1, 2, 3 -> 3, 2, 1, 0) instead of just swapping the zeroth and second bytes (0, 1, 2, 3 -> 2, 1, 0, 3) like OP asked.
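For anyone who wants to verify, here is a minimal check (a sketch of my own, not OP's code):

```rust
fn main() {
    let src: [u8; 4] = [1, 2, 3, 4]; // B, G, R, A

    // What the snippet above does: read big-endian, write little-endian,
    // which reverses all four bytes.
    let val = u32::from_be_bytes(src);
    let reversed = val.to_le_bytes();
    assert_eq!(reversed, [4, 3, 2, 1]); // A, R, G, B — not what OP wants

    // What OP asked for: swap bytes 0 and 2 only.
    let swapped = [src[2], src[1], src[0], src[3]];
    assert_eq!(swapped, [3, 2, 1, 4]); // R, G, B, A
}
```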

3 Likes

No, no, no. You are thinking in the wrong direction.

All modern CPUs (x86-64 with SSSE3+, ARM with NEON, and RISC-V with vector extensions) have instructions that can process at least four pixels at once. And if you add microarchitecture levels, then x86-64-v2 should be able to process four pixels in one instruction, x86-64-v3 eight, and x86-64-v4 sixteen.
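To make the per-pixel shuffle concrete, here is one scalar formulation (my own sketch, assuming pixels packed into little-endian u32s) that expresses exactly the byte movement such a shuffle instruction performs:

```rust
// BGRA -> RGBA on one pixel packed into a little-endian u32:
// byte 0 = B, byte 1 = G, byte 2 = R, byte 3 = A.
// Swap bytes 0 and 2; G and A stay in place.
fn bgra_to_rgba(px: u32) -> u32 {
    let b = px & 0x0000_00FF;          // isolate B (byte 0)
    let r = (px >> 16) & 0x0000_00FF;  // isolate R (byte 2)
    (px & 0xFF00_FF00) | (b << 16) | r // keep G and A, swap B and R
}

fn main() {
    // Pixel with B=1, G=2, R=3, A=4 -> expect R=3, G=2, B=1, A=4.
    let px = u32::from_le_bytes([1, 2, 3, 4]);
    assert_eq!(bgra_to_rgba(px).to_le_bytes(), [3, 2, 1, 4]);
}
```

Whether the autovectorizer turns a loop over this into a single wide shuffle depends on the target and optimization level, as discussed below.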

It looks as if clang can do that in x86-64-v2 mode and in x86-64-v3 mode but not in x86-64-v4 mode automatically, without using intrinsics.

Rust uses the same LLVM backend, so it should, in theory, be capable of doing the same thing, shouldn't it?

But I have no idea how to trigger that, except by calling intrinsics or writing assembly directly. A naive implementation still does a bazillion operations instead of one.

2 Likes

I think @parasyte's idea is the right direction. This code will get vectorized using (v)pshufb:

for (src, dst) in buffer.chunks_exact(4).zip(rgba_buffer.chunks_exact_mut(4)) {
    let [b, g, r, a] = src.try_into().unwrap();
    dst.copy_from_slice(&[r, g, b, a]);
}
7 Likes
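For completeness, here is that loop wrapped into a standalone function — a sketch; the function name and the assumption that buffer.len() is a multiple of 4 are mine:

```rust
/// Copy a BGRA byte buffer into a new RGBA buffer.
/// Assumes buffer.len() % 4 == 0; trailing bytes would be left as 0.
fn bgra_to_rgba(buffer: &[u8]) -> Vec<u8> {
    let mut rgba_buffer = vec![0u8; buffer.len()];
    for (src, dst) in buffer
        .chunks_exact(4)
        .zip(rgba_buffer.chunks_exact_mut(4))
    {
        let [b, g, r, a] = src.try_into().unwrap();
        dst.copy_from_slice(&[r, g, b, a]);
    }
    rgba_buffer
}

fn main() {
    // Two pixels: (B,G,R,A) = (1,2,3,4) and (5,6,7,8).
    let bgra = [1u8, 2, 3, 4, 5, 6, 7, 8];
    assert_eq!(bgra_to_rgba(&bgra), vec![3, 2, 1, 4, 7, 6, 5, 8]);
}
```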

You are right. One array is also vectorized with -Copt-level=3.

I guess I assumed that Rust would be closer to clang or gcc (where -O2 enabled vectorization), but apparently that's not how rust works…

And gcc vectorizes with -O2, and it can even process 16 pixels at once in x86-64-v4 mode while clang couldn't do that, so it's not a case of "Rust uses gcc-like options".

@parasyte @jendrikw @SebastianJL thanks for the solution. I was actually testing in Debug mode :sweat_smile:. In Release mode my for loop runs about 2× slower than buffer.to_vec(), but your code works as fast as buffer.to_vec().

What does the vectorizing is, of course, LLVM, and as far as I know rustc just passes the -O flag directly to it. -O2 absolutely does enable vectorization, but it all depends on how much work the optimizer puts into it. At -O2 it runs out of fuel faster than at -O3. Vectorization is not a binary setting; it's "how hard/long do you want me to work to find opportunities to vectorize".

It is not "endianness". G and A don't move.

I am absolutely not thinking in the wrong direction. Just made an honest mistake.

A different approach that also gets vectorized properly is to use strongly-typed buffers, and then rely on something like bytemuck when you need to talk to an API that expects &[u8].

#[repr(C)]
#[derive(Copy, Clone)]
pub struct Rgba { r: u8, g: u8, b: u8, a: u8 }

#[repr(C)]
#[derive(Copy, Clone)]
pub struct Bgra { b: u8, g: u8, r: u8, a: u8 }

impl From<Rgba> for Bgra {
    fn from(Rgba { r, g, b, a }: Rgba) -> Bgra {
        Bgra { b, g, r, a }
    }
}

pub fn typed_convert(buffer: &[Rgba]) -> Vec<Bgra> {
    buffer.iter().copied().map(Into::into).collect()
}
5 Likes
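If you only have a raw &[u8] and don't want to pull in bytemuck, here is a stdlib-only sketch of the same typed idea, going the BGRA -> RGBA direction OP needs (the struct and function are my own self-contained version, not the reply's code):

```rust
#[repr(C)]
#[derive(Copy, Clone, PartialEq, Debug)]
struct Rgba { r: u8, g: u8, b: u8, a: u8 }

/// Parse a BGRA byte buffer into typed RGBA pixels.
/// Assumes buffer.len() % 4 == 0; any trailing bytes are ignored.
fn typed_convert(buffer: &[u8]) -> Vec<Rgba> {
    buffer
        .chunks_exact(4)
        .map(|px| {
            // Destructure in BGRA byte order, construct in RGBA field order.
            let [b, g, r, a] = px.try_into().unwrap();
            Rgba { r, g, b, a }
        })
        .collect()
}

fn main() {
    let bgra = [1u8, 2, 3, 4]; // B=1, G=2, R=3, A=4
    assert_eq!(typed_convert(&bgra), vec![Rgba { r: 3, g: 2, b: 1, a: 4 }]);
}
```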

While it's not directly applicable -- you don't have the 4-bytes down to 3-bytes problem -- you might be interested in all the various approaches discussed in Converting a BGRA &[u8] to RGB [u8;N] (for images)? - #12 by scottmcm

Also, I wonder if you might consider doing it in-place, to avoid the allocation.

Sadly this doesn't do great, as it doesn't get unrolled at all:

pub fn demo(pixels: &mut [[u8; 4]]) {
    for p in pixels {
        let [b, g, r, a] = *p;
        *p = [r, g, b, a];
    }
}

But, because optimization is non-intuitive, this works great:

pub fn demo2(pixels: &mut [[u8; 4]]) {
    for p in pixels {
        let bgra = u32::from_be_bytes(*p);
        let argb = bgra.swap_bytes();
        let rgba = argb.rotate_left(8);
        *p = rgba.to_be_bytes();
    }
}

with a core loop on AVX2 of a bunch of smart 256-bit-wide shuffles https://rust.godbolt.org/z/f7s1rKxEY:

.LBB1_8:
        vmovdqu ymm1, ymmword ptr [rdi + 4*r8]
        vmovdqu ymm2, ymmword ptr [rdi + 4*r8 + 32]
        vmovdqu ymm3, ymmword ptr [rdi + 4*r8 + 64]
        vmovdqu ymm4, ymmword ptr [rdi + 4*r8 + 96]
        vpshufb ymm1, ymm1, ymm0
        vpshufb ymm2, ymm2, ymm0
        vpshufb ymm3, ymm3, ymm0
        vpshufb ymm4, ymm4, ymm0
        vmovdqu ymmword ptr [rdi + 4*r8], ymm1
        vmovdqu ymmword ptr [rdi + 4*r8 + 32], ymm2
        vmovdqu ymmword ptr [rdi + 4*r8 + 64], ymm3
        vmovdqu ymmword ptr [rdi + 4*r8 + 96], ymm4
        add     r8, 32
        cmp     rdx, r8
        jne     .LBB1_8
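As a sanity check that the swap-then-rotate in demo2 really lands the bytes in [r, g, b, a] order (my own assertion; endian-independent, since from_be_bytes/to_be_bytes pin the byte order):

```rust
fn main() {
    let p: [u8; 4] = [1, 2, 3, 4];      // b, g, r, a
    let bgra = u32::from_be_bytes(p);   // bytes: [b, g, r, a]
    let argb = bgra.swap_bytes();       // bytes: [a, r, g, b]
    let rgba = argb.rotate_left(8);     // bytes: [r, g, b, a]
    assert_eq!(rgba.to_be_bytes(), [3, 2, 1, 4]);
}
```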

Ironically, that's even better than manually using SIMD, since this

#[no_mangle]
pub fn demo3(pixels: &mut [[u8; 4]]) {
    use std::simd::*;
    for p in pixels {
        let bgra = u8x4::from_array(*p);
        let rgba = simd_swizzle!(bgra, [2, 1, 0, 3]);
        *p = rgba.to_array();
    }
}

doesn't unroll, so while it uses a SIMD shuffle, it just ends up wasting most of the vector register.

10 Likes