Converting a BGRA &[u8] to RGB [u8;N] (for images)?

So I want to take the output of fast_image_resize's Image buffer, a &[u8] of BGRA pixels, and transform it into an RGB array (well, a vector works as well, which I can then convert).

I'm trying to get it working through the rgb crate, which seems to be able to do this, but I'm stuck on how to satisfy .into(). My current approach:

// bgra is a &[BGRA<u8>]
let bgra = dst_image.buffer().as_bgra();
let rgb: rgb::RGB8 = bgra.iter().map(|pixel| pixel.into()).collect();

Yields:

error[E0277]: the trait bound `u8: From<&BGRA<u8>>` is not satisfied
   --> src/test.rs:50:59
    |
147 | ...                   let rgb: rgb::RGB8 = bgra.iter().map(|pixel| pixel.into()).collect();
    |                                                                          ^^^^ the trait `From<&BGRA<u8>>` is not implemented for `u8`
    |
    = help: the following implementations were found:
              <u8 as From<NonZeroU8>>
              <u8 as From<bool>>
    = note: required because of the requirements on the impl of `Into<u8>` for `&BGRA<u8>`

Converting &[u8] to [u8; N] is possible through an existing TryFrom implementation in the standard library. It's a fallible conversion because it only succeeds if N is indeed the correct length. If your code is structured in a way such that you know it will never fail, you can just use try_into().unwrap() to do the conversion.

fn demonstration(x: &[u8]) -> [u8; 13] {
    x.try_into().unwrap()
}

(playground)


Whoops, should've probably made the title less generic. My problem mostly lies in the bgra -> rgb part of the conversion, which I'm not sure how to approach. (Hence the broadness of the original question.)

On a sidenote, it's guaranteed the &[u8] will always be 4 by 1 pixels, if that changes anything.

Ah, I see. I'm still not fully understanding your task. Your code seems to take a &[u8], turn it into a &[BGRA<u8>], and then want only a single RGB<u8> out of it. Is the original &[u8] only 4 elements long, or do you want multiple RGB<u8> values out of it in the end?

Edit: Or perhaps the type let rgb: rgb::RGB8 = ... is inaccurate. Is the length known at compile time? And do you really want an array, or is the data so large that a vector on the heap is more appropriate anyway, to avoid stack overflows?

Edit2: Ah, small and known size. Never mind, understood.

Oh yeah, that's definitely wrong. It probably should be [RGB8; 4]/Vec<RGB8> for it to make any sense to start with for my use case.

Maybe just write a loop like

let mut rgb: [RGB8; 4] = <_>::default();
for i in 0..4 {
    rgb[i] = RGBA::from(bgra[i]).rgb();
}
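
Or, with iterators, collecting into a Vec<RGB8> (a sketch reusing the same From conversion as the loop above; collecting into a Vec rather than a single RGB8 is what the original collect() attempt was missing):

// Same conversion per pixel, but collected into a Vec<RGB8>.
let rgb: Vec<RGB8> = bgra.iter().map(|&px| RGBA::from(px).rgb()).collect();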

Took inspiration from just doing a loop, ended up manually doing the conversion. Thanks for the help anyways!

let bgr_arr: [u8; 16] = dst_image.buffer().try_into().unwrap();
let mut rgb: [u8; 12] = [0; 12];

// Each output triple reverses B,G,R from the matching 4-byte BGRA
// pixel and skips its alpha byte.
for i in (0..12).step_by(3) {
    rgb[i] = bgr_arr[i / 3 * 4 + 2];
    rgb[i + 1] = bgr_arr[i / 3 * 4 + 1];
    rgb[i + 2] = bgr_arr[i / 3 * 4];
}

(Pretty sure there should be some more optimized way of doing this out there though)

I'm pretty sure that all of the approaches above, yours or the one I presented, should be fully optimized: the whole loop unrolled, etc. Don't worry about performance here.

(Except you might be referring to cleaner code in terms of what to "optimize".)

(Approaches that use Vec might be less efficient, because allocations tend not to be optimized away; but we haven't discussed any approach using Vecs, so that's not a problem.)


Here's a couple of approaches using chunks_exact (one allocating, one not).
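
In rough shape (a sketch of the two; the RGB8 struct literal assumes the rgb crate's public r/g/b fields, and the original playground isn't reproduced here):

use rgb::RGB8;

// Allocating: collect each 4-byte BGRA chunk into an RGB8 pixel.
fn bgra_to_rgb_vec(bgra: &[u8]) -> Vec<RGB8> {
    bgra.chunks_exact(4)
        .map(|px| RGB8 { r: px[2], g: px[1], b: px[0] })
        .collect()
}

// Non-allocating: write RGB triples into a caller-provided buffer.
fn bgra_to_rgb_into(bgra: &[u8], out: &mut [u8]) {
    for (src, dst) in bgra.chunks_exact(4).zip(out.chunks_exact_mut(3)) {
        dst[0] = src[2];
        dst[1] = src[1];
        dst[2] = src[0];
    }
}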


Gotta love the simplicity of the assembly from something like

pub fn convert(bgr_arr: &[u8; 16]) -> [u8; 12] {
    let mut rgb: [u8; 12] = [0; 12];
    for (src, dst) in bgr_arr.chunks_exact(4).zip(rgb.chunks_exact_mut(3)) {
        dst[0] = src[2];
        dst[1] = src[1];
        dst[2] = src[0];
    }
    rgb
}
playground::convert:
	pushq	%r15
	pushq	%r14
	pushq	%rbx
	movzbl	2(%rdi), %r8d
	movzbl	1(%rdi), %eax
	movzbl	(%rdi), %r9d
	movzbl	6(%rdi), %r10d
	movzbl	5(%rdi), %r11d
	movzbl	4(%rdi), %ecx
	movzbl	10(%rdi), %esi
	movzbl	9(%rdi), %r15d
	movzbl	8(%rdi), %r14d
	movzbl	14(%rdi), %edx
	movzbl	13(%rdi), %ebx
	movzbl	12(%rdi), %edi
	shlq	$24, %rdi
	shlq	$16, %rbx
	orq	%rdi, %rbx
	shlq	$8, %rdx
	orq	%rbx, %rdx
	orq	%r14, %rdx
	shlq	$56, %r15
	shlq	$48, %rsi
	orq	%r15, %rsi
	shlq	$40, %rcx
	orq	%rsi, %rcx
	shlq	$32, %r11
	orq	%rcx, %r11
	shlq	$24, %r10
	orq	%r11, %r10
	shlq	$16, %r9
	orq	%r10, %r9
	shlq	$8, %rax
	orq	%r9, %rax
	orq	%r8, %rax
	popq	%rbx
	popq	%r14
	popq	%r15
	retq

Dang, this sure is something. Don't know if it's specific to Rust, but I sure enjoy syntax like this. It just "makes sense".


Well, all these one-byte-at-a-time moves nerd-sniped me :sweat_smile: As usual, LLVM is really bad at merging unaligned loads and stores, so even though this is all fully-unrolled and such, I think we can convince it to do better.

Playground test that all the things I mention here do the same thing: https://play.rust-lang.org/?version=nightly&mode=release&edition=2021&gist=89af0bfed2ad4af72871e16e662c7401
Godbolt demo of the same things, for looking at assembly: https://rust.godbolt.org/z/da7jeqfv4

First, I was curious if just making it as obvious as possible would make it easier on LLVM:

pub fn convert_raw(bgra: &[u8; 16]) -> [u8; 12] {
    let [
        b0, g0, r0, _a0,
        b1, g1, r1, _a1,
        b2, g2, r2, _a2,
        b3, g3, r3, _a3,
    ] = *bgra;
    [
        r0, g0, b0,
        r1, g1, b1,
        r2, g2, b2,
        r3, g3, b3,
    ]
}

It didn't -- it produced the same assembly as quinedot's zip-chunks-exact -- but I like the clarity of that one, and it made a great pinning test against which to check all the rest.

With std::simd's i8x16 available in nightly now, I figured I'd try that too, since that should definitely load up the whole thing at once. My first from_array+scatter attempt was horrifying, even on a newer target-cpu: https://rust.godbolt.org/z/8397zvT4z. The swizzle version is nice and tidy if you target something with SSSE3, but on the default x64 target it's still super-ugly: https://rust.godbolt.org/z/esbjxzzzY
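
For reference, the swizzle version is roughly this shape (a sketch against a recent nightly's portable_simd; the API has shifted over time, so treat the exact names as assumptions):

#![feature(portable_simd)]
use std::simd::{simd_swizzle, u8x16};

pub fn convert_swizzle(bgra: &[u8; 16]) -> [u8; 12] {
    let v = u8x16::from_array(*bgra);
    // Reverse each B,G,R triple; the four alpha bytes are shuffled to
    // the tail lanes, which the final truncation throws away.
    let rgb_first = simd_swizzle!(v, [2, 1, 0, 6, 5, 4, 10, 9, 8, 14, 13, 12, 3, 7, 11, 15]);
    rgb_first.to_array()[..12].try_into().unwrap()
}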

In looking at that, I noticed that at the LLVM level, rustc is (currently) using i96 as the ABI return type for the [u8; 12]. So that inspired me to phrase this as integer operations:

pub fn convert_via_shifting(bgra: &[u8; 16]) -> [u8; 12] {
    let bgra = bgra.as_chunks().0;
    let mut buffer = 0_u128;
    // Pack pixels back-to-front: a big-endian read of [B,G,R,A] is
    // 0xBBGGRRAA, and the >> 8 drops the alpha, leaving 0xBBGGRR.
    for bgra in bgra.iter().cloned().rev() {
        buffer <<= 24;
        buffer |= u128::from(u32::from_be_bytes(bgra) >> 8);
    }
    // Re-serializing little-endian emits each pixel's bytes as R,G,B.
    buffer.to_le_bytes().as_chunks().0[0]
}

That comes out surprisingly well, including things like LLVM using SHLD for the double-precision shift in the one place that actually needs it after the unrolling.

But as nice as that is, the shifts and ors still felt a bit unnecessary to me. Here's the version that produces the fewest assembly instructions of all of them, for the default x64 target:

pub fn convert_via_overwriting(bgra: &[u8; 16]) -> [u8; 12] {
    let bgra = bgra.as_chunks().0;
    let mut padded_rgb = [0; 13];
    for i in (0..4).rev() {
        let pixel = u32::from_be_bytes(bgra[i]);
        padded_rgb[i*3..][..4].copy_from_slice(&pixel.to_le_bytes());
    }

    padded_rgb[1..].as_chunks().0[0]
}

Basically, this one copies the alpha around too -- since it's actually easier to copy 4 bytes than 3 -- in just the right order so the alpha never makes it into the output:

example::convert_via_overwriting:
        sub     rsp, 16
        mov     eax, dword ptr [rdi + 12]
        bswap   eax
        mov     dword ptr [rsp + 9], eax
        mov     eax, dword ptr [rdi + 8]
        bswap   eax
        mov     dword ptr [rsp + 6], eax
        mov     eax, dword ptr [rdi + 4]
        bswap   eax
        mov     dword ptr [rsp + 3], eax
        mov     eax, dword ptr [rdi]
        bswap   eax
        mov     dword ptr [rsp], eax
        mov     edx, dword ptr [rsp + 9]
        mov     rax, qword ptr [rsp + 1]
        add     rsp, 16
        ret

Left as an exercise for the reader: benchmark all this with Criterion.rs to see whether there's a speed difference at all between these.
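
A minimal Criterion harness for one of them could look like this (a sketch: benches/convert.rs and mycrate are placeholders for wherever the functions actually live):

use criterion::{black_box, criterion_group, criterion_main, Criterion};
// Placeholder path: import the conversion function from your own crate.
use mycrate::convert_via_overwriting;

fn bench_convert(c: &mut Criterion) {
    let bgra = [0x42_u8; 16];
    c.bench_function("bgra_to_rgb/overwrite", |b| {
        b.iter(|| convert_via_overwriting(black_box(&bgra)))
    });
}

criterion_group!(benches, bench_convert);
criterion_main!(benches);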


Sure, why not:

Without target-cpu=native: [benchmark plot omitted]

With target-cpu=native: [benchmark plot omitted]

(Order on the graphs: chunks, loop, overwrite, raw, scatter, shift, swizzle.)


With target-cpu=native, performance:

  • no change
    • bgra×4_to_rgb×4 (by move)/loop (+2.53%, p = 0.06)
    • bgra×4_to_rgb×4 (by move)/scatter (-1.14%, p = 0.06)
    • bgra×4_to_rgb×4 (by ref)/raw (+2.13%, p = 0.38)
    • bgra×4_to_rgb×4 (by ref)/overwrite (+0.126%, p = 0.88)
  • within noise
    • bgra×4_to_rgb×4 (by ref)/loop (+2.62%)
  • improved
    • bgra×4_to_rgb×4 (by move)/chunks (-14.7%)
    • bgra×4_to_rgb×4 (by move)/raw (-13.42%)
    • bgra×4_to_rgb×4 (by move)/swizzle (-29.1%)
    • bgra×4_to_rgb×4 (by move)/shift (-18.3%)
    • bgra×4_to_rgb×4 (by ref)/scatter (-9.88%)
    • bgra×4_to_rgb×4 (by ref)/swizzle (-39.6%)
    • bgra×4_to_rgb×4 (by ref)/shift (-18.9%)
  • regressed (?)
    • bgra×4_to_rgb×4 (by move)/overwrite (+11.0%)
    • bgra×4_to_rgb×4 (by ref)/chunks (+7.59%)

rustc 1.58.0-nightly (6d246f0c8 2021-11-26)
AMD Ryzen 7 3700X 8-Core Processor 3.59 GHz
32 GB RAM
Microsoft Windows Version 21H1 (OS Build 19043.1348)


Quick observations:

  • chunks and raw are basically equivalent
  • loop, overwrite, and scatter are the poorer options
  • shift and swizzle trade blows for the fastest
  • something interesting is going on with my test, because two options regressed with native CPU target
  • it's surprisingly difficult to put a quick summary of criterion data into a Discourse post
  • don't trust me; I'm not a statistician
  • do your own benchmarks
  • this microbenchmark is on the order of single-digit cycles at the low end, so likely extremely noisy (and very unlikely to ever matter)

Wow, thanks for doing all that!

I learned something today!

 example::convert_via_overwriting:
          sub     rsp, 16
          mov     eax, dword ptr [rdi + 12]
-         bswap   eax
-         mov     dword ptr [rsp + 9], eax
-         mov     eax, dword ptr [rdi + 8]
+         mov     ecx, dword ptr [rdi + 8]
-         bswap   eax
-         mov     dword ptr [rsp + 6], eax
-         mov     eax, dword ptr [rdi + 4]
+         mov     edx, dword ptr [rdi + 4]
-         bswap   eax
-         mov     dword ptr [rsp + 3], eax
+         movbe   dword ptr [rsp + 9], eax
+         movbe   dword ptr [rsp + 6], ecx
-         mov     eax, dword ptr [rdi]
+         mov     ecx, dword ptr [rdi]
-         bswap   eax
+         movbe   dword ptr [rsp + 3], edx
-         mov     dword ptr [rsp], eax
+         movbe   dword ptr [rsp], ecx
          mov     edx, dword ptr [rsp + 9]
          mov     rax, qword ptr [rsp + 1]
          add     rsp, 16
          ret 

Turns out there's a move-with-byte-swap instruction under the CPUID flag MOVBE. I'm surprised it's slower, though. I'd have expected it to decode into the same or better micro-ops, and thus be basically as fast as bswap+mov. This Stack Overflow answer says it's one μOp, though, which ought to be faster (assuming the uncited source is correct). But that's the difference between them, at least.

Well, colour me impressed. I would never have guessed that the 15 shuffle instructions in the non-native version would do well. Kudos to LLVM for knowing what to output for it, and to the chip designers for black magic.

I guess this, combined with the previous, really makes a great demonstration of just how cheap arithmetic and bitops and such are in our modern huge desktop chips. If I had to guess, overwrite's Achilles heel is that it cares about memory -- the stores to stack probably block each other, since they overlap. And I should have known that there's nothing worse you can do to a processor than make it wait for memory, even stack memory that's in L1 cache.


I'd guess that the overlapping-write version is slower because it's probably blocking store-to-load forwarding, so the CPU has to write the data out completely to the L1 cache before it can read it again, rather than just reading directly from the queued stores.

