Decompose Slice Efficiently

Hello,
I'm currently trying to convert YUYV to YUV video by decomposing the slice from v4l2.
The &[u8] that I get is composed of 4-byte blocks that are encoded as YUYV.
I want to split this slice into three (one for Ys, one for Us, one for Vs).
Currently, I'm doing:

fn main() {
  let buf: &[u8] = todo!();
  let y = buf.iter().step_by(2).copied().collect::<Vec<_>>();
  let u = buf.iter().skip(1).step_by(4).copied().collect::<Vec<_>>();
  let v = buf.iter().skip(3).step_by(4).copied().collect::<Vec<_>>();
  h264_encode(&y, &u, &v);
}

//Fixed, can't be changed (as it's part of a library)
fn h264_encode(y: &[u8], u: &[u8], v: &[u8]) {
  todo!()
}

As this code is executed on every frame, I'm wondering if there is a way to improve the performance of this snippet. Do you have any ideas?

Thanks
pfzetto

Have you benchmarked that your code is too slow? The cost to copy ~4MB
(assuming a full HD camera) should be negligible.

You might be interested in a library that abstracts this. The ndarray crate, for example, allows you to view a 2D array as its transpose: ArrayBase in ndarray - Rust
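
Roughly, with ndarray that could look like this (untested sketch; note the column views are non-contiguous, so the .to_vec() calls at the end still copy):

    use ndarray::ArrayView2;

    // View the same buffer as (len/2, 2) and (len/4, 4) matrices without copying.
    let pairs = ArrayView2::from_shape((buf.len() / 2, 2), buf).unwrap();
    let quads = ArrayView2::from_shape((buf.len() / 4, 4), buf).unwrap();
    // Column 0 of `pairs` is every Y; columns 1 and 3 of `quads` are the Us and Vs.
    let y = pairs.column(0).to_vec();
    let u = quads.column(1).to_vec();
    let v = quads.column(3).to_vec();
    h264_encode(&y, &u, &v);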

You'll need to benchmark it, but you might get better performance by iterating through buf only once instead of 3 times:

  assert!(buf.len() % 4 == 0);
  let mut yy = Vec::with_capacity(buf.len() / 2);
  let mut uu = Vec::with_capacity(buf.len() / 4);
  let mut vv = Vec::with_capacity(buf.len() / 4);

  // If the buffer length is always the same, the lines above can be done
  // during setup to avoid allocations during the hot loop.  

  let mut chunks = buf.chunks_exact(4);
  while let Some(&[y1, u, y2, v]) = chunks.next() {
      yy.push(y1);
      yy.push(y2);
      uu.push(u);
      vv.push(v);
  }
  
  h264_encode(&yy, &uu, &vv);

Can also use nightly's array_chunks to make the match infallible.
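
Something like this (untested, nightly-only), replacing the while let loop above:

    #![feature(array_chunks)] // nightly feature

    // `array_chunks::<4>()` yields `&[u8; 4]`, so the pattern is irrefutable.
    for &[y1, u, y2, v] in buf.array_chunks::<4>() {
        yy.push(y1);
        yy.push(y2);
        uu.push(u);
        vv.push(v);
    }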


Rust doesn't support non-contiguous arrays, at least not currently. This means that there is no performant way to do what you want. Either you eat the cost of copies, or, if you really need performance, you write your encoding function as a single pass over the initial slice, with explicit indexing. Ideally you write the loop via SIMD intrinsics, since autovectorization is usually not good enough for performance-critical number-crunching loops.


You can try reusing allocations if you are processing more than one frame.
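
Untested sketch of what that could look like (the Planes struct and split_frame are just illustrative names):

    struct Planes {
        y: Vec<u8>,
        u: Vec<u8>,
        v: Vec<u8>,
    }

    // Clear (but don't free) the buffers, then refill them from the new frame.
    fn split_frame(buf: &[u8], p: &mut Planes) {
        p.y.clear();
        p.u.clear();
        p.v.clear();
        for chunk in buf.chunks_exact(4) {
            let &[y1, u, y2, v] = chunk else { unreachable!() };
            p.y.push(y1);
            p.y.push(y2);
            p.u.push(u);
            p.v.push(v);
        }
    }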

Another unbenchmarked idea: use slices into preallocated memory.

    use std::iter::zip;

    let mut frame = vec![0; buf.len()];
    let (yy, uv) = frame.split_at_mut(buf.len() / 2);
    let (u, v) = uv.split_at_mut(buf.len() / 4);

    let iter = zip(
        zip(buf.chunks_exact(4), yy.chunks_exact_mut(2)),
        zip(u.iter_mut(), v.iter_mut()),
    );

    for ((src, yy), (uu, vv)) in iter {
        let &[y1, u, y2, v] = src else { unreachable!() }; // YUYV order
        yy[0] = y1;
        yy[1] = y2;
        *uu = u;
        *vv = v;
    }

    h264_encode(yy, u, v);

This would fit nicely with the "reuse allocations" approach @manpacket mentioned (the function body above would take a &mut [_] or three, or a &mut Vec<_> if the frames aren't a consistent size).

If you have mutable access to the buffer, you could transpose it in place from YUYV… to Y…U…V…. And even if not, a small optimization would be to just allocate a single vector that you append Ys, Us, and Vs to and then slice it into three parts. Also, chunks_exact is usually faster than chunks.
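
For the single-vector variant, something like this (untested; three passes over buf kept for clarity):

    // One Vec holds Y…U…V… back to back; slice it into three views afterwards.
    let mut planes = Vec::with_capacity(buf.len());
    planes.extend(buf.chunks_exact(4).flat_map(|c| [c[0], c[2]])); // Ys
    planes.extend(buf.iter().skip(1).step_by(4)); // Us
    planes.extend(buf.iter().skip(3).step_by(4)); // Vs
    let (y, uv) = planes.split_at(buf.len() / 2);
    let (u, v) = uv.split_at(buf.len() / 4);
    h264_encode(y, u, v);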


One thing to try, independent from optimizing the copy itself, is to use multiple threads; one thread does the data copying, and the other thread calls h264_encode(). This way the work is “pipelined”: the copies for frame N can happen while the encoding of frame N-1 is happening on another core. You would use std::sync::mpsc channels to transport the prepared Y,U,V vectors from one thread to the next, and perhaps also to circulate them back to the beginning to reuse the allocations.

This is a very general technique for improving the performance of any workload made up of multiple tasks performed on a sequence of inputs: even if throughput is ultimately limited by a single step that can't be improved (e.g. the encoder in this case), you gain a little performance by never having that thread spend time on anything other than running that step.
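
In sketch form (untested; capture_frames() is a stand-in for whatever yields frames from v4l2):

    use std::sync::mpsc;
    use std::thread;

    // Prepared (Y, U, V) planes flow one way; empty buffers flow back so the
    // allocations get reused.
    let (full_tx, full_rx) = mpsc::channel::<(Vec<u8>, Vec<u8>, Vec<u8>)>();
    let (empty_tx, empty_rx) = mpsc::channel::<(Vec<u8>, Vec<u8>, Vec<u8>)>();

    // Encoder thread: frame N-1 is encoded here while frame N is being split.
    let encoder = thread::spawn(move || {
        for (y, u, v) in full_rx {
            h264_encode(&y, &u, &v);
            let _ = empty_tx.send((y, u, v)); // hand the allocations back
        }
    });

    for buf in capture_frames() { // stand-in for the v4l2 capture loop
        // Reuse a returned buffer set if one is available, else allocate fresh.
        let (mut y, mut u, mut v) = empty_rx.try_recv().unwrap_or_default();
        y.clear();
        u.clear();
        v.clear();
        for chunk in buf.chunks_exact(4) {
            y.push(chunk[0]);
            y.push(chunk[2]);
            u.push(chunk[1]);
            v.push(chunk[3]);
        }
        full_tx.send((y, u, v)).unwrap();
    }
    drop(full_tx); // close the channel so the encoder thread finishes
    encoder.join().unwrap();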


LLVM can vectorize a perfect unshuffle (compiler explorer). In this case you can chunk the input into tetrads of 4 vectors, unshuffle each tetrad from YUYV to YYUV, and then unshuffle the UV portion. That is, roughly:

let (y0, uv0) = unshuffle(&input0, &input1);
let (y1, uv1) = unshuffle(&input2, &input3);
let (u, v) = unshuffle(&uv0, &uv1);
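
To make that concrete, here's a rough scalar sketch of one unshuffle step (fixed 32-byte blocks, no intrinsics; the hope is that LLVM vectorizes the loop):

    // Perfect unshuffle of two interleaved blocks: even-indexed bytes of the
    // concatenation go to the first output, odd-indexed bytes to the second.
    fn unshuffle(a: &[u8; 32], b: &[u8; 32]) -> ([u8; 32], [u8; 32]) {
        let mut even = [0u8; 32];
        let mut odd = [0u8; 32];
        for i in 0..16 {
            even[i] = a[2 * i];
            odd[i] = a[2 * i + 1];
            even[16 + i] = b[2 * i];
            odd[16 + i] = b[2 * i + 1];
        }
        (even, odd)
    }

    // Applied to YUYV data: the first unshuffle yields (Y…, UVUV…),
    // the second yields (U…, V…).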