Can I make this high speed output safe?

So I have an existing project which uses some C code to write a u16 array to a u8 buffer for output with an endianess transform and I've been trying to go as fast in Rust.

My first attempt would not quite perform as fast as the C code:

pub fn flatten_u16_as_u16_safe(image: *const u16, output: *mut u8, pixels: usize) {
    let image = unsafe { std::slice::from_raw_parts(image, pixels) };
    let output_u8 = unsafe { std::slice::from_raw_parts_mut(output, pixels * 2) };

    for (pixel, output_bytes) in image.into_iter().zip(output_u8.chunks_exact_mut(2)) {
        let pixel_bytes = pixel.to_be_bytes();
        output_bytes[0] = pixel_bytes[0];
        output_bytes[1] = pixel_bytes[1];
    }
}

But if called from Rust could be completely safe.

The C version does what I know now is an unsafe pointer cast from u8 to u16 and if I copy this in rust it generates the same basic assembly which looks pretty optimal. The rust code is:

pub fn flatten_u16_as_u16(image: *const u16, output: *mut u8, pixels: usize) {
    let image = unsafe { std::slice::from_raw_parts(image, pixels) };
    // The output as *mut u16 below is UB though I've seen no penalty on x64
    let output_u16 = unsafe {std::slice::from_raw_parts_mut(output as *mut u16, pixels)};

    for (in_pixel, out_pixel) in image.into_iter().zip(output_u16.iter_mut()) {
        *out_pixel = in_pixel.swap_bytes()
    }
}

This produces the ideal vectorised loop code of basically:

  • Read into vector register.
  • Shuffle in register (for endianess switch)
  • Write out to memory.

But the safe version seems to generate extra code which extracts the u8 elements from the u16 and writes them independently which I guess is closer to the code semantics but around 50%-100% slower.

So the challenge is, can I achieve the same performance without the UB pointer cast to u16?

Other attempt was to use std::io::Write and a cursor but this was significantly slower.

EDIT: Here is a godbolt link with the different attempts Compiler Explorer

How is output allocated? Do you have anything to give it a larger alignment?

Your methods take raw pointers, so I could pass a dangling or null pointer to them, which would not be safe.

2 Likes

Replacing the body of the first for loop with output_bytes.copy_from_slice(&pixel.to_be_bytes()); seems to generate pretty much the same assembly Compiler Explorer

It's generally considered bad practice to make a safe function that can produce UB if called with the wrong arguments (in this case invalid pointers). You should either mark it as unsafe (and document the safety preconditions and why the internal unsafe operations are justified), or change it to actually be safe by taking slice references in input. In the latter case you can add an assert at the start to check that the length are the expected ones and that knowledge will be used by LLVM to optimize your code Compiler Explorer

Finally, note that swap_bytes will produce Little Endian bytes if called on a Big Endian platform. If you just want Big Endian bytes in any case then .to_be() will be more explicit.

7 Likes

Ah yes great question. Right now I can't control it as it is called as a plugin. Hopefully I can get more written in Rust and then I can drop the whole pointer interface for Rust and then I probably can.

Then I guess I can use the align_to function to convert which I've used elsewhere.

Also yes probably should have addressed the null pointer safety too!

This looks like the solution and copy_from_slice is the gap in my knowledge I've needed a few times.

I'll benchmark it later but the assembly looks like the optimal version I expected.

Yep perfect - exact performance match for the C implementation with added safety if I can convert the callers to Rust as well!

1 Like

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.