I have an existing project which uses some C code to write a u16 array to a u8 buffer for output, with an endianness transform, and I've been trying to match its speed in Rust.
My first attempt does not quite perform as fast as the C code:
pub fn flatten_u16_as_u16_safe(image: *const u16, output: *mut u8, pixels: usize) {
    let image = unsafe { std::slice::from_raw_parts(image, pixels) };
    let output_u8 = unsafe { std::slice::from_raw_parts_mut(output, pixels * 2) };
    for (pixel, output_bytes) in image.into_iter().zip(output_u8.chunks_exact_mut(2)) {
        let pixel_bytes = pixel.to_be_bytes();
        output_bytes[0] = pixel_bytes[0];
        output_bytes[1] = pixel_bytes[1];
    }
}
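The unsafe blocks here only exist to build slices from the raw pointers; if the function were called from Rust with slices, it could be completely safe.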
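A minimal sketch of what such a safe, slice-based version might look like. The name `flatten_u16_slices` and the length assertion are assumptions for illustration, not part of the original code:

```rust
/// Hypothetical safe variant: takes slices instead of raw pointers,
/// so no `unsafe` is needed anywhere.
pub fn flatten_u16_slices(image: &[u16], output: &mut [u8]) {
    // Assumed contract: the output holds exactly two bytes per pixel.
    assert_eq!(output.len(), image.len() * 2);
    for (pixel, output_bytes) in image.iter().zip(output.chunks_exact_mut(2)) {
        // Write each pixel as two big-endian bytes.
        output_bytes.copy_from_slice(&pixel.to_be_bytes());
    }
}
```

The loop body is the same per-pixel byte write as above, so it would be expected to show the same code generation rather than fix it.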
The C version does what I now know is an unsafe pointer cast from u8 to u16. If I copy this in Rust, it generates the same basic assembly, which looks pretty optimal. The Rust code is:
pub fn flatten_u16_as_u16(image: *const u16, output: *mut u8, pixels: usize) {
    let image = unsafe { std::slice::from_raw_parts(image, pixels) };
    // The output as *mut u16 below is UB though I've seen no penalty on x64
    let output_u16 = unsafe { std::slice::from_raw_parts_mut(output as *mut u16, pixels) };
    for (in_pixel, out_pixel) in image.into_iter().zip(output_u16.iter_mut()) {
        *out_pixel = in_pixel.swap_bytes()
    }
}
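For context on the UB: a `&mut [u16]` must be 2-byte aligned, which a `*mut u8` does not guarantee. One sound way to do the reinterpretation is the standard `align_to_mut`, sketched below under the assumption of slice inputs. The function name and the byte-wise fallback are illustrative, and whether this matches the vectorised code would need checking in the compiler output:

```rust
/// Hypothetical variant: view the byte buffer as u16 only when it is aligned.
pub fn flatten_u16_aligned(image: &[u16], output: &mut [u8]) {
    // Assumed contract: the output holds exactly two bytes per pixel.
    assert_eq!(output.len(), image.len() * 2);
    {
        // SAFETY: every bit pattern is a valid u16, and `align_to_mut` only
        // returns the correctly aligned middle part as `&mut [u16]`.
        let (prefix, middle, suffix) = unsafe { output.align_to_mut::<u16>() };
        if prefix.is_empty() && suffix.is_empty() {
            // Fast path: the whole buffer is usable as u16.
            for (in_pixel, out_pixel) in image.iter().zip(middle.iter_mut()) {
                // `to_be` swaps bytes on little-endian targets only.
                *out_pixel = in_pixel.to_be();
            }
            return;
        }
    }
    // Fallback for a misaligned buffer: write the bytes individually.
    for (pixel, output_bytes) in image.iter().zip(output.chunks_exact_mut(2)) {
        output_bytes.copy_from_slice(&pixel.to_be_bytes());
    }
}
```

This still contains an `unsafe` block, but the alignment requirement is checked at runtime instead of assumed.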
This produces the ideal vectorised loop, which is basically:
- Read into a vector register.
- Shuffle in the register (for the endianness switch).
- Write out to memory.
The safe version, however, seems to generate extra code that extracts the u8 elements from each u16 and writes them independently. I guess that is closer to the code's semantics, but it is around 50%-100% slower.
So the challenge is, can I achieve the same performance without the UB pointer cast to u16?
Another attempt was to use std::io::Write and a cursor, but this was significantly slower.
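The cursor code is not shown in the post. As a rough sketch of what such an attempt might look like, assuming slice inputs and a `std::io::Cursor` over the output buffer (the function name is hypothetical):

```rust
use std::io::{Cursor, Write};

/// Hypothetical reconstruction of a cursor-based attempt.
pub fn flatten_u16_cursor(image: &[u16], output: &mut [u8]) -> std::io::Result<()> {
    // `Cursor<&mut [u8]>` implements `Write` and tracks the write position.
    let mut cursor = Cursor::new(output);
    for pixel in image {
        // Each call writes two bytes; it fails once the buffer is full.
        cursor.write_all(&pixel.to_be_bytes())?;
    }
    Ok(())
}
```

Each `write_all` call carries its own bounds check and error path, which is one plausible reason for a loop like this to be slower.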
EDIT: Here is a Godbolt link with the different attempts: Compiler Explorer