Zero cost abstractions are cool and I wanted to strengthen my faith in them, so I started using the excellent cargo-asm to generate human readable asm for some code I was working on.
The code uses an external function call to initialize an array of u32s, ensures none of them are 0, and returns the values as NonZeroU32s. This involves some unsafe code, and my goal was to make the Rust code use as few unsafe tricks as possible while keeping the emitted asm as close as possible to what I perceive as optimal. Disclaimer: I am very much a beginner in terms of understanding compiler optimizations and assembly, so I will mostly observe and try to limit any judging.
First attempt
Supporting code
use std::num::NonZeroU32;

// Code from https://github.com/nvzqz/static-assertions-rs.
macro_rules! assert_eq_size {
    ($x:ty, $($xs:ty),+ $(,)*) => {
        $(let _ = ::std::mem::transmute::<$x, $xs>;)+
    };
    ($label:ident; $($xs:tt)+) => {
        #[allow(dead_code, non_snake_case)]
        fn $label() { assert_eq_size!($($xs)+); }
    };
}

mod ffi {
    extern "C" {
        pub fn overwrite_values(len: i32, values: *mut u32);
    }
}

#[inline]
fn overwrite_values(values: &mut [Option<NonZeroU32>]) {
    // Compile-time check: the pointer cast below only makes sense if both types have the same size.
    assert_eq_size!(Option<NonZeroU32>, u32);
    unsafe {
        ffi::overwrite_values(values.len() as i32, values.as_mut_ptr() as *mut u32);
    }
}
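The pointer cast in overwrite_values relies on Option<NonZeroU32> having the same size as u32, with an all-zero value standing in for None. A minimal sketch of that layout assumption, runnable on its own; from_raw is a hypothetical helper that is not part of the original code:

```rust
use std::mem::{size_of, transmute};
use std::num::NonZeroU32;

/// Hypothetical helper: reinterprets a raw `u32` as `Option<NonZeroU32>`.
fn from_raw(raw: u32) -> Option<NonZeroU32> {
    // SAFETY: the standard library documents that `Option<NonZeroU32>` has the
    // same size as `u32` and that the all-zero bit pattern is `None`; every
    // non-zero `u32` is a valid `NonZeroU32`.
    unsafe { transmute::<u32, Option<NonZeroU32>>(raw) }
}
```

The safe NonZeroU32::new(raw) gives the same result for every input, which makes it a handy cross-check for the reinterpretation.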
The function of interest in Rust:
pub unsafe fn create_values_transmute() -> [NonZeroU32; 2] {
    // Reserve some uninitialized memory on the stack.
    let mut values: [Option<NonZeroU32>; 2] = ::std::mem::uninitialized();
    // Overwrite the values with unknown new values.
    overwrite_values(&mut values);
    // Ensure all values are Some.
    for value in values.iter() {
        assert!(value.is_some());
    }
    // Strip away the Option without copying.
    ::std::mem::transmute::<[Option<NonZeroU32>; 2], [NonZeroU32; 2]>(values)
}
The function of interest in assembly:
experiments_rust::create_values_transmute:
    push rax
    mov rsi, rsp
    mov edi, 2
    call overwrite_values
    cmp dword ptr [rsp], 0
    je .LBB14_2
    mov eax, dword ptr [rsp + 4]
    test eax, eax
    je .LBB14_2
    mov rax, qword ptr [rsp]
    pop rcx
    ret
.LBB14_2:
    lea rdi, [rip + .Lanon.999ddb7825e3783297e13721a9ed219b.22]
    lea rdx, [rip + .Lanon.999ddb7825e3783297e13721a9ed219b.21]
    mov esi, 33
    call std::panicking::begin_panic
    ud2
Now, something odd is happening here. The Some-checking loop is unrolled, but the first value is compared against 0 directly in memory with a single instruction, without going through a register:
cmp dword ptr [rsp], 0
but the second value is first loaded into a register and then tested, which takes two instructions:
mov eax, dword ptr [rsp + 4]
test eax, eax
It looks like the returned value is taken from the stack and placed in the rax register (mov rax, qword ptr [rsp]). With a larger array the value will most likely no longer fit in a register, so this will stop working.
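To make the single 64-bit load concrete: the two 4-byte slots sit next to each other in memory, so reading them as one little-endian qword puts the first element in the low half and the second in the high half. A small sketch that models this in plain Rust; load_as_qword is a hypothetical name used only for illustration:

```rust
use std::num::NonZeroU32;

/// Hypothetical model of `mov rax, qword ptr [rsp]`: reads the two array
/// slots as a single little-endian 64-bit value, as x86-64 would.
fn load_as_qword(values: [NonZeroU32; 2]) -> u64 {
    let mut bytes = [0u8; 8];
    // The first element occupies the lower four bytes...
    bytes[..4].copy_from_slice(&values[0].get().to_le_bytes());
    // ...and the second element the upper four bytes.
    bytes[4..].copy_from_slice(&values[1].get().to_le_bytes());
    u64::from_le_bytes(bytes)
}
```

For the values 1 and 2 this yields (2 << 32) | 1, which is why a two-element array of 32-bit values fits exactly into one 64-bit register.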
Manually unrolling the loop
What happens if we manually unroll the loop?
// Ensure all values are Some.
values[0].as_ref().unwrap();
values[1].as_ref().unwrap();
full rust
pub unsafe fn create_values_transmute_unrolled() -> [NonZeroU32; 2] {
    // Reserve some uninitialized memory on the stack.
    let mut values: [Option<NonZeroU32>; 2] = ::std::mem::uninitialized();
    // Overwrite the values with unknown new values.
    overwrite_values(&mut values);
    // Ensure all values are Some.
    values[0].as_ref().unwrap();
    values[1].as_ref().unwrap();
    // Strip away the Option without copying.
    ::std::mem::transmute::<[Option<NonZeroU32>; 2], [NonZeroU32; 2]>(values)
}
The assembly:
cmp dword ptr [rsp], 0
je .LBB14_3
cmp dword ptr [rsp + 4], 0
je .LBB14_3
full asm
experiments_rust::thingy::create_values_transmute_unrolled:
    push rax
    mov rsi, rsp
    mov edi, 2
    call overwrite_values
    cmp dword ptr [rsp], 0
    je .LBB14_3
    cmp dword ptr [rsp + 4], 0
    je .LBB14_3
    mov rax, qword ptr [rsp]
    pop rcx
    ret
.LBB14_3:
    lea rdi, [rip + .Lanon.2412bd351f75f47ca0d5104005dbeb29.2]
    call core::panicking::panic
    ud2
This time, the single instruction comparison with 0 is performed for the second value as well.
Getting rid of transmute
Because ::std::mem::transmute is so powerful, it is easy to mess something up. We can try to get rid of it by reconstructing the array from the individual values.
// Unwrap the options, hopefully it gets optimized and happens in place.
[
    values[0].take().unwrap(),
    values[1].take().unwrap(),
]
full rust
pub unsafe fn create_values_take() -> [NonZeroU32; 2] {
    // Reserve some uninitialized memory on the stack.
    let mut values: [Option<NonZeroU32>; 2] = ::std::mem::uninitialized();
    // Overwrite the values with unknown new values.
    overwrite_values(&mut values);
    // Unwrap the options, hopefully it gets optimized and happens in place.
    [
        values[0].take().unwrap(),
        values[1].take().unwrap(),
    ]
}
The assembly:
mov ecx, dword ptr [rsp]
mov dword ptr [rsp], 0
test rcx, rcx
je .LBB14_3
mov eax, dword ptr [rsp + 4]
mov dword ptr [rsp + 4], 0
test rax, rax
je .LBB14_3
full asm
experiments_rust::thingy::create_values_take:
    push rax
    mov rsi, rsp
    mov edi, 2
    call overwrite_values
    mov ecx, dword ptr [rsp]
    mov dword ptr [rsp], 0
    test rcx, rcx
    je .LBB14_3
    mov eax, dword ptr [rsp + 4]
    mov dword ptr [rsp + 4], 0
    test rax, rax
    je .LBB14_3
    shl rax, 32
    or rax, rcx
    pop rcx
    ret
.LBB14_3:
    lea rdi, [rip + .Lanon.19f1b27cb02834246aaa2ebed86c020d.2]
    call core::panicking::panic
    ud2
Unfortunately, using take means a None gets written to the stack. It does not look like the values on the stack will be accessed in the non-panic case, since the return value is constructed from the registers ecx and eax. In case of a panic though, the compiler is unable to prove that the stack is not accessed which means that it has to set the values to 0. (Please correct me if I'm wrong about any of this!)
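The write-back is visible at the language level too: Option::take leaves a None behind in the slot. Since Option<NonZeroU32> is Copy, the element can also be read without take, which leaves the slot untouched. The sketch below contrasts the two; both function names are hypothetical, and no claim is made here about the assembly either one produces:

```rust
use std::num::NonZeroU32;

/// Unwraps via `take`, which overwrites each slot with `None`.
fn unwrap_with_take(values: &mut [Option<NonZeroU32>; 2]) -> [NonZeroU32; 2] {
    [values[0].take().unwrap(), values[1].take().unwrap()]
}

/// Unwraps a copy of each slot; the array itself is not modified.
fn unwrap_with_copy(values: &[Option<NonZeroU32>; 2]) -> [NonZeroU32; 2] {
    [values[0].unwrap(), values[1].unwrap()]
}
```

After unwrap_with_take the input array holds [None, None], while unwrap_with_copy only needs a shared reference and leaves the input as it was.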
Getting rid of take
Instead of using Option::take we can use std::mem::replace in conjunction with std::mem::uninitialized to try and get the compiler to not write a 0 back to the stack.
// Unwrap the options, hopefully it gets optimized and happens in place.
[
    ::std::mem::replace(&mut values[0], ::std::mem::uninitialized()).unwrap(),
    ::std::mem::replace(&mut values[1], ::std::mem::uninitialized()).unwrap(),
]
full rust
pub unsafe fn create_values_replace() -> [NonZeroU32; 2] {
    // Reserve some uninitialized memory on the stack.
    let mut values: [Option<NonZeroU32>; 2] = ::std::mem::uninitialized();
    // Overwrite the values with unknown new values.
    overwrite_values(&mut values);
    // Unwrap the options, hopefully it gets optimized and happens in place.
    [
        ::std::mem::replace(&mut values[0], ::std::mem::uninitialized()).unwrap(),
        ::std::mem::replace(&mut values[1], ::std::mem::uninitialized()).unwrap(),
    ]
}
The assembly:
experiments_rust::thingy::create_values_replace:
jmp _ZN16experiments_rust6thingy18create_values_take17h47d8bdb4a7c32c3cE@PLT
Wait, what?! It simply jumps to the take version! What if we compile without create_values_take?
mov ecx, dword ptr [rsp]
test rcx, rcx
je .LBB14_3
mov eax, dword ptr [rsp + 4]
test rax, rax
je .LBB14_3
full asm
experiments_rust::thingy::create_values_replace:
    push rax
    mov rsi, rsp
    mov edi, 2
    call overwrite_values
    mov ecx, dword ptr [rsp]
    test rcx, rcx
    je .LBB14_3
    mov eax, dword ptr [rsp + 4]
    test rax, rax
    je .LBB14_3
    shl rax, 32
    or rax, rcx
    pop rcx
    ret
.LBB14_3:
    lea rdi, [rip + .Lanon.c043895d4cb9cfccd6c56ffc4fe98f75.2]
    call core::panicking::panic
    ud2
Woah it worked and woah that is scary! The presence of the take version affects whether or not the replace version will do what you think it will!
Conclusion
All code together: rust.godbolt.org.
- The register comparison with mov, test and the stack comparison with cmp provide equivalent functionality. We even get a mixed result with the create_values_transmute version. Why isn't one preferred over the other? Isn't it strange that the manual unrolling of the 2 item loop yields a different result?
- I'm surprised the compiler detects that create_values_replace can simply call create_values_take. This reduces code size at the cost of two mov instructions that are not necessary. It is a good reminder that ::std::mem::uninitialized() means the compiler is free to write a value there, even if you are not.
- Some other time I will investigate what happens when the return value can't be fit in registers.
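A note for readers on newer toolchains: std::mem::uninitialized has since been deprecated in favor of std::mem::MaybeUninit. Below is a rough sketch of how the first attempt might look with it. overwrite_values_stub is a hypothetical stand-in for the external function so that the example runs on its own, and the emitted assembly has not been compared with the listings above.

```rust
use std::mem::MaybeUninit;
use std::num::NonZeroU32;

/// Hypothetical stand-in for the external `overwrite_values`:
/// fills the slots with 1, 2, 3, ... so that none of them is zero.
unsafe extern "C" fn overwrite_values_stub(len: i32, values: *mut u32) {
    for i in 0..len as usize {
        // SAFETY: the caller promises `values` points to `len` writable u32 slots.
        unsafe { values.add(i).write(i as u32 + 1) };
    }
}

pub fn create_values_maybe_uninit() -> [NonZeroU32; 2] {
    // Reserve uninitialized memory on the stack without creating a value yet.
    let mut slots = MaybeUninit::<[Option<NonZeroU32>; 2]>::uninit();
    let values = unsafe {
        // SAFETY: the stub writes both 4-byte slots, and every u32 bit pattern
        // is a valid Option<NonZeroU32> (zero is None), so the array is fully
        // initialized before assume_init is called.
        overwrite_values_stub(2, slots.as_mut_ptr().cast::<u32>());
        slots.assume_init()
    };
    // Strip away the Option; panics if a slot was left as zero.
    [
        values[0].expect("first value was zero"),
        values[1].expect("second value was zero"),
    ]
}
```

Compared with the versions above, the unsafe part shrinks to the FFI call and assume_init, and the function itself no longer needs to be marked unsafe.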