Zero cost abstractions are cool and I wanted to strengthen my faith in them, so I started using the excellent cargo-asm to generate human readable asm for some code I was working on.
The code uses an external function call to initialize an array of u32s, ensures none of them are 0, and returns the values as NonZeroU32s. This involves some unsafe code, and my goal was to make the Rust code use as few unsafe tricks as possible while keeping the emitted asm as close as possible to what I perceive as optimal. Disclaimer: I am very much a beginner when it comes to understanding compiler optimizations and assembly, so I will mostly observe and try to limit any judging.
First attempt
Supporting code
use std::num::NonZeroU32;

// Code from https://github.com/nvzqz/static-assertions-rs.
macro_rules! assert_eq_size {
    ($x:ty, $($xs:ty),+ $(,)*) => {
        $(let _ = ::std::mem::transmute::<$x, $xs>;)+
    };
    ($label:ident; $($xs:tt)+) => {
        #[allow(dead_code, non_snake_case)]
        fn $label() { assert_eq_size!($($xs)+); }
    };
}

mod ffi {
    extern "C" {
        pub fn overwrite_values(len: i32, values: *mut u32);
    }
}

#[inline]
fn overwrite_values(values: &mut [Option<NonZeroU32>]) {
    assert_eq_size!(Option<NonZeroU32>, u32);
    unsafe {
        ffi::overwrite_values(values.len() as i32, values.as_mut_ptr() as *mut u32);
    }
}
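The size assertion above relies on Option<NonZeroU32> being exactly as large as u32, with the value 0 standing in for None. A minimal sketch of that property, using a compile-time assert! instead of the macro (the helper name option_from_raw is made up for illustration):

```rust
use std::mem::size_of;
use std::num::NonZeroU32;

// Compile-time check: the build fails if the two sizes ever differ.
const _: () = assert!(size_of::<Option<NonZeroU32>>() == size_of::<u32>());

// Hypothetical helper: maps a raw u32 to the Option it corresponds to.
// 0 becomes None, every other value becomes Some.
pub fn option_from_raw(raw: u32) -> Option<NonZeroU32> {
    NonZeroU32::new(raw)
}
```

This is why the FFI function can write plain u32 values into a slice of Option<NonZeroU32>: a zero it writes simply reads back as None.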
The function of interest in rust:
pub unsafe fn create_values_transmute() -> [NonZeroU32; 2] {
    // Reserve some uninitialized memory on the stack.
    let mut values: [Option<NonZeroU32>; 2] = ::std::mem::uninitialized();
    // Overwrite the values with unknown new values.
    overwrite_values(&mut values);
    // Ensure all values are Some.
    for value in values.iter() {
        assert!(value.is_some());
    }
    // Strip away the Option without copying.
    ::std::mem::transmute::<[Option<NonZeroU32>; 2], [NonZeroU32; 2]>(values)
}
The function of interest in assembly:
experiments_rust::create_values_transmute:
    push rax
    mov rsi, rsp
    mov edi, 2
    call overwrite_values
    cmp dword ptr [rsp], 0
    je .LBB14_2
    mov eax, dword ptr [rsp + 4]
    test eax, eax
    je .LBB14_2
    mov rax, qword ptr [rsp]
    pop rcx
    ret
.LBB14_2:
    lea rdi, [rip + .Lanon.999ddb7825e3783297e13721a9ed219b.22]
    lea rdx, [rip + .Lanon.999ddb7825e3783297e13721a9ed219b.21]
    mov esi, 33
    call std::panicking::begin_panic
    ud2
Now, something odd is happening here. The Some checking loop is unrolled, but the first value is compared with a single instruction that does not use a register:
cmp dword ptr [rsp], 0
while the second value is compared with two instructions:
mov eax, dword ptr [rsp + 4]
test eax, eax
It looks like the returned value is taken from the stack and placed in the rax register (mov rax, qword ptr [rsp]). With a larger array the return value will most likely no longer fit in a register, so this will not work the same way.
Manually unrolling the loop
What happens if we manually unroll the loop?
// Ensure all values are Some.
values[0].as_ref().unwrap();
values[1].as_ref().unwrap();
Full Rust:
pub unsafe fn create_values_transmute_unrolled() -> [NonZeroU32; 2] {
    // Reserve some uninitialized memory on the stack.
    let mut values: [Option<NonZeroU32>; 2] = ::std::mem::uninitialized();
    // Overwrite the values with unknown new values.
    overwrite_values(&mut values);
    // Ensure all values are Some.
    values[0].as_ref().unwrap();
    values[1].as_ref().unwrap();
    // Strip away the Option without copying.
    ::std::mem::transmute::<[Option<NonZeroU32>; 2], [NonZeroU32; 2]>(values)
}
The assembly:
cmp dword ptr [rsp], 0
je .LBB14_3
cmp dword ptr [rsp + 4], 0
je .LBB14_3
Full asm:
experiments_rust::thingy::create_values_transmute_unrolled:
    push rax
    mov rsi, rsp
    mov edi, 2
    call overwrite_values
    cmp dword ptr [rsp], 0
    je .LBB14_3
    cmp dword ptr [rsp + 4], 0
    je .LBB14_3
    mov rax, qword ptr [rsp]
    pop rcx
    ret
.LBB14_3:
    lea rdi, [rip + .Lanon.2412bd351f75f47ca0d5104005dbeb29.2]
    call core::panicking::panic
    ud2
This time, the single instruction comparison with 0 is performed for the second value as well.
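The check-then-transmute pattern used so far can also be packaged as a small helper that returns None instead of panicking. This is only a sketch: strip_options is a made-up name, and it takes the array by value rather than filling it through the external function, so the asm it produces may well differ from the listings here.

```rust
use std::num::NonZeroU32;

// Hypothetical helper: strips the Option only after checking every element.
pub fn strip_options(values: [Option<NonZeroU32>; 2]) -> Option<[NonZeroU32; 2]> {
    if values.iter().all(Option::is_some) {
        // SAFETY: every element is Some, and a Some(NonZeroU32) has the same
        // representation as the NonZeroU32 it holds, so the array can be
        // reinterpreted without touching the bytes.
        Some(unsafe {
            std::mem::transmute::<[Option<NonZeroU32>; 2], [NonZeroU32; 2]>(values)
        })
    } else {
        None
    }
}
```

Keeping the check and the transmute in one function means the unsafe block can never be reached with a None in the array.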
Getting rid of transmute
Because ::std::mem::transmute is so powerful, it is easy to mess something up. We can try to get rid of it by reconstructing the array from the individual values.
// Unwrap the options, hopefully it gets optimized and happens in place.
[
    values[0].take().unwrap(),
    values[1].take().unwrap(),
]
Full Rust:
pub unsafe fn create_values_take() -> [NonZeroU32; 2] {
    // Reserve some uninitialized memory on the stack.
    let mut values: [Option<NonZeroU32>; 2] = ::std::mem::uninitialized();
    // Overwrite the values with unknown new values.
    overwrite_values(&mut values);
    // Unwrap the options, hopefully it gets optimized and happens in place.
    [
        values[0].take().unwrap(),
        values[1].take().unwrap(),
    ]
}
The assembly:
mov ecx, dword ptr [rsp]
mov dword ptr [rsp], 0
test rcx, rcx
je .LBB14_3
mov eax, dword ptr [rsp + 4]
mov dword ptr [rsp + 4], 0
test rax, rax
je .LBB14_3
Full asm:
experiments_rust::thingy::create_values_take:
    push rax
    mov rsi, rsp
    mov edi, 2
    call overwrite_values
    mov ecx, dword ptr [rsp]
    mov dword ptr [rsp], 0
    test rcx, rcx
    je .LBB14_3
    mov eax, dword ptr [rsp + 4]
    mov dword ptr [rsp + 4], 0
    test rax, rax
    je .LBB14_3
    shl rax, 32
    or rax, rcx
    pop rcx
    ret
.LBB14_3:
    lea rdi, [rip + .Lanon.19f1b27cb02834246aaa2ebed86c020d.2]
    call core::panicking::panic
    ud2
Unfortunately, using take means a None gets written to the stack. It does not look like the values on the stack will be accessed in the non-panic case, since the return value is constructed from the registers ecx and eax. In case of a panic though, the compiler is unable to prove that the stack is not accessed, which means that it has to set the values to 0. (Please correct me if I'm wrong about any of this!)
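The write of 0 follows directly from what Option::take does: it hands out the old value and leaves None in its place. A minimal sketch in safe code (the helper name take_and_inspect is made up for illustration):

```rust
use std::num::NonZeroU32;

// Hypothetical helper: returns what take() handed out and what it left behind.
pub fn take_and_inspect(raw: u32) -> (Option<NonZeroU32>, Option<NonZeroU32>) {
    let mut slot = NonZeroU32::new(raw);
    // take() moves the value out of `slot` and stores None there.
    let taken = slot.take();
    (taken, slot)
}
```

Since None for an Option<NonZeroU32> is represented as 0, storing it is the mov of 0 to the stack slot seen in the assembly.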
Getting rid of take
Instead of using Option::take we can use std::mem::replace in conjunction with std::mem::uninitialized to try and get the compiler to not write a 0 back to the stack.
// Unwrap the options, hopefully it gets optimized and happens in place.
[
    ::std::mem::replace(&mut values[0], ::std::mem::uninitialized()).unwrap(),
    ::std::mem::replace(&mut values[1], ::std::mem::uninitialized()).unwrap(),
]
Full Rust:
pub unsafe fn create_values_replace() -> [NonZeroU32; 2] {
    // Reserve some uninitialized memory on the stack.
    let mut values: [Option<NonZeroU32>; 2] = ::std::mem::uninitialized();
    // Overwrite the values with unknown new values.
    overwrite_values(&mut values);
    // Unwrap the options, hopefully it gets optimized and happens in place.
    [
        ::std::mem::replace(&mut values[0], ::std::mem::uninitialized()).unwrap(),
        ::std::mem::replace(&mut values[1], ::std::mem::uninitialized()).unwrap(),
    ]
}
The assembly:
experiments_rust::thingy::create_values_replace:
jmp _ZN16experiments_rust6thingy18create_values_take17h47d8bdb4a7c32c3cE@PLT
Wait, what?! It simply calls the take version! What if we compile without create_values_take?
mov ecx, dword ptr [rsp]
test rcx, rcx
je .LBB14_3
mov eax, dword ptr [rsp + 4]
test rax, rax
je .LBB14_3
Full asm:
experiments_rust::thingy::create_values_replace:
    push rax
    mov rsi, rsp
    mov edi, 2
    call overwrite_values
    mov ecx, dword ptr [rsp]
    test rcx, rcx
    je .LBB14_3
    mov eax, dword ptr [rsp + 4]
    test rax, rax
    je .LBB14_3
    shl rax, 32
    or rax, rcx
    pop rcx
    ret
.LBB14_3:
    lea rdi, [rip + .Lanon.c043895d4cb9cfccd6c56ffc4fe98f75.2]
    call core::panicking::panic
    ud2
Woah it worked and woah that is scary! The presence of the take version affects whether or not the replace version will do what you think it will!
Conclusion
All code together: rust.godbolt.org.
- The register comparison with mov, test and the stack comparison with cmp provide equivalent functionality. We even get a mixed result with the create_values_transmute version. Why isn't one preferred over the other? Isn't it strange that the manual unrolling of the 2 item loop yields a different result?
- I'm surprised the compiler detects that create_values_replace can simply call create_values_take. Doing so reduces code size at the cost of two mov instructions that are not strictly necessary. It is a good reminder that ::std::mem::uninitialized() means the compiler is free to write a value there, even if you are not.
- Some other time I will investigate what happens when the return value can't be fit in registers.
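On the std::mem::uninitialized() point: that function has since been deprecated in favor of std::mem::MaybeUninit. Below is a minimal sketch of how the same function might look with it. It assumes a Rust stand-in for the external function (overwrite_values_stub is invented here so the example runs without any C code), and what asm it generates is not examined here.

```rust
use std::mem::MaybeUninit;
use std::num::NonZeroU32;

// Hypothetical stand-in for the external C function: writes 1, 2, 3, ...
unsafe extern "C" fn overwrite_values_stub(len: i32, values: *mut u32) {
    for i in 0..len as usize {
        // Write without reading the (uninitialized) previous contents.
        unsafe { values.add(i).write(i as u32 + 1) };
    }
}

pub fn create_values_maybe_uninit() -> [NonZeroU32; 2] {
    // Reserve uninitialized stack memory without ever naming an invalid value.
    let mut buf = MaybeUninit::<[u32; 2]>::uninit();
    // SAFETY: the stub writes exactly 2 u32 values, which initializes the
    // whole array before assume_init() reads it.
    let raw: [u32; 2] = unsafe {
        overwrite_values_stub(2, buf.as_mut_ptr() as *mut u32);
        buf.assume_init()
    };
    // Convert each value, panicking on 0 like the unwrap() versions do.
    [
        NonZeroU32::new(raw[0]).expect("value 0 was zero"),
        NonZeroU32::new(raw[1]).expect("value 1 was zero"),
    ]
}
```

Working on plain u32 values and converting at the end avoids both transmute and the Option detour, at the price of rebuilding the array by hand.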