Prevent Rust from optimizing my code away!

Hey everyone!

I could use some help preventing Rust from optimizing away a function call whose CPU clock cycles I want to measure. I have the following code to measure the number of clock cycles it takes to perform a single round of AES encryption:

#[inline]
unsafe fn measure_cycles() -> __m128i {
    let mut state = core::arch::x86_64::_mm_set1_epi8(1);
    let key = [1u8; 16];
    let roundkeys = expand_key(&key);

    state = _mm_xor_si128(state, roundkeys[0]);
    
    // This is the important part
    let pre = rdtsc();
    for _i in 0..100_000 {
        // Rust keeps optimizing this call away
        state = core::arch::x86_64::_mm_aesenc_si128(state, roundkeys[1]);
    }
    let post = rdtsc();
    // Important part ends here
    println!("{}", (post - pre) / 100_000);

    state
}

fn main() {
    unsafe {
        let state = measure_cycles();
        println!("{:#?}", state);
    }
}

However, when I run the code under the release profile, Rust keeps optimizing away the call to core::arch::x86_64::_mm_aesenc_si128. I know this because I compared the generated assembly for debug and release: debug shows the call, while release removes it. I even tried using std::hint::black_box, but it did not help at all. So I am at my wits' end here.

Any hint or idea is appreciated!

What happens if you try to print the state after the loop? E.g.:

println!("{:?}", state);

Sorry, I should probably have included main() as well: I am already doing that with the return value, where the function is called in main().

You could just ask Intel:

image

If the compiler is pre-computing your AES results, it might be that you're black_boxing the output of the algorithm, but not the input:

state = core::arch::x86_64::_mm_aesenc_si128(std::hint::black_box(state), roundkeys[1]);
                                             +++++++++++++++++++++     +

I tried this as well, with the same results.

Well, that makes my work a lot easier. However, I am still curious how to prevent Rust from optimizing this away.

Instead of using the core::arch::x86_64::_mm_aesenc_si128() intrinsic, could you use inline assembly to execute the instruction?

LLVM understands the intrinsic functions because they correspond to LLVM operations, whereas inline assembly is relatively opaque to the optimiser.
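For illustration, here is a minimal sketch of what the inline-assembly version could look like, assuming an x86_64 CPU with AES-NI. The helper names aesenc_asm and chain_rounds are made up for this example and are not taken from the code that ended up being used in this thread.

```rust
use core::arch::asm;
use core::arch::x86_64::__m128i;

/// Hypothetical helper: one AES round issued through inline assembly
/// instead of the `_mm_aesenc_si128` intrinsic.
///
/// Safety: the caller must make sure the CPU supports AES-NI.
#[target_feature(enable = "aes")]
unsafe fn aesenc_asm(mut state: __m128i, round_key: __m128i) -> __m128i {
    unsafe {
        asm!(
            // Intel syntax: state = AESENC(state, round_key)
            "aesenc {state}, {key}",
            state = inout(xmm_reg) state,
            key = in(xmm_reg) round_key,
            // `pure` is left out on purpose: the block then counts as having
            // side effects, so the compiler may not drop or merge it.
            options(nomem, nostack),
        );
    }
    state
}

/// Hypothetical helper: chains `rounds` AES rounds so that every
/// iteration consumes the output of the previous one.
///
/// Safety: the caller must make sure the CPU supports AES-NI.
#[target_feature(enable = "aes")]
unsafe fn chain_rounds(mut state: __m128i, round_key: __m128i, rounds: u32) -> __m128i {
    for _ in 0..rounds {
        state = unsafe { aesenc_asm(state, round_key) };
    }
    state
}
```

In the original measure_cycles(), the loop body would then call aesenc_asm(state, roundkeys[1]) between the two rdtsc() reads. Checking std::arch::is_x86_feature_detected!("aes") before calling either helper is a reasonable way to uphold the safety requirement.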

I will try it out!

This did the trick. Thank you very much!

Maybe you need to black_box() twice, the argument (to prevent the compiler from constant-folding your code) and the result (to prevent it from DCE'ing it)?
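A rough sketch of that double black_box() pattern, using a made-up integer workload (mix) in place of the AES round so it runs on any target. Note that std::hint::black_box is documented as best-effort only, and earlier in this thread wrapping the input alone did not help with the AES intrinsic, so treat this as the general shape rather than a guaranteed fix.

```rust
use std::hint::black_box;

/// Hypothetical stand-in workload: a cheap integer mixing step.
fn mix(x: u64) -> u64 {
    x.wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407)
}

/// Runs `mix` `rounds` times, wrapping both the argument and the result
/// of every call in `black_box`.
fn run_opaque(seed: u64, rounds: u32) -> u64 {
    let mut state = seed;
    for _ in 0..rounds {
        // Inner black_box: hide the argument to discourage constant folding.
        // Outer black_box: mark the result as used to discourage dead-code
        // elimination.
        state = black_box(mix(black_box(state)));
    }
    state
}
```

black_box returns its argument unchanged, so run_opaque computes the same value as the plain loop would. Only what the optimizer is allowed to assume changes.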

To ensure a piece of code is not optimized away, it must receive at run-time some value unknown to the compiler, and it must emit a value that depends on that received value. For example, get a value from the command line, use that value in your code to generate another value, and use that generated value as the exit status of the program. If your code is a loop, each iteration should process a different area of memory, or the output of each iteration should be passed as input to the next iteration.
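The recipe above could be sketched roughly like this. The step function is a made-up placeholder for the code under test, and status_from_args is a hypothetical helper name. The seed comes from the first command-line argument, each iteration feeds the next, and the low byte of the result is meant to become the exit status.

```rust
/// Hypothetical stand-in for the code under test.
fn step(x: u64) -> u64 {
    x.rotate_left(7) ^ 0x9E37_79B9_7F4A_7C15
}

/// Passes the output of every iteration as input to the next one.
fn chain(seed: u64, rounds: u32) -> u64 {
    let mut state = seed;
    for _ in 0..rounds {
        state = step(state);
    }
    state
}

/// Reads the seed from the first argument (falling back to 1 if it is
/// missing or not a number) and reduces the result to one byte.
///
/// Assumed wiring, not part of this block:
///   fn main() -> std::process::ExitCode {
///       std::process::ExitCode::from(status_from_args(std::env::args().skip(1)))
///   }
fn status_from_args(mut args: impl Iterator<Item = String>) -> u8 {
    let seed: u64 = args.next().and_then(|s| s.parse().ok()).unwrap_or(1);
    let result = chain(seed, 100_000);
    // Keep only the low byte, since exit statuses are small integers.
    (result & 0xFF) as u8
}
```

Because the seed is only known at run-time and the exit status depends on the final state, the compiler has to keep the chain of step calls. Whether it also keeps them in the exact form you want to time is still worth checking in the generated assembly.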
