Execution time depends on unused function

Hi everyone,

I'm am currently working on rewriting an old c program, which calculates the Gauss-Seidel and Jacobi algorithms, to Rust. This worked actually pretty well, but I ran into some confusing behavior regarding the execution time.

I reduced my program to something like this: (full source code)

fn main(){
    if true {
        println!("HERE");
        calculate(&..);
    }else{
        println!("THERE");
        calculate_parallel(&..);
    }
}

normally the program ran in ~22 sec for my default parameters, but with the parallel method in place, the execution time is doubled even though the function is not being executed (the condition is true all the time). This behavior is resolved when the parallel function is commented out.

I tested this on Windows and Linux (both on stable), with the same results. Somehow it is connected to the compiler optimization since the runtime doesn't double without --release. Manual inlining didn't change that.

I tested this with other functions, but there was no behavior like this. So I don't really understand what the compiler is doing with my code, and how to resolve this.

Any suggestion will be appreciated

What happens if you put unreachable!() into the false arm?

I’ve seen changes like this (not necessarily in Rust) make a difference either because the optimizer is thrown off and does something suboptimal or the code layout change affects memory addresses such that some stuff (instructions and/or data) alias in the CPU cache and cache hit rate goes down.

Looking at the assembly for two cases may suggest what the issue is.

1 Like

Wow, this actually worked. The execution time is normal with unreachable!() ...

Not too surprising :slight_smile: - the unreachable!() is similar to commenting out the code. I just wanted you to try it for sanity checking purposes.

Another thing you can try is to mark calculate and calculate_parallel with #[inline(never)], and remove the unreachable (and leave calculate_parallel uncommented).

The better way to analyze this, as mentioned, is to look at the assembly for both cases. But the above outlining is an interesting thing to try.

I already tried #[inline(never)] but this had no effect. The assembly code is an unreadable beast of code even with #[no_mangel] this will take a while to go through, but thanks.

Can you paste (or link to a gist) the assembly for the slow and fast versions (both in release/optimized builds, of course)? I've no doubt it's going to be a mess to eyeball but there might be some noticeable differences without necessarily following the entire thing.

You can also run both versions under perf (if you're on Linux) and see what it reports.

Ok,

So the assembler code is here

The code of the faster version is half the size of the slower version. (7k to 20k), which would explain that the slower version needs twice as much time. This can also be seen in the perf stats.

My understanding so far is that the calculate method needs to be inlined to be able to be fast, with #[inline(never)] or #[no_mangle] it will also be slow. On the other hand, #[inline(never)] or #[inline(always)] doesn't affect the parallel method and the execution time of the serial method is still slower than without calling the method in the else block.

But I didn't find a way to call calculate_paralle while maintaining the execution time of the serial method. Even using two different matrices for calculate and calculate_parallel doesn't help.

So calculate needs to be inlined into main to be fast? Interesting - implies that compiler is able to propagate some knowledge from main into that function (and whatever it inlines there) that greatly reduces code.

Have you tried putting #[inline(always)] on calculate?

jep, I did, but if I uncomment calculate_parallel then it is back to the same old slow behavior.

My assumption that calculate is inlined is based the fact that the execution time was slow when I compiled it with #[inline(never)]

So I actually found the line of code which is the root this problem. In calculate_paralle (
calculate::parallel::calc_get_with_threadpool in my original code) I am initializing another matrix (parallel.rs:376) for some inefficient clone operation. Commenting out this line and all lines which are operating on the new matrix, the program runs fast again but returns a wrong result. Not sure if this information is useful in some kind but it 's there now.

Does anyone other than me think that it would be a good idea to bring this discussion to the internals forum or to the rustc issue tracker?

Yes - I think a github issue might be warranted. It might be some LLVM problem.

@Gold3nCooki3, have you tried this on the nightly compiler? I'm assuming you've been using the stable one thus far.

1 Like

Yes, I tried it on nightly, with the same result.