How to profile CPU time by function?

I have a long-running numerical simulation that I'm trying to speed up. I don't have much experience with profiling tools, but it appears that "perf" is a popular choice for profiling CPU usage on Linux.

Using "perf record my_rust_binary" and then "perf report", I am able to get some measurements for my program, but they're not useful: it just says that 99.99% of the time is spent in "main", with no breakdown for any of the functions that actually do the work.

What do I need to do to get a profile to show that in the following code, "g1" is taking 10% of the CPU time and "g2" is taking 90% of the CPU time?

fn main() {
    println!("{}", f());
}

fn f() -> f64 {
    (0..10000).map(|i| if i % 2 == 0 { g1(i) } else { g2(i) }).sum()
}

fn g1(i: usize) -> f64 {
    let mut a = [i as f64; 100];
    for _ in 0..1000 {
        a.iter_mut().for_each(|x| *x = *x / 1.1);
    }
    a.iter().sum()
}

fn g2(i: usize) -> f64 {
    let mut a = [i as f64; 100];
    for _ in 0..9000 {
        a.iter_mut().for_each(|x| *x = *x / 1.1);
    }
    a.iter().sum()
}

This implies that all the other functions were either inlined into main, or simply did not take measurable time.

You can prevent inlining of specific functions with the #[inline(never)] attribute. Note that this only prevents that particular function from being inlined into its callers; other functions may still be inlined into it.
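
Applied to the example above, it would look like this (a minimal sketch; the attribute goes on every function you want to see as a separate entry in the profile, so g2 would get the same treatment):

#[inline(never)]
fn g1(i: usize) -> f64 {
    let mut a = [i as f64; 100];
    for _ in 0..1000 {
        a.iter_mut().for_each(|x| *x = *x / 1.1);
    }
    a.iter().sum()
}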

Also, take a look at flame and flamer, which I discovered recently:

https://github.com/TyOverby/flame

https://github.com/llogiq/flamer

Thanks. Using #[inline(never)] gets perf to pick up both g1 and g2.

Is there a way to disable inlining for an entire crate? I have ~100 functions that I'd want included when profiling, and I don't want to add and remove 100 annotations every time I profile.

As a sledgehammer, you can disable all inlining by running a debug build 🙂 But that comes with the same drawbacks as running a debug build in general: the performance you measure is not at all representative of your application's actual performance profile.
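
Concretely, that just means building without --release (the default dev profile uses opt-level 0, which does essentially no inlining) and pointing perf at the debug binary; using the binary name from your earlier perf invocation:

cargo build
perf record ./target/debug/my_rust_binary
perf report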

Yeah, I would need to still use a release build. Many of the linear algebra calculations need optimizations enabled to get reasonable performance.

You could make it a Cargo.toml feature, then annotate each function with:
#[cfg_attr(feature = "noinline", inline(never))]
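
A minimal sketch of how the pieces fit together, with "noinline" as an arbitrary feature name matching the attribute above. In Cargo.toml:

[features]
noinline = []

Then enable it only when you want to profile:

cargo build --release --features noinline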

If you build in release mode with debuginfo, I think perf call graphs can still report inlined calls.
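
For example (a sketch: debug = true in the release profile keeps optimizations on but emits DWARF debuginfo, which perf's --call-graph dwarf mode can use). In Cargo.toml:

[profile.release]
debug = true

Then:

cargo build --release
perf record --call-graph dwarf ./target/release/my_rust_binary
perf report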

By the way, if your real program contains anything like these loops that divide each element of an array by a constant, you may want to precompute the inverse (1/1.1) and multiply by that instead of dividing on each loop iteration (see the sketch after this list). Two reasons:

  • Divides are much slower than multiplies on modern CPUs. A typical CPU can issue 1 or 2 multiply instructions per clock cycle, whereas a single divide instruction takes on the order of 10 or more cycles. This is linked to the fact that division is much harder to pipeline and parallelize in hardware.
  • Modern CPUs have fused multiply-add instructions, which give you one "free" addition for the price of one multiplication in suitable code (like, say, a dot product). You do not get that with divisions.
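
A minimal sketch of that transformation applied to the loop from the question (note that multiplying by a precomputed reciprocal can change the last few bits of the floating-point result, so check that the accuracy is still acceptable for your simulation):

fn g1(i: usize) -> f64 {
    // Hoist the reciprocal out of the loop: one divide instead of 100,000.
    let inv = 1.0 / 1.1;
    let mut a = [i as f64; 100];
    for _ in 0..1000 {
        a.iter_mut().for_each(|x| *x *= inv);
    }
    a.iter().sum()
}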

Of course, in your specific case there are even more optimizations available (precomputing (1/1.1)^1000 and (1/1.1)^9000 instead of multiplying/dividing in a loop, for example), but those are probably artifacts of the artificial nature of the benchmark.