I have a long-running numerical simulation that I'm trying to speed up. I don't have much experience with profiling tools, but it appears that "perf" is a popular choice for profiling CPU usage on Linux.
Using "perf record my_rust_binary" and then "perf report", I am able to get some measurements for my program, but they're useless. It just says that 99.99% of the time is spent in "main". There's no breakdown for any of the functions that actually do the work.
What do I need to do to get a profile to show that in the following code, "g1" is taking 10% of the CPU time and "g2" is taking 90% of the CPU time?
fn main() {
    println!("{}", f());
}

fn f() -> f64 {
    (0..10000).map(|i| if i % 2 == 0 { g1(i) } else { g2(i) }).sum()
}

fn g1(i: usize) -> f64 {
    let mut a = [i as f64; 100];
    for _ in 0..1000 {
        a.iter_mut().for_each(|x| *x = *x / 1.1);
    }
    a.iter().sum()
}

fn g2(i: usize) -> f64 {
    let mut a = [i as f64; 100];
    for _ in 0..9000 {
        a.iter_mut().for_each(|x| *x = *x / 1.1);
    }
    a.iter().sum()
}
This implies that all other functions were either inlined, or just did not take measurable time.
You can prevent inlining on specific functions with the #[inline(never)] attribute. This just prevents that particular function from being inlined into its callers, but other functions may still be inlined into that one.
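For illustration, here is a minimal sketch of the code from the question with the attribute applied to g1 and g2. The only assumptions are the parameter n on f and the smaller iteration count in main, which are there just to keep the sketch quick to run; the thread's version iterates over 0..10000.

```rust
// Same work as g1/g2 in the question; only the attributes are new.
// #[inline(never)] keeps each function as a separate symbol, so a sampling
// profiler such as perf can attribute samples to it by name.
#[inline(never)]
fn g1(i: usize) -> f64 {
    let mut a = [i as f64; 100];
    for _ in 0..1000 {
        a.iter_mut().for_each(|x| *x /= 1.1);
    }
    a.iter().sum()
}

#[inline(never)]
fn g2(i: usize) -> f64 {
    let mut a = [i as f64; 100];
    for _ in 0..9000 {
        a.iter_mut().for_each(|x| *x /= 1.1);
    }
    a.iter().sum()
}

// `n` is a hypothetical parameter added for this sketch.
// f itself is not annotated, so it may still be inlined into main.
fn f(n: usize) -> f64 {
    (0..n).map(|i| if i % 2 == 0 { g1(i) } else { g2(i) }).sum()
}

fn main() {
    // Raise this count to get enough samples for a meaningful profile.
    println!("{}", f(200));
}
```

The closures passed to for_each can still be inlined into g1 and g2, which is usually what you want: the time shows up under the function that owns the loop.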
Also, look at flame and flamer, which I discovered recently here.
https://github.com/TyOverby/flame
https://github.com/llogiq/flamer
Thanks. Using #[inline(never)] gets perf to pick up both g1 and g2.
Is there a way to disable inlining for an entire crate? I have ~100 functions that I would want to be considered when profiling. I don't want to add and remove 100 annotations every time I want to profile.
As a sledgehammer, you can disable all inlining by running a debug build. But it comes with the same drawbacks as running a debug build in general: the performance that you get is not at all representative of your application's actual performance profile.
Yeah, I would need to still use a release build. Many of the linear algebra calculations need optimizations enabled to get reasonable performance.
You could make it a Cargo.toml feature, then annotate each with:
#[cfg_attr(feature = "noinline", inline(never))]
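A rough sketch of how that could look in practice. The feature name "noinline" comes from the attribute above; the Cargo.toml fragment and build commands in the comments are assumptions about how the crate would be set up, and the main function is only there to make the sketch runnable.

```rust
// Assumed Cargo.toml entry declaring the feature:
//
//     [features]
//     noinline = []
//
// Profiling build:  cargo build --release --features noinline
// Normal build:     cargo build --release

// With the feature enabled this expands to #[inline(never)];
// without it the attribute disappears and the optimizer is free to inline.
#[cfg_attr(feature = "noinline", inline(never))]
fn g1(i: usize) -> f64 {
    let mut a = [i as f64; 100];
    for _ in 0..1000 {
        a.iter_mut().for_each(|x| *x /= 1.1);
    }
    a.iter().sum()
}

#[cfg_attr(feature = "noinline", inline(never))]
fn g2(i: usize) -> f64 {
    let mut a = [i as f64; 100];
    for _ in 0..9000 {
        a.iter_mut().for_each(|x| *x /= 1.1);
    }
    a.iter().sum()
}

fn main() {
    // Results are identical with and without the feature; only inlining changes.
    println!("{}", g1(2) + g2(3));
}
```

The annotations stay in the source permanently, so switching between a profiling build and a normal build is just a matter of the --features flag.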
If you build in release mode with debuginfo (for example, debug = true under [profile.release] in Cargo.toml), I think perf callgraphs can still report inlined calls.
By the way, if you have anything like these loops that divide each element of an array by a constant number in your program, you may want to precompute the inverse (1/1.1) and multiply by that instead of doing a division on each loop iteration. Two reasons:
- Divides are much slower than multiplies on modern CPUs. A typical CPU can start 1 or 2 multiply instructions per clock cycle, whereas a divide instruction takes on the order of 10 clock cycles. This is linked to the fact that divisions are harder to parallelize in hardware.
- Modern CPUs have fused multiply-add instructions, which give you one "free" addition for the price of one multiplication in suitable code (like, say, a dot product). You do not get that with divisions.
Of course, in your specific case, there are even more optimizations that can be carried out (precomputing (1/1.1)^1000 and (1/1.1)^9000 instead of multiplying/dividing in a loop for example), but these are likely to be linked to the artificial nature of the benchmark.
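To make the reciprocal suggestion concrete, here is a small sketch. The function names scale_by_division and scale_by_reciprocal are made up for this example. Note that multiplying by a precomputed 1/1.1 is not bit-for-bit identical to dividing by 1.1, since 1/1.1 is itself rounded, so the sketch compares the two with a tolerance rather than for equality.

```rust
/// Divide every element by `d`, one division per element.
fn scale_by_division(a: &mut [f64], d: f64) {
    a.iter_mut().for_each(|x| *x /= d);
}

/// Same scaling, but the reciprocal is computed once and each element
/// is multiplied by it instead of being divided.
fn scale_by_reciprocal(a: &mut [f64], d: f64) {
    let inv = 1.0 / d; // the only division in this function
    a.iter_mut().for_each(|x| *x *= inv);
}

fn main() {
    let mut by_div = [3.0_f64; 100];
    let mut by_mul = by_div;

    for _ in 0..1000 {
        scale_by_division(&mut by_div, 1.1);
        scale_by_reciprocal(&mut by_mul, 1.1);
    }

    let s_div: f64 = by_div.iter().sum();
    let s_mul: f64 = by_mul.iter().sum();

    // The results agree closely, but rounding differs slightly on each step.
    let rel_diff = ((s_div - s_mul) / s_div).abs();
    println!("division: {s_div:e}, reciprocal: {s_mul:e}, relative difference: {rel_diff:e}");
}
```

Whether the small rounding difference matters depends on the simulation, so it is worth checking results against the division version before switching.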