Profilers and how to interpret results on recursive functions

If you don't recurse too deeply and have a new enough CPU + Linux kernel combination on your hands, here's how I would have a go at analyzing this using my current favorite performance analysis toy, Linux perf:

  • Install the perf profiler (package name varies by distro, e.g. "perf" or "linux-tools-common").
  • Tell cargo that we're going to need debugging symbols. There are various ways to do this; a minimally invasive one is to run export RUSTFLAGS='-g' in the shell that you are going to use for the remainder of this exercise.
  • Compile your example with those debugging symbols: cargo build --release
  • Profile with LBR[1] call graphs: perf record --call-graph=lbr cargo run --release. You don't need to let the example run for minutes; profiling a few seconds of execution is sufficient for basic analysis, and perf will record the data even if you kill the application with Ctrl+C before the end.
  • Start the profile analysis UI: perf report. (The full command sequence is recapped right after this list.)
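Putting the whole recipe together, a session in your shell would look roughly like this (assuming you run everything from the root of your Cargo project):

    # Keep debug symbols in the release build so perf can name your functions
    export RUSTFLAGS='-g'
    cargo build --release

    # Record an LBR call graph while the program runs; Ctrl+C after a few
    # seconds of execution is enough to get usable data
    perf record --call-graph=lbr cargo run --release

    # Open the interactive analysis UI on the recorded perf.data
    perf report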

perf report uses a fancy ncurses-based UI that works best on large terminals, so you may want to maximize your terminal window. On the left of the main view there are two columns: the first is the fraction of time spent in a function or in one of the functions that it calls ("children time"), and the second is the fraction of time spent in the function itself ("self time"). The remaining columns identify the function in question.

In general, a good first approach is to look at functions from top to bottom (i.e. in order of decreasing children time), starting from the first function that you wrote (i.e. ignoring the Rust implementation scaffolding) and paying special attention to functions with high self time (which consume a lot of CPU time on their own).
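As a side note, if the children-time ordering gets in the way (e.g. because deep call chains all show close to 100% children time), perf report can be asked to rank functions by self time only. I believe the flag for that is --no-children, but double-check perf report --help on your system:

    # Rank functions purely by self time, dropping the accumulated children-time column
    perf report --no-children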

By pressing Return when a specific function is highlighted, you can also get the fraction of time spent in its various children, which can provide additional insight, particularly when you profile complex applications in which a single function gets called in many different circumstances.

By pressing "a", you will get assembly annotated with associated source code, and will be able to cycle through asm instructions in order of decreasing hotness (sample count) by pressing TAB. When you use this mode, it's easy to end up with statistically insignificant results, so be sure to measure program executions over longer periods of time and have a look at the raw sample count with "t".
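If you prefer, you can also reach the annotated assembly without going through the report UI, via the standalone perf annotate command. The symbol name below is just a placeholder; run the command without arguments to annotate everything recorded in perf.data:

    # Annotated assembly for one symbol from the recorded profile
    # ("mycrate::hot_function" is a placeholder, substitute your own symbol)
    perf annotate mycrate::hot_function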

That's just a perf profiling crash course, and I'm only scratching the surface of what can be done, but hopefully it should be enough to get you going. For more, check out Brendan Gregg's awesome Linux perf Examples docs.


[1] A cool profiling technique that leverages the Last Branch Record performance counters of modern CPUs (Haswell or newer for Intel, IIRC) in order to acquire call graphs. Extremely low overhead, allowing sampling at 10 kHz without a significant performance impact on the application. Small profile files that get analyzed quickly. However, the depth of the call graph that it can record is limited, hence my mention of recursion depth as an important constraint. If that ends up being a problem, try DWARF call graphs, as sketched below.
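For reference, the DWARF variant just means swapping the --call-graph argument; expect noticeably bigger perf.data files and more recording overhead in exchange for deeper call graphs:

    # DWARF-based call graphs: deeper stacks, but larger files and higher overhead
    perf record --call-graph=dwarf cargo run --release
    perf report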
