Profilers and how to interpret results on recursive functions

Here is a concrete example where I would like to understand the output of various profilers.

A renderer normally has three phases:

  1. You read a scene description via a parser
  2. You build acceleration structures such as bounding volume hierarchies (BVH), mainly to speed up ray intersection calls
  3. You do the rendering (ray tracing/path tracing) in parallel (multi-threaded)

While converting the C++ code from the PBRT book to Rust I haven't cared too much about memory efficiency (so far), but I already get some decent results. Right now the bottleneck for more complex scenes seems to be the recursive building of the BVH. As described here, you can render one of the test scenes like this:

> ./target/release/examples/pest_test -i assets/scenes/conference_room.pbrt
...
WorldEnd
DEBUG: rr_threshold = 1
...

Once you see the lines above, the BVH gets built (single-threaded), and once the multi-threading starts (watch the CPUs) the BVH is done and rendering starts. This takes several minutes, whereas the C++ version is pretty fast. I assume it's because the build is done recursively and the C++ version manages memory itself via a MemoryArena class. It would be nice to have proof for this assumption.

So I started to look into profiling Rust code and found mainly two helpful sources: Tools for profiling Rust, and a previous post by @llogiq called Profiling Rust applications on Linux. I read through both and installed tools along the way. Because I wanted to focus on a particular part of the code, I decided to use the crate cpuprofiler.

You can find the latest Rust code of my rs_pbrt library with the example parser, called pest_test (see above), on GitHub and some documentation on my web site. I removed the profiling lines from the GitHub repository again because I didn't want to complicate things for Travis CI, but basically you would add the cpuprofiler crate to your Cargo.toml file:

cpuprofiler = "0.0.3"

And wrap the recursive function call like this:

...
extern crate cpuprofiler;
...
use cpuprofiler::PROFILER;
...
impl BVHAccel {
    pub fn new(p: Vec<Arc<Primitive + Sync + Send>>,
               max_prims_in_node: usize,
               split_method: SplitMethod)
               -> Self {
...
        // start collecting samples; the profile is written to the given file
        PROFILER.lock().unwrap().start("./recursive_build.profile").unwrap();
        let root = BVHAccel::recursive_build(bvh.clone(), // instead of self
                                             // arena,
                                             &mut primitive_info,
                                             0,
                                             num_prims,
                                             &mut total_nodes,
                                             &mut ordered_prims);
        // stop profiling; recursive_build.profile can now be analyzed, e.g. with pprof
        PROFILER.lock().unwrap().stop().unwrap();
...

I kept the resulting profile file in profile/recursive_build.profile.gz in case someone wants to have a look. After unzipping the file via gunzip, you can look at the call graph, e.g. via graphviz:

pprof -gv ./target/release/examples/pest_test recursive_build.profile

See the two posts mentioned above for information about pprof and related tools. Feel free to use other profilers. My question is this:

How do I identify the bits and pieces of my Rust code which cause the major bottleneck during BVH creation?

Thanks in advance for your help.


If you don't recurse too deeply and have a new enough CPU + Linux kernel combination on your hands, here's how I would have a go at analyzing this using my current favorite performance analysis toy, Linux perf:

  • Install the perf profiler (package name varies by distro, e.g. "perf" or "linux-tools-common").
  • Tell cargo that we're going to need debugging symbols. There are various ways to do this; a minimally invasive one is to run export RUSTFLAGS='-g' in the shell which you are going to use for the remainder of this exercise.
  • Compile your example with those debugging symbols: cargo build --release
  • Profile with an LBR[1] call graph: perf record --call-graph=lbr cargo run --release. You don't need to let the example run for minutes; profiling a few seconds of execution is sufficient for basic analysis, and perf will record the data even if you kill the application with Ctrl+C before the end.
  • Start the profile analysis UI: perf report

The perf report command uses a fancy ncurses-based UI that works best on large terminals, so better maximize your terminal window. On the left of the main UI view there are two columns: the first one is the fraction of time spent in a function or one of the functions that it calls ("children time"), and the second one is the fraction of time spent in the function itself ("self time"). The remaining columns identify the function in question.

In general, a good first approach is to look at functions from top to bottom (i.e. in order of decreasing children time), starting from the first function that you wrote (i.e. ignoring the Rust implementation scaffolding) and paying special attention to functions with high self time (which consume a lot of CPU time on their own).

By pressing Return when a specific function is highlighted, you can also get the fraction of time spent in its various children, which can provide additional insight, particularly when you profile complex applications in which a single function gets called in many different circumstances.

By pressing "a", you will get assembly annotated with associated source code, and will be able to cycle through asm instructions in order of decreasing hotness (sample count) by pressing TAB. When you use this mode, it's easy to end up with statistically insignificant results, so be sure to measure program executions over longer periods of time and have a look at the raw sample count with "t".

That's just a perf profiling crash course, and I'm only scratching the surface of what can be done, but hopefully it should be enough to get you going. For more, check out Brendan Gregg's awesome docs @ Linux perf Examples.


[1] A cool profiling technique that leverages the Last Branch Record performance counters of modern CPUs (Haswell or newer for Intel, IIRC) in order to acquire call graphs. Extremely low overhead, allowing sampling at 10 kHz without a significant performance impact on the application. Small profile files that get analyzed quickly. However, the depth of the call graph is limited, hence my mention of recursion depth as an important constraint. If that ends up being a problem, try DWARF call graphs.


Thanks @HadrienG for your answer. I will try the perf profiler.

Hope you'll like it :slight_smile: Just in case, here's why I ended up using perf after trying lots of other profiling tools; it may help you pinpoint where perf sits in the vast landscape of software performance analysis tools.


Perf measures program activity at the kernel level, using CPU performance monitoring counters. This is unlike other performance analysis tools which emulate a virtual CPU (like Valgrind/Callgrind or the Intel CPU emulator) or which fully implement profiling using user-mode software (like gprof, igprof, or Google's CPU profiler). Here's the tradeoff in a nutshell:

  • Performance monitoring counters provide very precise information. They are not just for accounting where you spend your CPU cycles; you can also use them to monitor other things like cache misses, branch mispredictions, power management activity, floating-point assists, and all kinds of other "advanced" hardware activity. This data gives you insight into why your program is slow, as opposed to just where it spends its time. Unlike with emulation, you don't need to rely on the (dubious) realism of some mathematical model of a CPU, as what is measured is truly the activity of your CPU.
  • PMC-based profiling has ultra-low overhead. Sampling program activity at 10 kHz or more with negligible overhead becomes feasible, which allows basic performance measurements to be carried out in seconds rather than minutes, and very detailed measurements in a reasonable time.
  • Performance monitoring counters are system-wide, so you can measure behaviours which are inaccessible to user-mode profilers, such as system call activity, blocking on IO or thread contention. They give you a more realistic picture of what your program is doing. Note that due to the security implications of exposing system-wide activity, going for that level of detail will require some administrative privileges on the machine.
  • On the other hand, performance monitoring counters are hardware-specific. That's their main drawback. Some functionality (such as LBR-based call graph profiling, which I mentioned before) may only be available on recent CPU models. Other useful functionality used to be available, but has been removed from newer CPUs (such as old Intel CPUs' ability to discriminate x87 FPU activity from vectorized floating-point math). There are also hardware limitations on the kind of performance monitoring you can do, which vary from one CPU model to another.

Hopefully, this will give you an idea of why I think PMC-based performance analysis is awesome, and where its limitations lie. Now, why perf over alternatives that use this measurement mechanism like Intel VTune and Oprofile?

  • PMC-based profilers are very complex tools. It takes a lot of time to master them, and I would rather be able to leverage that investment on a wide range of hardware, rather than lock myself into the ecosystem of one specific hardware manufacturer.
  • Intel VTune, in particular, has two issues: it is very expensive, and its analysis GUI is very slow. I'm a cheapskate and I don't like to wait on my software, so I would rather go for perf, which has neither of these drawbacks. Note that these concerns are specific to VTune; other hardware-specific profiling tools like NVidia's NVVP (for GPU profiling) are available free of charge and pleasant to use.
  • As for perf vs oprofile, it boils down to perf being integrated into the Linux kernel codebase, which oprofile isn't. System-wide profiling involves a lot of ugly low-level kernel trickery, and I think the kernel devs are the ones best suited to implement such tricks and maintain them in the long run.

And why do these alternatives exist anyway?

  • Hardware-specific tools tend to have more pleasant user interfaces, wider hardware support and a lower learning curve, partly because the problem they are solving is simpler (just one family of hardware instead of everything supported by an OS) and partly because that's simply something which HW manufacturers put more focus on than Linux kernel devs.
  • Oprofile is much older than perf, and that comes with some advantages such as better documentation, a larger community, and native support for profiling of Java applications.

Thanks again, @HadrienG, I finally removed the bottleneck of BVH creation. Before:

BVHAccel::recursive_build(...)
PT189.892544945S seconds for building BVH ...
BVHAccel::flatten_bvh_tree(...)
PT0.008092472S seconds for flattening BVH ...

After the last checkin:

BVHAccel::recursive_build(...)
PT0.774610292S seconds for building BVH ...
BVHAccel::flatten_bvh_tree(...)
PT0.008037454S seconds for flattening BVH ...

That is over 3 minutes vs. 0.7 seconds :sweat_smile:


An even more extreme example for a more complex scene:

> ./target/release/examples/pest_test -i assets/scenes/art_gallery.pbrt

Before:

BVHAccel::recursive_build(...)
PT1028.495833637S seconds for building BVH ...
BVHAccel::flatten_bvh_tree(...)
PT0.018149381S seconds for flattening BVH ...

After:

BVHAccel::recursive_build(...)
PT1.867626813S seconds for building BVH ...
BVHAccel::flatten_bvh_tree(...)
PT0.018222583S seconds for flattening BVH ...

That's 17 minutes vs. 1.9 seconds ... Funny what changing a couple of lines can do :+1:


Is perf still the preferred performance analysis tool? Or are there other tools people use?

On Linux, perf is still a solid choice. If you want more choice, you may want to try...

  • KDAB hotspot: A freeware GUI for perf. Very promising, but somewhat a work in progress (it's a little too easy to get it to crash, run very slowly, or display corrupted stack traces)
  • Intel VTune: Same general idea as perf, but with a nice GUI and some extra analysis / optimization tips. However, it's also commercial (and actually gets quite expensive unless you qualify for some special pricing) and Intel-specific. Also, the analysis GUI is quite slow, which is sad for a software optimization tool.
  • AMD CodeXL: VTune's AMD cousin. IIRC, it's free of charge. Didn't try it.
  • Callgrind/Cachegrind: Valgrind-based, effectively runs your program in a CPU emulator. Not hardware-specific, but measurements are very slow (around a 30x slowdown with respect to native execution) and the CPU model has dubious realism in places. Has a nice GUI called KCacheGrind.
  • Coz: The new hotness. This "causal profiler" tries to figure out which part of your program is most worth optimizing by slowing down the other parts and studying the impact. This allows you to avoid some classic profiling biases, such as focusing too much on CPU without checking IO load. Never tried it myself, but @alexcrichton has been hard at work making it easier to use with Rust, although that effort still seems to be a little WIP right now.

Also, if you're not on Linux, try WPA on Windows and XCode Instruments on macOS.


A recentish addition to the landscape is the Rust tool flamegraph (based on inferno). It doesn't do anything fancy: it just launches perf with appropriate options and renders the output into a flame graph. The main benefit is that you don't need to open www.brendangregg.com to recall the required perl pipeline. I've found flame graphs moderately useful for visualizing the performance of a toy recursive ray tracer.

I still sometimes use gperftools over flamegraphs, for the ability to enable and disable profiling programmatically, which helps if you have a long loading phase which you don't want to profile. For gperftools, I use this small snippet rather than the library from crates.io.
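Just to illustrate what programmatic control buys you, here's a minimal sketch using the cpuprofiler crate that already appeared earlier in this thread (not my actual snippet; load_scene and render are hypothetical placeholders for a slow loading phase and the work you care about):

extern crate cpuprofiler;

use cpuprofiler::PROFILER;

// hypothetical stand-in for a long loading phase we do not want to profile
fn load_scene() -> Vec<f32> {
    (0..1_000_000).map(|i| i as f32).collect()
}

// hypothetical stand-in for the work we actually want to measure
fn render(scene: &[f32]) -> f32 {
    scene.iter().sum()
}

fn main() {
    let scene = load_scene(); // runs before profiling starts

    // samples are only collected between start() and stop()
    PROFILER.lock().unwrap().start("./render.profile").unwrap();
    let result = render(&scene);
    PROFILER.lock().unwrap().stop().unwrap();

    println!("result = {}", result);
}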

For profiling allocations, heaptrack might be useful; it works with Rust out of the box.

For rust-analyzer, I've also found a hand-rolled hierarchical profiler (source) to be invaluable for quickly diagnosing, at a high level, which subsystem is worth looking into. The idea is to insert let _guard = profile("some label") calls manually into the code, which maintain a stack of timers in TLS. If you also aggregate same-labeled short timers in each frame, you guarantee that the total amount of info printed stays small, which is a nice property that other profiling techniques lack. I should really blog about this...
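Roughly, the guard-based idea looks like this (a simplified sketch, not the actual rust-analyzer code; the names profile and ProfileGuard are made up for illustration, and the aggregation of same-labeled timers is left out):

use std::cell::Cell;
use std::time::Instant;

thread_local! {
    // current nesting depth of active profiling scopes on this thread
    static DEPTH: Cell<usize> = Cell::new(0);
}

struct ProfileGuard {
    label: &'static str,
    start: Instant,
    depth: usize,
}

// returns a guard that times the scope it is dropped at the end of
fn profile(label: &'static str) -> ProfileGuard {
    let depth = DEPTH.with(|d| {
        let depth = d.get();
        d.set(depth + 1);
        depth
    });
    ProfileGuard { label, start: Instant::now(), depth }
}

impl Drop for ProfileGuard {
    fn drop(&mut self) {
        DEPTH.with(|d| d.set(self.depth));
        // indent by nesting depth so the output reads like a call tree
        println!("{}{}: {:?}", "  ".repeat(self.depth), self.label, self.start.elapsed());
    }
}

fn main() {
    let _guard = profile("whole program");
    for _ in 0..3 {
        let _inner = profile("one iteration");
        std::thread::sleep(std::time::Duration::from_millis(10));
    }
}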


Thanks, @HadrienG, I will stick to perf for the moment too, but your suggestions are very welcome. That "causal profiler" sounds interesting (I watched a YouTube video about it). I might give it a try. Thanks for the links ...

Thanks, @matklad, those links are very helpful. I might give heaptrack a try. My current assumption is that I need to implement something like arena-based allocation. There is a small chapter in the PBRT book about it, and the C++ code is obviously using it. So far I have avoided dealing with that part of the C++ code because I wanted to have a pixel-per-pixel match between C++ and Rust first, but now that I have implemented nearly 100% of the book's C++ code, I would like to improve the overall performance. I guess there is a lot of allocation/deallocation happening during rendering for BRDFs and BXDFs, which might run faster with a crate like bumpalo ...
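To make that concrete, here is a minimal sketch of the kind of arena allocation I have in mind with bumpalo (Lambertian is just a simplified stand-in for the real BxDF structs, and bumpalo would have to be added to Cargo.toml):

use bumpalo::Bump;

// simplified stand-in for per-intersection shading data (a real BxDF is bigger)
struct Lambertian {
    reflectance: [f32; 3],
}

fn main() {
    // one arena per rendering thread (or per tile) would be the idea
    let mut arena = Bump::new();

    for tile in 0..4 {
        for i in 0..1000 {
            // allocating is just a pointer bump, no individual deallocation needed
            let bxdf: &Lambertian = arena.alloc(Lambertian {
                reflectance: [0.5, 0.25, (tile * i) as f32 / 4000.0],
            });
            // ... use bxdf to shade the current intersection ...
            let _ = bxdf.reflectance;
        }
        // free everything from this tile at once and reuse the memory
        arena.reset();
    }
}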

Note that for this particular use case, if your initialization phase takes predictable time, you do not really need programmatic control. You can either use perf record's --delay option to ask it to only start data collection after some amount of time has elapsed, or use perf report's --time option to ask it to only display data associated with a certain timeframe.

The latter is more flexible; for example, it can also be used if your initialization takes variable time, or if you also want to profile app finalization. But if you're not interested in initialization/finalization data at all, it's better to avoid storing it on disk entirely.

Alternatively, if you do need the flexibility of programmatic control, you can also use perf record's --switch-output=signal option in order to programmatically make perf switch between different output files by sending a SIGUSR2 to it.

(Now, flamegraph may not let you do any of that, because as far as I can see it does not allow you to forward arbitrary options to perf. If true, that's just sad; wrappers should not expose only a fraction of the capabilities of the underlying tool...)
