Profilers and how to interpret results on recursive functions


#1

Here is a concrete example where I would like to understand the output of various profilers.

A renderer normally has three phases:

  1. You read a scene description via a parser
  2. You try to accelerate mainly ray intersection calls by using acceleration structures like bounding volume hierarchies (BVH)
  3. You do the rendering (ray tracing/path tracing) in parallel (multi-threaded)

While converting the C++ code from the PBRT book to Rust I didn’t care too much about memory efficiency (so far), but I do get some decent results already. Right now the bottleneck for more complex scenes seems to be the recursive building of the BVH. As described here, you can render one of the test scenes like this:

> ./target/release/examples/pest_test -i assets/scenes/conference_room.pbrt
...
WorldEnd
DEBUG: rr_threshold = 1
...

Once you see the lines above the BVH gets built (single threaded), and once the multi-threading starts (watch the CPUs) the BVH is done and rendering begins. The BVH build takes several minutes, whereas the C++ version is pretty fast. I assume it’s because it’s done recursively and the C++ version manages memory itself via a class MemoryArena. It would be nice to have proof of this assumption. So I started to look into profiling Rust code and found two main helpful sources, Tools for profiling Rust, and a previous post by @llogiq called Profiling Rust applications on Linux. I read through both and installed tools along the way. Because I wanted to focus on a particular part of the code I decided to use the crate cpuprofiler. You can find the latest Rust code of my rs_pbrt library, with the example parser called pest_test (see above), on GitHub, and some documentation on my web site. I removed the profiling lines again from the GitHub repository because I didn’t want to complicate things for Travis CI, but basically you would add the cpuprofiler crate to your Cargo.toml file (the diff below shows the dependency line I removed):

16d15
< cpuprofiler = "0.0.3"

And wrap the recursive function call like this:

...
extern crate cpuprofiler;
...
use cpuprofiler::PROFILER;
...
impl BVHAccel {
    pub fn new(p: Vec<Arc<Primitive + Sync + Send>>,
               max_prims_in_node: usize,
               split_method: SplitMethod)
               -> Self {
...
        PROFILER.lock().unwrap().start("./recursive_build.profile").unwrap();
        let root = BVHAccel::recursive_build(bvh.clone(), // instead of self
                                             // arena,
                                             &mut primitive_info,
                                             0,
                                             num_prims,
                                             &mut total_nodes,
                                             &mut ordered_prims);
        PROFILER.lock().unwrap().stop().unwrap();
...
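
For anyone who wants to try the same pattern outside of rs_pbrt, here is a minimal, self-contained sketch of wrapping an expensive (recursive) call with the cpuprofiler crate; the fib function and the profile file name are just placeholders for illustration, not code from the renderer:

extern crate cpuprofiler;

use cpuprofiler::PROFILER;

// Placeholder for the expensive recursive work you want to isolate.
fn fib(n: u64) -> u64 {
    if n < 2 {
        n
    } else {
        fib(n - 1) + fib(n - 2)
    }
}

fn main() {
    // Start sampling and write the data to the given profile file ...
    PROFILER.lock().unwrap().start("./fib.profile").unwrap();
    let result = fib(40);
    // ... and stop as soon as the interesting region is done.
    PROFILER.lock().unwrap().stop().unwrap();
    println!("fib(40) = {}", result);
}

The resulting fib.profile can then be inspected with pprof just like the recursive_build.profile mentioned below.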

I kept the resulting profile file in profile/recursive_build.profile.gz in case someone wants to have a look. After unzipping the file via gunzip you can look at the call graph, e.g. via graphviz:

pprof -gv ./target/release/examples/pest_test recursive_build.profile

See the two posts mentioned above for information about pprof and related tools. Feel free to use other profilers. The question I need help with is this:

How do I identify the bits and pieces of my Rust code which cause the major bottleneck during BVH creation?

Thanks in advance for your help.


#2

If you don’t recurse too deeply and have a new enough CPU + Linux kernel combination on your hands, here’s how I would have a go at analyzing this using my current favorite performance analysis toy, Linux perf:

  • Install the perf profiler (package name varies by distro, e.g. “perf” or “linux-tools-common”).
  • Tell cargo that we’re going to need debugging symbols. There are various ways; a minimally invasive one is to run export RUSTFLAGS='-g' in the shell which you are going to use for the remainder of this exercise.
  • Compile your example with those debugging symbols: cargo build --release
  • Profile with an LBR[1] call graph: perf record --call-graph=lbr cargo run --release. You don’t need to let the example run for minutes; profiling a few seconds of execution is sufficient for basic analysis, and perf will record the data even if you kill the application with Ctrl+C before the end.
  • Start the profile analysis UI: perf report

perf report uses a fancy ncurses-based UI that works best on large terminals, so you had better maximize your terminal window. On the left of the main UI view there are two columns: the first one is the fraction of time spent in a function or one of the functions that it calls (“children time”), and the second one is the fraction of time spent in the function itself (“self time”). The remaining columns identify the function in question.

In general, a good first approach is to look at functions from top to bottom (i.e. in order of decreasing children time), starting from the first function that you wrote (i.e. ignoring the Rust implementation scaffolding) and paying special attention to functions with high self time (which consume a lot of CPU time on their own).

By pressing Return when a specific function is highlighted, you can also get the fraction of time spent in its various children, which can provide additional insight, particularly when you profile complex applications in which a single function gets called in many different circumstances.

By pressing “a”, you will get assembly annotated with associated source code, and will be able to cycle through asm instructions in order of decreasing hotness (sample count) by pressing TAB. When you use this mode, it’s easy to end up with statistically insignificant results, so be sure to measure program executions over longer periods of time and have a look at the raw sample count with “t”.

That’s just a perf profiling crash course, and I’m only scratching the surface of what can be done, but hopefully it should be enough to get you going. For more, check out Brendan Gregg’s awesome docs @ http://www.brendangregg.com/perf.html .


[1] A cool profiling technique that leverages the Last Branch Record performance counters of modern CPUs (Haswell or newer for Intel, IIRC) in order to acquire call graphs. It has extremely low overhead, allowing sampling at 10 kHz without a significant performance impact on the application, and produces small profile files that get analyzed quickly. However, the depth of the call graph is limited, hence my mention of recursion depth as an important constraint.


#3

Thanks @HadrienG for your answer. I will try the perf profiler.


#4

Hope you’ll like it :slight_smile: Just in case, here’s why I ended up using perf after trying lots of other profiling tools, to help you pinpoint its place in the vast landscape of software performance analysis tools.


Perf measures program activity at the kernel level, using CPU performance monitoring counters. This is unlike other performance analysis tools which emulate a virtual CPU (like Valgrind/Callgrind or the Intel CPU emulator) or which fully implement profiling using user-mode software (like gprof, igprof, or Google’s CPU profiler). Here’s the tradeoff in a nutshell:

  • Performance monitoring counters provide very precise information. They are not just for accounting for where you spend your CPU cycles; you can also use them to monitor other things like cache misses, branch mispredictions, power management activity, floating-point assists, and all kinds of other “advanced” hardware activity. This data gives you insight into why your program is slow, as opposed to just where it spends its time. Unlike with emulation, you don’t need to rely on the (dubious) realism of some mathematical model of a CPU, as what is measured is truly the activity of your CPU.
  • PMC-based profiling has ultra-low overhead. Sampling program activity at 10 kHz or more becomes feasible, allowing basic performance measurements to be carried out in seconds rather than minutes, and very detailed measurements in a reasonable time.
  • Performance monitoring counters are system-wide, so you can measure behaviours which are inaccessible to user-mode profilers, such as system call activity, blocking on IO or thread contention. They give you a more realistic picture of what your program is doing. Note that due to the security implications of exposing system-wide activity, going for that level of detail will require some administrative privileges on the machine.
  • On the other hand, performance monitoring counters are hardware-specific. That’s their main drawback. Some functionality (such as LBR-based call graph profiling, which I mentioned before) may only be available on recent CPU models. Other useful functionality used to be available, but has been removed from newer CPUs (such as old Intel CPUs’ ability to discriminate x87 FPU activity from vectorized floating-point math). There are also hardware limitations on the kind of performance monitoring you can do, which vary from one CPU model to another.

Hopefully, this will give you an idea of why I think PMC-based performance analysis is awesome, and where its limitations lie. Now, why perf over alternatives that use the same measurement mechanism, like Intel VTune and Oprofile?

  • PMC-based profilers are very complex tools. It takes a lot of time to master them, and I would rather be able to leverage that investment on a wide range of hardware, rather than lock myself into the ecosystem of one specific hardware manufacturer.
  • Intel VTune, in particular, has two issues: it is very expensive, and its analysis GUI is very slow. I’m a cheapskate and I don’t like to wait on my software, so I would rather go for perf, which has neither of these drawbacks. Note that these concerns are specific to VTune, and other hardware-specific profiling tools like NVidia’s NVVP (for GPU profiling) are available free of charge and pleasant to use.
  • As for perf vs oprofile, it boils down to perf being integrated into the Linux kernel codebase, which oprofile isn’t. System-wide profiling involves a lot of ugly low-level kernel trickery, and I think the kernel devs are the ones best suited to implement such tricks and maintain them in the long run.

And why do these alternatives exist anyway?

  • Hardware-specific tools tend to have more pleasant user interfaces, wider hardware support and a lower learning curve, partly because the problem they are solving is simpler (just one family of hardware instead of everything supported by an OS) and partly because that’s simply something which HW manufacturers put more focus on than Linux kernel devs.
  • Oprofile is much older than perf, and that comes with some advantages such as better documentation, a larger community, and native support for profiling of Java applications.

#5

Thanks again, @HadrienG. I finally removed the bottleneck of BVH creation. Before:

BVHAccel::recursive_build(...)
PT189.892544945S seconds for building BVH ...
BVHAccel::flatten_bvh_tree(...)
PT0.008092472S seconds for flattening BVH ...

After the last checkin:

BVHAccel::recursive_build(...)
PT0.774610292S seconds for building BVH ...
BVHAccel::flatten_bvh_tree(...)
PT0.008037454S seconds for flattening BVH ...

That is over 3 minutes vs. 0.7 seconds :sweat_smile:
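
In case anyone wants to measure phases like this themselves: the durations above are simply printed around the two build calls. A minimal sketch of timing a phase with std::time::Instant (not the exact code used in rs_pbrt, which prints Durations in the PT...S format shown above) could look like this:

use std::time::Instant;

// Stand-in for an expensive phase like BVHAccel::recursive_build().
fn expensive_phase() -> u64 {
    (0..10_000_000u64).sum()
}

fn main() {
    let start = Instant::now();
    let result = expensive_phase();
    let elapsed = start.elapsed();
    // Report the elapsed wall-clock time (seconds plus nanoseconds).
    println!(
        "{}.{:09} seconds for building BVH ... (result = {})",
        elapsed.as_secs(),
        elapsed.subsec_nanos(),
        result
    );
}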


#6

An even more extreme example for a more complex scene:

> ./target/release/examples/pest_test -i assets/scenes/art_gallery.pbrt

Before:

BVHAccel::recursive_build(...)
PT1028.495833637S seconds for building BVH ...
BVHAccel::flatten_bvh_tree(...)
PT0.018149381S seconds for flattening BVH ...

After:

BVHAccel::recursive_build(...)
PT1.867626813S seconds for building BVH ...
BVHAccel::flatten_bvh_tree(...)
PT0.018222583S seconds for flattening BVH ...

That’s 17 minutes vs. 1.9 seconds … Funny what changing a couple of lines can do :+1:


What's everyone working on this week (42/2017)?