Here is a concrete example where I would like to understand the output of various profilers.
A renderer normally has three phases:
- You read a scene description via a parser
- You build acceleration structures like bounding volume hierarchies (BVHs) to speed up (mainly) ray intersection calls
- You do the rendering (ray tracing/path tracing) in parallel (multi-threaded)
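The three phases above could be sketched roughly like this (all function and type names here are hypothetical placeholders, not the actual rs_pbrt API):

```rust
use std::sync::Arc;
use std::thread;

// Placeholder types, for illustration only.
struct Scene { primitives: Vec<u32> }
struct Bvh { nodes: usize }

// Phase 1: parse the scene description (single-threaded).
fn parse_scene(_path: &str) -> Scene {
    Scene { primitives: (0..1000).collect() }
}

// Phase 2: build the acceleration structure (single-threaded, recursive).
fn build_bvh(scene: &Scene) -> Bvh {
    // A binary BVH over n leaves has at most 2n - 1 nodes.
    Bvh { nodes: scene.primitives.len() * 2 - 1 }
}

// Phase 3: render in parallel (one worker per tile/thread).
fn render(scene: Arc<Scene>, bvh: Arc<Bvh>, num_threads: usize) -> usize {
    let handles: Vec<_> = (0..num_threads)
        .map(|_| {
            let scene = Arc::clone(&scene);
            let bvh = Arc::clone(&bvh);
            // Stand-in "work": a real renderer would trace rays here.
            thread::spawn(move || scene.primitives.len() + bvh.nodes)
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    let scene = Arc::new(parse_scene("assets/scenes/conference_room.pbrt"));
    let bvh = Arc::new(build_bvh(&scene));
    let total = render(scene, bvh, 4);
    println!("{}", total);
}
```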
While converting the C++ code from the PBRT book to Rust I didn't care too much about memory efficiency (so far), but I do get some decent results already. Right now the bottleneck for more complex scenes seems to be the recursive building of the BVH. As described here you can render one of the test scenes like this:
```shell
> ./target/release/examples/pest_test -i assets/scenes/conference_room.pbrt
...
WorldEnd
DEBUG: rr_threshold = 1
...
```
Once you see the lines above, the BVH gets built (single-threaded), and once the multi-threading starts (watch the CPUs) the BVH is done and rendering starts. Building the BVH takes several minutes, whereas the C++ version is pretty fast. I assume that's because the build is recursive and the C++ version manages memory itself via a class MemoryArena. It would be nice to have proof of this assumption. So I started to look into profiling Rust code and found mainly two helpful sources, Tools for profiling Rust, and a previous post by @llogiq called Profiling Rust applications on Linux. I read through both and installed tools along the way. Because I wanted to focus on a particular part of the code I decided to use the crate cpuprofiler. You can find the latest Rust code of my rs_pbrt library with the example parser, called pest_test (see above), on GitHub and some documentation on my web site. I removed the profiling lines again from the GitHub repository because I didn't want to complicate things for Travis CI, but basically you would add the cpuprofiler crate to your Cargo.toml file:
```toml
[dependencies]
cpuprofiler = "0.0.3"
```
And wrap the recursive function call like this:
```rust
...
extern crate cpuprofiler;
...
use cpuprofiler::PROFILER;
...
impl BVHAccel {
    pub fn new(p: Vec<Arc<Primitive + Sync + Send>>,
               max_prims_in_node: usize,
               split_method: SplitMethod)
               -> Self {
        ...
        PROFILER.lock().unwrap().start("./recursive_build.profile").unwrap();
        let root = BVHAccel::recursive_build(bvh.clone(), // instead of self
                                             // arena,
                                             &mut primitive_info,
                                             0,
                                             num_prims,
                                             &mut total_nodes,
                                             &mut ordered_prims);
        PROFILER.lock().unwrap().stop().unwrap();
        ...
```
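For context, the C++ MemoryArena mentioned above is essentially a bump allocator: it grabs large blocks up front and hands out chunks from them, so a recursive build doesn't pay for one heap allocation per node. A minimal Rust sketch of that idea (hypothetical, not the pbrt or rs_pbrt implementation) could look like:

```rust
// A tiny bump-allocator sketch: pre-allocates fixed-size blocks and
// hands out offsets, so many small allocations avoid individual heap calls.
struct Arena {
    blocks: Vec<Vec<u8>>,
    block_size: usize,
    offset: usize, // next free byte in the current (last) block
}

impl Arena {
    fn new(block_size: usize) -> Self {
        Arena { blocks: vec![vec![0u8; block_size]], block_size, offset: 0 }
    }

    // Reserve `size` bytes; start a new block when the current one is full.
    // Returns (block index, offset) identifying the reservation.
    fn alloc(&mut self, size: usize) -> (usize, usize) {
        assert!(size <= self.block_size);
        if self.offset + size > self.block_size {
            self.blocks.push(vec![0u8; self.block_size]);
            self.offset = 0;
        }
        let slot = (self.blocks.len() - 1, self.offset);
        self.offset += size;
        slot
    }
}

fn main() {
    let mut arena = Arena::new(1024);
    // Simulate many small node allocations, as a recursive BVH build would.
    for _ in 0..10_000 {
        arena.alloc(64);
    }
    println!("blocks used: {}", arena.blocks.len());
}
```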
I kept the resulting profile file in profile/recursive_build.profile.gz in case someone wants to have a look. After unzipping the file via gunzip you could look at the call graph, e.g. via graphviz:
```shell
pprof -gv ./target/release/examples/pest_test recursive_build.profile
```
See the two posts mentioned above for information about pprof and related tools. Feel free to use other profilers. My question for help is this:
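For example, Linux perf works on an unmodified release build; compile with debug symbols (e.g. `debug = true` under `[profile.release]` in Cargo.toml) so the report shows function names. A typical session might look like this (paths as in the example above):

```shell
# Record with call-graph information (DWARF-based unwinding):
perf record --call-graph=dwarf \
    ./target/release/examples/pest_test -i assets/scenes/conference_room.pbrt
# Browse the interactive report afterwards:
perf report
```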
How do I identify the bits and pieces of my Rust code which cause the major bottleneck during BVH creation?
Thanks in advance for your help.