Why is this code much slower than C++? (BVH Traversal)


So, I have been developing a ray-tracer... but it is slow. After some profiling, I realised that my scene-intersection accelerator (a BVH) is the bottleneck. I mean, it is much slower than the benchmarks I have seen.

Now, my intention was to port the BVH code from PBRT v4 (link to function) to Rust... which I have evidently not quite managed. It is accurate (i.e., the engine renders just fine), but it is painfully slow.

Here is my function:

/// Traverses the BVH, looking for the first primitive hit by the ray.
/// Returns `true` if there is a hit; the closest hit is written into
/// `ray.interaction`.
pub fn intersect(
    &self,
    primitives: &[Object],
    ray: &mut Ray,
    nodes_to_visit: &mut Vec<usize>,
) -> bool {
    const MIN_T: Float = 0.0000001;

    if self.nodes.is_empty() {
        return false;
    }
    // reset the traversal stack
    nodes_to_visit.clear();

    let mut prim_index: Option<usize> = None;
    let mut t_squared = Float::MAX;

    let inv_x = 1. / ray.geometry.direction.x;
    let inv_y = 1. / ray.geometry.direction.y;
    let inv_z = 1. / ray.geometry.direction.z;

    let inv_dir = Vector3D::new(inv_x, inv_y, inv_z);
    let dir_is_neg = (inv_dir.x < 0., inv_dir.y < 0., inv_dir.z < 0.);
    let mut current_node = 0;

    for _ in 0..self.nodes.len() {
        let node = &self.nodes[current_node];
        if node.bounds.intersect(&ray.geometry, &inv_dir) {
            if node.is_leaf() {
                let offset = node.next;
                // Check all the objects in this node
                for i in 0..node.n_prims {
                    if let Some(intersect_info) = primitives[offset as usize + i as usize]
                        .primitive
                        .intersect(&ray.geometry)
                    {
                        // If hit, check the distance.
                        let this_t_squared =
                            (intersect_info.p - ray.geometry.origin).length_squared();
                        // If the distance is less than the previous one, update the info
                        if this_t_squared > MIN_T && this_t_squared < t_squared {
                            t_squared = this_t_squared;
                            prim_index = Some(offset as usize + i as usize);
                            ray.interaction.geometry_shading = intersect_info;
                        }
                    }
                }
                // Update the node we need to visit next, if any... otherwise, finish
                if let Some(i) = nodes_to_visit.pop() {
                    current_node = i;
                } else {
                    break;
                }
            } else {
                // Interior node... choose the first or second child,
                // and push the other one onto the stack
                let is_neg = match node.axis {
                    BBoxAxis::X => dir_is_neg.0,
                    BBoxAxis::Y => dir_is_neg.1,
                    BBoxAxis::Z => dir_is_neg.2,
                };
                if is_neg {
                    nodes_to_visit.push(current_node + 1);
                    current_node = node.next as usize;
                } else {
                    nodes_to_visit.push(node.next as usize);
                    current_node += 1;
                }
            }
        } else if let Some(i) = nodes_to_visit.pop() {
            current_node = i;
        } else {
            break;
        }
    } // End loop

    // return
    if let Some(index) = prim_index {
        let t = t_squared.sqrt();

        ray.interaction.point = ray.geometry.project(t);
        ray.interaction.wo = ray.geometry.direction * -1.;
        ray.interaction.prim_index = index;
        true
    } else {
        false
    }
}
I have also checked some other references:

Any suggestions as to why this is happening?

Kind regards! I am loving this forum and this community.

Obligatory question, since I didn't see you mention it: you're running with --release, right?

Hi! I am actually benchmarking through Criterion... the other tests, yes, with --release


Any reference for the Self and Ray types? It seems to me that the time-consuming part could be calling "intersect".


Do your C++ benchmarks use any flags like -ffast-math? Rust doesn't have an analogue.

Do you have a benchmark code which we could run and check?

This is likely not the cause, but are you compiling the C++ with Clang? It should have output similar to Rust's, but may differ from GCC's. The difference is usually measured in percent, though.

Indexing checks are very unlikely to be the culprit in this code, but just in case, you can try changing your indexing operations into get_unchecked calls. If it helps, we may ponder how to do the same thing safely.


Thanks for reaching out. The rest of the code is kind of large (you can check it out HERE), but I suspect you are right. I am working on alignment and things of that sort. It is just that, when I profiled this with Apple's Instruments (which, unlike Flamegraph, which I otherwise prefer, shows inlined code), this function showed up as the issue.

I'll keep you posted!

I am not using C++ myself; I am just comparing against my experience with Radiance and the values that other packages report.

I guess that, if none of you are seeing any self-evident sources of poor performance, then the issue must be somewhere else, right?

Are you benchmarking on the same machine?

Your current_node handling is a bit complicated, as you iterate over the number of nodes, but don't do anything with that number. So maybe the compiler can't assert statically that current_node will always be a good index, and so the node access at the start of each iteration will have a runtime check.

You could try

   let node = unsafe { self.nodes.get_unchecked(current_node) };

If that works, you may experiment with ways to get rid of the unsafe and have the compiler see for itself that the index will be in-range. One way may be to change the outer for loop to while current_node < self.nodes.len() (and making sure to set a too-large current_node when it should terminate).
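As a rough illustration of that rewrite (with a stand-in Node type and a trivial traversal, not the actual code from the thread), making the loop condition the same comparison as the bounds check gives the compiler a chance to elide the runtime check on the index:

```rust
// Stand-in node type, just enough to demonstrate the loop shape.
struct Node {
    next: usize,
    is_leaf: bool,
}

// Counts visited nodes. The loop condition `current_node < nodes.len()`
// is exactly the bounds check, so the compiler may prove the index below
// is always in range and skip the runtime check.
fn traverse(nodes: &[Node]) -> usize {
    let mut visited = 0;
    let mut current_node = 0;
    while current_node < nodes.len() {
        let node = &nodes[current_node];
        visited += 1;
        if node.is_leaf {
            // Terminate by setting an out-of-range index.
            current_node = usize::MAX;
        } else {
            current_node = node.next;
        }
    }
    visited
}

fn main() {
    let nodes = vec![
        Node { next: 1, is_leaf: false },
        Node { next: 0, is_leaf: true },
    ];
    println!("{}", traverse(&nodes)); // prints 2
}
```

Whether the check is actually elided depends on the optimizer, so it is worth confirming in the emitted assembly.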

Well, maybe, but it doesn't hurt to check. Massive performance differences in such kinds of functions are usually due to the autovectorization failing to apply. Most common reasons are floating point nonassociativity and problematic bounds checks. So that's the direction which I would investigate first.

You can try looking at the assembly with

RUSTFLAGS="--emit asm" cargo build --release

and search for SIMD opcodes in the listing of your function.

Hello all, thanks for answering!

So, I think (PART OF) the issue was related to dispatching, alignment and packing.

So, originally the content of primitives[offset] was an Object

pub enum Primitive {
    // ... one variant per supported shape ...
}

pub struct Object {
    pub primitive: Primitive,
    pub front_material_index: usize,
    pub back_material_index: usize,
}

Where the primitives look something like this:

pub struct Sphere3D {
    /// The radius of the `Sphere3D`
    pub radius: Float,

    /// The position of the lower clipping plane on the Z axis
    zmin: Float,

    /// The position of the upper clipping plane on the Z axis
    zmax: Float,

    /// The range of longitudes allowed for the sphere
    phi_max: Float,

    /// The range of latitudes comprised within the allowed `zmin` and `zmax`
    delta_theta: Float,

    /// The minimum latitude allowed by `zmin`
    theta_min: Float,

    /// A pointer to the [`Transform`] associated with this [`Sphere3D`]
    transform: Option<RefCount<Transform>>,
    // Does the transform change the handedness of the coordinate system?
    // transform_reverses: bool,
}

This means that, at every intersection, my code had to go through:

  1. Get the primitive --> primitives[i].primitive
  2. Call intersect --> match self {/* ... identify the specific object ... */}
  3. Check whether there is a transform and, if so, transform the ray --> if let Some(t) = &self.transform { ray.transform(t) }
    3.1 (Note that this step implies following an Rc)
  4. Intersect, and return.
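That chain can be sketched in compilable form. The types here are deliberately simplified and hypothetical (a 1-D "geometry" to keep it short); the real ones are in the linked repository:

```rust
use std::rc::Rc;

struct Transform {
    offset: f64,
}

enum Primitive {
    // The sphere may carry a reference-counted transform.
    Sphere {
        radius: f64,
        transform: Option<Rc<Transform>>,
    },
    Plane {
        height: f64,
    },
}

struct Object {
    primitive: Primitive,
}

impl Primitive {
    fn intersect(&self, origin: f64) -> Option<f64> {
        // Step 2: a branch per variant runs before any intersection math.
        match self {
            Primitive::Sphere { radius, transform } => {
                // Step 3 / 3.1: check for a transform and follow the Rc.
                let o = match transform {
                    Some(t) => origin - t.offset,
                    None => origin,
                };
                // Step 4: the (placeholder) intersection test itself.
                (o.abs() <= *radius).then_some(o)
            }
            Primitive::Plane { height } => (origin >= *height).then_some(*height),
        }
    }
}

fn main() {
    let obj = Object {
        primitive: Primitive::Sphere {
            radius: 1.0,
            transform: Some(Rc::new(Transform { offset: 0.5 })),
        },
    };
    println!("{:?}", obj.primitive.intersect(1.0)); // Some(0.5)
}
```

Every call pays for the match and, possibly, the pointer chase, before any useful work happens.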

So, yesterday I made some changes. One of them was the following:

// Redefined Primitive as Triangle (now I only handle triangles)
pub struct Triangle {
    pub vertices: [Float; 9],
    pub front_material_index: usize,
    pub back_material_index: usize,
}

The benefits are massive (times dropped to about half), I suspect mainly because:

  1. Triangles don't need transformations, so step 3 above disappears
  2. There is no match statement

Now, I wonder whether I will need to handle non-triangular primitives again at some point... any suggestions?

This is a pretty classic example. Use a struct of arrays:

struct Primitives {
  spheres: Vec<Sphere3D>,
  triangles: Vec<Triangle3D>,
}

Then loop through each type separately. This essentially lets the CPU figure out what you're doing far better, because it's going through the same code over and over, and it packs the data a bit better too.
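Expanded into a compilable sketch, with stand-in types and a placeholder "test" in each loop where the real ray intersection would go:

```rust
// Stand-in primitive types; the real ones come from the thread.
struct Sphere3D {
    center: [f64; 3],
    radius: f64,
}

struct Triangle3D {
    vertices: [f64; 9],
}

struct Primitives {
    spheres: Vec<Sphere3D>,
    triangles: Vec<Triangle3D>,
}

impl Primitives {
    /// Hypothetical closest-hit query: each homogeneous loop runs the
    /// same straight-line code over contiguous data, which is friendly
    /// to the branch predictor and (sometimes) the autovectorizer.
    fn closest_hit(&self) -> Option<f64> {
        let mut best = f64::MAX;
        for s in &self.spheres {
            // Placeholder for a real ray/sphere test: |center| - radius.
            let d = s.center.iter().map(|c| c * c).sum::<f64>().sqrt() - s.radius;
            best = best.min(d);
        }
        for t in &self.triangles {
            // Placeholder for a real ray/triangle test: |first vertex|.
            let v = &t.vertices;
            let d = (v[0] * v[0] + v[1] * v[1] + v[2] * v[2]).sqrt();
            best = best.min(d);
        }
        (best < f64::MAX).then_some(best)
    }
}

fn main() {
    let prims = Primitives {
        spheres: vec![Sphere3D { center: [3.0, 0.0, 0.0], radius: 1.0 }],
        triangles: vec![Triangle3D {
            vertices: [0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        }],
    };
    println!("{:?}", prims.closest_hit()); // Some(2.0)
}
```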

There are probably libraries that will do this internally for you while keeping a simple flat collection API, but they're overkill for the simple case above. Adding things like optional materials and transforms gets you into ECS territory, though. That gets really interesting, but it's probably not what you want to spend time on while you're experimenting; ECS frameworks are big enough to be their own learning project.


Thanks! But the problem with that approach is that it does not really allow me to create a data structure for accelerating intersections (i.e., in this case, a BVH). For that, I need to sort all the objects in the scene in a way that allows quickly skipping those that, based on current information, we know will not be intersected.

I ended up transforming everything into Triangles.

type Triangle = [f64; 9];

and then take a functional approach:

pub fn simple_triangle_intersect(
    t: &Triangle,
    ray: &geometry3d::Ray3D,
) -> Option<geometry3d::Point3D> {
    // ... ray/triangle test ...
}
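For reference, a ray/triangle test over that flat [f64; 9] vertex layout might look like the following Möller–Trumbore sketch. It is not the thread's actual implementation, and it returns the ray parameter t instead of a Point3D, purely to keep it self-contained:

```rust
// Flat triangle layout: (x0,y0,z0, x1,y1,z1, x2,y2,z2).
type Triangle = [f64; 9];

fn sub(a: [f64; 3], b: [f64; 3]) -> [f64; 3] {
    [a[0] - b[0], a[1] - b[1], a[2] - b[2]]
}
fn cross(a: [f64; 3], b: [f64; 3]) -> [f64; 3] {
    [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]
}
fn dot(a: [f64; 3], b: [f64; 3]) -> f64 {
    a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
}

/// Möller–Trumbore: returns the distance `t` along the ray to the hit
/// point, if any.
fn simple_triangle_intersect(t: &Triangle, origin: [f64; 3], dir: [f64; 3]) -> Option<f64> {
    const EPS: f64 = 1e-9;
    let v0 = [t[0], t[1], t[2]];
    let v1 = [t[3], t[4], t[5]];
    let v2 = [t[6], t[7], t[8]];
    let e1 = sub(v1, v0);
    let e2 = sub(v2, v0);
    let p = cross(dir, e2);
    let det = dot(e1, p);
    if det.abs() < EPS {
        return None; // ray parallel to the triangle plane
    }
    let inv_det = 1.0 / det;
    let s = sub(origin, v0);
    let u = dot(s, p) * inv_det;
    if !(0.0..=1.0).contains(&u) {
        return None; // outside the first barycentric coordinate
    }
    let q = cross(s, e1);
    let v = dot(dir, q) * inv_det;
    if v < 0.0 || u + v > 1.0 {
        return None; // outside the second barycentric coordinate
    }
    let th = dot(e2, q) * inv_det;
    (th > EPS).then_some(th)
}

fn main() {
    let tri: Triangle = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0];
    let hit = simple_triangle_intersect(&tri, [0.25, 0.25, 1.0], [0.0, 0.0, -1.0]);
    println!("{:?}", hit); // Some(1.0)
}
```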

Sticking with triangles should absolutely be fine!

But, should you desire, you may still be able to build an independent BVH over each primitive type (say they implement a HasBounds trait used by the BVH) and get better performance than the original, since the code is more predictable to the CPU. Probably not as good as just using triangles, until you run off the end of a CPU cache level, but perhaps you have some other use.
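A sketch of that idea, with hypothetical names throughout and a placeholder query standing in for real node traversal:

```rust
// Axis-aligned bounds, the only thing the generic BVH needs to know.
#[derive(Clone, Copy, Debug)]
struct Bounds {
    min: [f64; 3],
    max: [f64; 3],
}

impl Bounds {
    fn union(self, other: Bounds) -> Bounds {
        let mut b = self;
        for i in 0..3 {
            b.min[i] = b.min[i].min(other.min[i]);
            b.max[i] = b.max[i].max(other.max[i]);
        }
        b
    }
}

trait HasBounds {
    fn bounds(&self) -> Bounds;
}

// One BVH per primitive type: storage stays homogeneous, so the hot
// loop runs a single code path. (Real nodes elided.)
struct Bvh<T: HasBounds> {
    primitives: Vec<T>,
}

impl<T: HasBounds> Bvh<T> {
    fn new(primitives: Vec<T>) -> Self {
        Bvh { primitives }
    }

    /// Placeholder query in place of real traversal: the union of all
    /// primitive bounds.
    fn world_bounds(&self) -> Option<Bounds> {
        self.primitives.iter().map(|p| p.bounds()).reduce(Bounds::union)
    }
}

struct Sphere {
    center: [f64; 3],
    radius: f64,
}

impl HasBounds for Sphere {
    fn bounds(&self) -> Bounds {
        let r = self.radius;
        Bounds {
            min: [self.center[0] - r, self.center[1] - r, self.center[2] - r],
            max: [self.center[0] + r, self.center[1] + r, self.center[2] + r],
        }
    }
}

fn main() {
    let bvh = Bvh::new(vec![
        Sphere { center: [0.0, 0.0, 0.0], radius: 1.0 },
        Sphere { center: [2.0, 0.0, 0.0], radius: 1.0 },
    ]);
    println!("{:?}", bvh.world_bounds()); // bounds spanning both spheres
}
```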

A unified BVH may be possible, but at that point you're probably not going to get much benefit from a struct of arrays: if you're doing the BVH traversal to get potential matches and then testing each, you're only likely to have a handful of hits of each type.