Is there an overhead when using `&mut Struct` fields?

Hello all,

I have recently been trying to optimise my ray-tracer, and some recent changes made a big (NEGATIVE) difference, which I was not expecting at all.

Let me show some code snippets. I made similar changes in other functions as well, so I am only showing one here. (Please don't be too harsh... I was about to clean up the code when I found this, and decided to publish early so I could ask the question.)

Original case --> check it out HERE

So, I originally had something like this (I cleaned it a bit):

fn get_global_illumination(
        &self,
        scene: &Scene,
        n_ambient_samples: usize,
        n_shadow_samples: usize,
        current_depth: usize,
        material: &Box<dyn Material + Sync>,
        normal: Vector3D, // I am planning to get rid of this
        e1: Vector3D, // I am planning to get rid of this
        e2: Vector3D, // I am planning to get rid of this
        ray: &mut Ray,
        intersection_pt: Point3D,
        wt: Float,  // This also lives in &Ray, so I don't want to use it as an input
        rng: &mut RandGen,
        aux: &mut RayTracerHelper
    ) -> Spectrum {
        let mut global = Spectrum::black();
        let depth = current_depth;
        aux.rays[depth] = *ray; // Aux rays is a stack, used for avoiding allocations

        for _ in 0..n_ambient_samples {
            // Choose a direction.
            let (bsdf_value, _is_specular) =
                material.sample_bsdf(normal, e1, e2, intersection_pt, ray, rng);
            let new_ray_dir = ray.geometry.direction;

            // increase depth
            let new_depth = current_depth + 1;
            let cos_theta = (normal * new_ray_dir).abs();
            let new_value = wt * bsdf_value * cos_theta;

            let color = material.colour();

            let (li, light_pdf) = self.trace_ray(rng, scene, ray, new_depth, new_value, aux);

            let fx = (li * cos_theta) * (color * bsdf_value); 

            let denominator = bsdf_value * n_ambient_samples as Float
                + n_shadow_samples as Float * light_pdf;

            global += fx / denominator;

            // restore ray, because it was modified by trace_ray executions
            *ray = aux.rays[depth];
        }
        // return
        global
    }

What I thought would be an improvement --> check it out HERE

So, I decided to carry some of the information—e.g., depth and value—in the ray itself, like so:

fn get_global_illumination(
        &self,
        scene: &Scene,
        n_ambient_samples: usize,
        n_shadow_samples: usize,        
        material: &Box<dyn Material + Sync>,
        normal: Vector3D, // I am planning to get rid of this
        e1: Vector3D, // I am planning to get rid of this
        e2: Vector3D, // I am planning to get rid of this
        ray: &mut Ray,   
        rng: &mut RandGen,
        aux: &mut RayTracerHelper
    ) -> Spectrum {
        let mut global = Spectrum::black();
        let depth = ray.depth; // This was previously an input
        aux.rays[depth] = *ray;
        let intersection_pt = ray.interaction.point; // this value was previously an input

        for _ in 0..n_ambient_samples {
            // Choose a direction.
            let (bsdf_value, _is_specular) =
                material.sample_bsdf(normal, e1, e2, intersection_pt, ray, rng);
            let new_ray_dir = ray.geometry.direction;

            ray.depth += 1; // instead of `let new_depth = current_depth + 1;`
            let cos_theta = (normal * new_ray_dir).abs();
            ray.value *= bsdf_value * cos_theta; // instead of `let new_value = wt * bsdf_value * cos_theta;`

            let color = material.colour();

            let (li, light_pdf) = self.trace_ray(rng, scene, ray, aux);

            let fx = (li * cos_theta) * (color * bsdf_value); 
            let denominator = bsdf_value * n_ambient_samples as Float
                + n_shadow_samples as Float * light_pdf;

            global += fx / denominator;

            // restore ray, which was modified
            *ray = aux.rays[depth];
        }
        // return
        global
    }

Result:

The result is that both functions produce the same image, but THE FIRST VERSION TAKES 60 SECONDS WHILE THE SECOND TAKES 90 SECONDS.

Any clues as to why this is happening? The only difference is that some values went from being inputs to being fields on the Ray.

It's hard to know without wider context, but maybe, because a reference to them is now taken, they cannot be SSA values, and thus cannot be kept in registers and end up poorly optimized?

This jumps out to me as suspicious for optimization impact.

Writing through a pointer that's passed in makes it much harder for the optimizer to figure out what's going on. When you just have a local wt: Float, it's very easy for it to tell that the value never changes, and thus do things like hoist computations out of the loop.

A Ray looks like it's Clone. Does it do any memory allocation? You might be able to make the compiler's job much easier by cloning it into a local -- especially if that can be done inside the loop efficiently -- so it becomes incredibly obvious that nothing affecting it can affect other iterations.
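
Not your project's real types, but a minimal sketch of what I mean, with a stand-in Ray and a made-up per_sample function:

#[derive(Clone)]
struct Ray {
    value: f64,
}

fn per_sample(template: &Ray, n_samples: usize) -> f64 {
    let mut total = 0.0;
    for _ in 0..n_samples {
        // A fresh local clone each iteration: no call can observe or mutate
        // it through an outside `&mut`, so its fields can stay in registers.
        let mut ray = template.clone();
        ray.value *= 0.5; // stand-in for the bsdf_value * cos_theta update
        total += ray.value;
    }
    total
}

fn main() {
    let template = Ray { value: 1.0 };
    assert_eq!(per_sample(&template, 4), 2.0);
}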

Notably, LLVM might not be able to prove that the call to trace_ray doesn't change aux.rays, and thus it doesn't actually know that the ray will be restored. If aux.rays somehow got shorter, then aux.rays[depth] could panic, resulting in *ray not getting restored.

Generally, nothing can be said about the performance of such insignificant code transformations. They may very well cause unpredictable changes in performance, based on some compiler heuristics. E.g. maybe the change in parameters caused some function to not be inlined, thus missing optimizations? Or maybe those "optimizations" end up being pessimizations, because they block some more performant transformation? In that case hindering inlining may be a good thing. Maybe you have increased the register pressure so that some common variable is constantly swapped in and out? Maybe some indirection wasn't successfully removed? Etc, etc.

You'd need to provide a specific benchmark for that function which can be analyzed, including at the asm level. But in general, I'd say that such wild fluctuations in performance mean the function is poorly optimized, so I would look at some higher-value optimizations that could be applied to the code, like helping the autovectorizer. E.g. you use index-based iteration and subscripting, which can cause panics on access and unnecessary bounds checks. Both of these can directly tank your performance and inhibit autovectorization. I would suggest using iterators, and adding assertions to make it easier for the compiler to elide bounds checks.
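
For instance, a minimal sketch of the assertion idea (stub slice and a hypothetical restore_loop, not your actual code):

fn restore_loop(rays: &[f64], depth: usize, n_samples: usize) -> f64 {
    // One up-front check: after this, the compiler knows `rays[depth]`
    // inside the loop cannot panic, so the bounds check can be elided.
    assert!(depth < rays.len());
    let mut acc = 0.0;
    for _ in 0..n_samples {
        acc += rays[depth];
    }
    acc
}

fn main() {
    let rays = vec![1.0, 2.0, 3.0];
    assert_eq!(restore_loop(&rays, 1, 4), 8.0);
}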

Well, what I was trying to do was to AVOID memory allocations (I noticed they were causing issues in some other parts of the code). I am doing this by pre-allocating a stack of rays in aux.rays, to/from which I then just Copy.

I'll check what happens when I just allow for allocation again.

Thanks! I'll look into this and let you all know!

As far as I can tell, aux.rays is used exclusively to roll back the value of ray during the execution of get_global_illumination. It's not modified anywhere else, which raises the question: why even have that home-spun stack when you can just store a copy in a local variable? You're using the normal function call stack anyway. It would also remove the artificial limit of 10 on call depth (you're not changing the length of the aux.rays vector anywhere outside of its default initialization) and likely help the optimizer.
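
A minimal sketch of that idea, assuming Ray is a cheap Copy type (stub struct and a hypothetical function name):

#[derive(Clone, Copy)]
struct Ray {
    depth: usize,
    value: f64,
}

fn get_global_illumination_sketch(ray: &mut Ray, n_ambient_samples: usize) {
    let saved = *ray; // a plain Copy onto the call stack, no aux.rays[depth]
    for _ in 0..n_ambient_samples {
        ray.depth += 1;
        ray.value *= 0.5; // stand-in for the per-sample mutation
        // ... trace the ray here ...
        *ray = saved; // restore; no indexing involved, so this cannot panic
    }
}

fn main() {
    let mut ray = Ray { depth: 0, value: 1.0 };
    get_global_illumination_sketch(&mut ray, 4);
    assert_eq!(ray.depth, 0); // fully restored after the loop
}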

Are you able to test a minimal case on rust.godbolt.org? (don't forget to add -O to the flags there)


BTW: are you intentionally using &Box? It makes the pointer thin, but access to the dyn object's data goes through a double indirection. A &dyn Material fat pointer would avoid having a pointer to a pointer.
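
For illustration, a minimal standalone sketch (stub Material trait and Plastic type, not the real ones) showing both signatures; a Box<dyn Material + Sync> coerces to the fat reference at the call site:

trait Material {
    fn colour(&self) -> f64;
}

struct Plastic;
impl Material for Plastic {
    fn colour(&self) -> f64 {
        0.5
    }
}

// Thin pointer to a Box: reaching the data costs two pointer hops.
fn via_box(material: &Box<dyn Material + Sync>) -> f64 {
    material.colour()
}

// Fat pointer (data pointer + vtable): one hop to the data.
fn via_dyn(material: &(dyn Material + Sync)) -> f64 {
    material.colour()
}

fn main() {
    let boxed: Box<dyn Material + Sync> = Box::new(Plastic);
    assert_eq!(via_box(&boxed), via_dyn(&*boxed));
}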

I tried to get some info from cargo-asm, but it did not work (yesterday it did... :/). It is hard to get something to compile on rust.godbolt.org because of the amount of interconnection between the types here. Any advice?

I did this, but it did not help. I now have an unbearably slow program. Ray is small right now, but the whole plan of avoiding allocations was to allow it to get bigger, and thus to allow more parameters to further improve the rendering algorithm.

Originally the Ray was passed by value, and I think the compiler then decided to inline everything, so it was not slow at all. However, now that I am passing it by reference, it would need to allocate a ray a ginormous number of times per simulation.

If you mark the function as #[inline(always)] (just to investigate the problem, don't do that in production code!), does it help?

You can run RUSTFLAGS="--emit llvm-ir" cargo build --release, and search for the $YOUR_CRATE.ll file in ./target/release/deps (or use --emit asm if you really want to read assembly). But be warned: I did this to your project and the resulting output is huge. The private functions are inlined, and the trace_ray function's IR spans many pages. I find it very hard to draw any conclusions from such output. For that reason I would recommend making a trimmed-down version of your project that focuses as much as possible on the important function. Compiling with panic = "abort" may simplify the output (but it may also change the performance profile, so always benchmark).
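
For reference, panic = "abort" goes in the release profile of Cargo.toml:

# Cargo.toml: abort on panic to drop the unwinding machinery from the
# emitted IR/asm; benchmark before and after, since it changes codegen.
[profile.release]
panic = "abort"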

So, I suspect that the reason for this was NOT overhead from the fields, but the fact that another function stopped being inlined.

Now I am working on that function.
