CLI parameters severely affect application performance even when not used

My application is a CPU-driven raytracer. It takes arguments from the terminal, parses them, packs them into a struct, and passes that struct to a function which then either renders accordingly or prints info and exits. The application has defaults that are used when certain parameters are not specified, namely the output filename, resolution, and sample count. This is the main function, which may give you an indication of how CLI parameters are handled:

use std::env;

fn main() {
    // Collect the CLI arguments and hand them to the parser; render only
    // if parsing produced a scene and a set of parameters.
    let args: Vec<String> = env::args().collect();
    if let Some((scene, parameters)) = parameters::process_args(args) {
        scene.generate_image_threaded(parameters);
    }
}

The problem is that if you specify the default filename manually, the program performs differently than when you do not specify it, even though the parsing function returns the same value in both cases. Notice that I said "differently": as strange as it sounds, some days it performs worse and some days better, but the margin is large and consistent at the time of testing.
For example, passing these parameters (-O is the output file, which defaults to out.png):
-S 1 -X 1920 -Y 1080 -N 50 -O out.png
performs differently from
-S 1 -X 1920 -Y 1080 -N 50
One takes around 30s, the other around 50s.
Running the executable directly also performs differently from running it with cargo run --release: 30s vs 50s again.
I have done hours of testing, but I will not post it all here because I get different results between reboots. The results are, however, consistently different between passing -O out.png and not passing it. In fact, even if I pass a garbage flag such as -Z out.png (which doesn't do anything), the performance difference is still there.
The way I measure performance is with Instants before and after the render loop, which runs well after the parameter handling. The issue is reproducible on other people's machines to varying degrees.
After discussing this issue with some people who are more knowledgeable than me the current and only prevailing theory is that alignment of data relative to memory page boundaries is causing this issue.
Here is the repo: https://github.com/NonL4331/Raytracing-Rust.
I have also tested hardcoding the parameters and removing all CLI parameter handling entirely, and the issue still appeared.
What could be causing this, and what can I do to prevent it?


That's quite odd, and it makes me think there's something iffy with your parameter parsing: you think you are providing identical input, but something else is being set accidentally. Try dropping a println!("{:?} {:?}", scene, parameters) before generating the image and seeing whether that makes a difference.
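For that println! to compile, both types need to implement Debug. A minimal sketch, assuming you control both definitions (the field names here are made up for illustration):

    // Hypothetical fields, purely to show the derive; adapt to your real structs.
    #[derive(Debug)]
    pub struct Parameters {
        pub width: u32,
        pub height: u32,
        pub samples: u32,
        pub filename: String,
    }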

Instead of manually testing things from the command-line, I'd also suggest writing automated tests for process_args(). That way you can provide a list of inputs from the command line (e.g. &["-S", "1", "-X", "1920", "-Y", "1080", "-N", "50"]) and a list of the expected Scene and Parameters, then make sure your parameter parsing does exactly what you expect.
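A minimal sketch of such a test, assuming process_args keeps the Vec<String> -> Option<(Scene, Parameters)> signature used in main, and that Parameters derives PartialEq and Debug:

    #[cfg(test)]
    mod tests {
        use super::*;

        #[test]
        fn default_output_matches_explicit_output() {
            // env::args() includes the program name, so mimic that here.
            let implicit: Vec<String> = ["prog", "-S", "1", "-X", "1920", "-Y", "1080", "-N", "50"]
                .iter()
                .map(|s| s.to_string())
                .collect();
            let mut explicit = implicit.clone();
            explicit.extend(["-O", "out.png"].iter().map(|s| s.to_string()));

            let (_, params_a) = parameters::process_args(implicit).expect("parse failed");
            let (_, params_b) = parameters::process_args(explicit).expect("parse failed");

            // Both argument lists should yield identical parameters.
            assert_eq!(params_a, params_b);
        }
    }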

A bit of variability is to be expected because your computer may be doing other things in the background, but the fact that there is a consistent difference (and it happens to multiple people) is good because we can rule out external programs screwing up your timing :+1:

Out of curiosity, how were you testing this?

You should try to generate a flamegraph (e.g. with cargo-flamegraph) for each run to visually compare where your program spends its time depending on the output filename. If you notice one run spends a lot more time in function foo() than another, it'll be a good indicator that the logic around that area is different.


Hi, thanks for the quick response.
At the end of the program I print the parameters that were used, and I can check that it has rendered the scene to the right image, since it produces an out.png file. I have generated the two flamegraphs; they seem to differ slightly, but in a non-obvious way. During flamegraph generation, the slower of the two spat out this warning:

Warning:
Processed 826473 events and lost 17 chunks!

Check IO/CPU overload!

[ perf record: Captured and wrote 6374.112 MB perf.data (791879 samples) ]
writing flamegraph to "nooutpng.svg"

My past testing consisted of running the program five times per set of parameters; the program prints the time taken over the main render loop. I have repeated this on different days and gotten different results. As a new user on the Rust forums I can't attach more than one image at a time, so I will post Imgur links:
Imgur
Imgur

This is the main render loop and how I compare performance:

let start = Instant::now();

// Each parallel task renders one chunk of the image, then folds its
// samples into the shared image buffer under the mutex.
chunk_sizes.par_iter().for_each(|&chunk_size| {
    let image_part = self.get_image_part(width, height, chunk_size, pixel_samples);

    let mut main_image = image.lock().unwrap();

    for (value, sample) in (*main_image).iter_mut().zip(image_part.iter()) {
        *value += sample;
    }
});

let end = Instant::now();
The issue still applies when there is no parameter parsing and all the values are hardcoded.

Also, just to clarify: sometimes the -O out.png run takes around 50s and the run without it takes around 30s, and sometimes it's the other way around, with -O out.png at 30s and without it at 50s.

I should also mention that in all my testing, the performance of a given set of arguments has been consistent (within margin of error), apart from a single run that took 28s when the other runs of those exact parameters took around 52s. That was the only such occurrence in hundreds of runs of my program.

Hi there,

Just chiming in to say that I tried your code and I see the same difference in performance that you do. As you said, it even happens when I add a bogus command line parameter like "thequickbrownfox". Linux perf reports the same number of instructions run in both cases, but the case with the extra command line parameter takes about twice as long and has a lot more stalled cycles.

One extra thing I noticed: if I replace .par_iter() with .iter() in generate_image_threaded (and run with a single thread), I no longer see the performance difference. I'm not sure why thread counts would be interacting with the length of the command line like this, but just another little data point.
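For reference, the swap is just this; a toy stand-in with a trivial workload, not the raytracer's actual generate_image_threaded:

    use rayon::prelude::*;

    fn main() {
        let chunks: Vec<u64> = (0..8).collect();
        // Parallel version (what the code currently does):
        let par_sum: u64 = chunks.par_iter().map(|&c| c * c).sum();
        // Serial version (which made the performance difference disappear):
        let ser_sum: u64 = chunks.iter().map(|&c| c * c).sum();
        assert_eq!(par_sum, ser_sum);
    }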

Quite the puzzle!

Mark


I was fiddling around some more today, and found that the variation in performance also goes away if I replace BVH.get_indices with a non-recursive version (code below).

I'm still not sure how this relates to accessing command line arguments and/or multi-threading.

    fn get_indices(&self, node: NodeIndex, ray: &Ray) -> Vec<NodeIndex> {
        use std::collections::VecDeque;

        let mut queue: VecDeque<NodeIndex> = VecDeque::new();
        let mut result: Vec<NodeIndex> = Vec::new();

        queue.push_back(node);

        // Breadth-first traversal of the BVH, replacing the recursive walk.
        while let Some(idx) = queue.pop_front() {
            let node = &self.nodes[idx as usize];

            // Skip any subtree whose bounding box the ray doesn't hit.
            if !node.aabb.does_int(ray) {
                continue;
            }

            match (node.left_child, node.right_child) {
                (Some(left), Some(right)) => {
                    queue.push_back(left);
                    queue.push_back(right);
                }
                (Some(left), None) => queue.push_back(left),
                (None, Some(right)) => queue.push_back(right),
                // Leaf node: collect its primitive indices.
                (None, None) => result.extend(node.data.clone()),
            }
        }

        result
    }

Thanks for all your contributions, I really appreciate them. It is quite odd that a read-only function like BVH.get_indices interacts with the rest of the program in such a strange way.
Do you mind if I use your non-recursive implementation of get_indices in my code? That said, I will not change it until we figure out what exactly is going on here, since I would like to avoid such issues in the future and think this will be a great learning experience.
Again, thank you for helping out; I really appreciate all the help I can get. :slight_smile: cheers.

I don't mind at all--please use it if it's useful.

Out of curiosity, are you running this on an AMD CPU? One other thing I noticed was that I see the slowdown on my Ryzen 3900x, but not on my Intel laptop...


I am running this on my laptop, which has an AMD 4800HS. I have also tested this on my desktop's Intel i7 6700k, which shows the same issue but to a lesser degree. Some other people have seen little or no difference on Intel CPUs (including a second, separate i7 6700k). Every AMD CPU tested so far shows a significant difference, including a 5950x and a 5600x.

Out of curiosity I ran your program on my MacBook Pro M1:

➜  Raytracing-Rust git:(main) ./target/release/cpu_raytracer -S 1 -X 1920 -Y 1080 -N 50 -O out.png 
------------------------------
Finised rendering image!
------------------------------
Width: 1920
Height: 1080
Samples per pixel: 50
Render Time: 1 minute, ~0 seconds
------------------------------
➜  Raytracing-Rust git:(main) ./target/release/cpu_raytracer -S 1 -X 1920 -Y 1080 -N 50           
------------------------------
Finised rendering image!
------------------------------
Width: 1920
Height: 1080
Samples per pixel: 50
Render Time: 1 minute, 3 seconds
------------------------------

So, no difference on the ARM processor of the M1.


Right! Well at least that's consistent with what I'm seeing too :slight_smile:

I have to admit that I was a little nerd-sniped by this, so I tried profiling using AMD uProf. Surprisingly, the major bottleneck seems to be this slice index in BVH.get_indices:

let node = &self.nodes[node as usize];

and uProf reports high numbers of MISALIGNED_LOADS, which I gather occur when a memory load crosses a cache line or page boundary. I found I can reduce those by forcing nodes to align to 32-byte boundaries like:

#[repr(align(32))]
struct Node {
...
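Here is a self-contained sketch of that alignment trick; the field layout is made up for illustration (the real Node also carries leaf data):

    // Forcing 32-byte alignment means a Node never straddles a 64-byte
    // cache line, at the cost of some padding. Fields are hypothetical.
    #[repr(align(32))]
    struct Node {
        aabb: [f32; 6],          // min/max bounds per axis
        left_child: Option<u32>, // assuming NodeIndex = u32
        right_child: Option<u32>,
    }

    fn main() {
        assert_eq!(std::mem::align_of::<Node>(), 32);
        // The size is rounded up to a multiple of the alignment (64 here).
        println!("Node size: {} bytes", std::mem::size_of::<Node>());
    }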

The other thing I realised was that there's nothing special about the command line here: any heap allocation of the right size is enough to affect performance. I just added a line to main.rs:

let junk = "thequickbrownfox".to_string();

along with a debug print of junk at the end (to stop it being optimised away), and that had the same impact that the command line switch had.

So my working theory is:

  • the high number of threads running heavily recursive code is generating a lot of memory traffic

  • when memory is busy, the misaligned load penalties are higher (maybe we're missing the caches more frequently)

  • small heap allocations (such as command line arguments) affect the placement of the self.nodes Vec in memory, and the performance ends up being determined by which Node structs land on cache line boundaries. If a heavily accessed node ends up on a boundary, the misaligned load penalty has a noticeable impact

Avoiding the recursion in the hot code paths like BVH.get_indices seems like the most robust way to avoid this :slight_smile:


Thank you so much, I would never have figured this out on my own! Funnily enough, I was already planning to get rid of recursion in my code, since I want to add the option of using a GPU.
Again, thank you :slight_smile:


This reminds me a lot of this paper, "Producing Wrong Data Without Doing Anything Obviously Wrong!" by Mytkowicz, Diwan, Hauswirth, and Sweeney:

This paper presents a surprising result: changing a seemingly innocuous aspect of an experimental setup can cause a systems researcher to draw wrong conclusions from an experiment. What appears to be an innocuous aspect in the experimental setup may in fact introduce a significant bias in an evaluation. This phenomenon is called measurement bias in the natural and social sciences.

The authors show that changing the size of the environment variables affects the performance of a C program more than the difference between compiling with -O2 and -O3. This is because Linux (like most Unixes) passes environment variables at the base of the stack, so the size of the environment affects the alignment of the stack pointer. This means that performance may even be observably affected by the username of the person running the benchmark.

Command-line arguments are also passed on the stack, so they can affect stack alignment as well.
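A small probe to watch this happen (my own sketch, not from the paper): the program re-launches itself with environment blocks of different sizes and prints the address of a stack local. ASLR needs to be disabled (e.g. by running under setarch -R) for the shift to be attributable to the environment size alone:

    use std::process::Command;

    fn main() {
        // Child mode: report where a stack local landed.
        if std::env::var("PROBE").is_ok() {
            let local: u8 = 0;
            println!("stack local at {:p}", &local);
            return;
        }

        // Parent mode: re-run this binary with differently sized environments.
        for pad in [0usize, 100, 10_000] {
            let out = Command::new(std::env::current_exe().unwrap())
                .env("PROBE", "1")
                .env("PAD", "x".repeat(pad))
                .output()
                .unwrap();
            print!("pad={:>6}: {}", pad, String::from_utf8_lossy(&out.stdout));
        }
    }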

I'm going to bet that removing the recursion affected performance by making your program less sensitive to the stack's alignment.

This talk by Emery Berger, Performance Matters, covers some of these same ideas, and presents the tools he made to overcome these sources of bias.

