Hi, I'm building a raytracer for practice. I did the entire thing as single threaded, and now I'm converting it to multi-threaded.
I initially mutated a string which was SUPER fast (but out of order obviously)
I changed it to this method and now it's multiple orders of magnitude slower.
What is the best way to achieve this sort of "multi-threaded write index to array in the fastest way possible"?
Edit: As an aside... I'm also really confused how it's so much slower. I thought it was the mutex, but my old code which wrote to a string used this inside the inner for loop, which is still using a mutex:
Here is my attempt to improve it, which made performance worse:
pub fn render<T: Hittable>(&self, world: &T) {
    let bar = progress_bar(self.image_height as u64);
    let buffer = Arc::new(Mutex::new(vec![
        Color::new(0, 0, 0);
        self.image_width as usize * self.image_height as usize
    ]));
    (0..self.image_height).into_par_iter().for_each(|j| {
        (0..self.image_width).into_par_iter().for_each(|i| {
            let index = (j * self.image_width + i) as usize;
            for _ in 0..self.samples_per_pixel {
                let r = self.get_ray(i, j);
                buffer.lock().unwrap()[index] +=
                    self.ray_color(r, self.max_depth, world) * self.pixel_samples_scale;
            }
        });
        bar.inc(1);
    });
    let mut out = format!("P3\n{} {}\n255\n", self.image_width, self.image_height);
    for color in buffer.lock().unwrap().iter() {
        write_color(&mut out, *color);
    }
    print!("{}", out);
}
You should be doing coarse grained work in each worker thread. The best way is breaking the image into disjoint chunks that each thread can work on independently. A chunk in this case is just a contiguous run of the image buffer which may span multiple horizontal image lines.
Split the image buffer into N chunks with rayon's slice::par_chunks_mut or slice::par_chunks_exact_mut. Don't use nested parallel iterators and don't use fine-grained work like one-pixel-per-thread.
I'm not sure what the lock is for, either, but its existence is defeating the purpose of rayon.
If you can prove to the compiler that multiple mutable references are disjoint, then they can exist simultaneously (and concurrently) which is what the rayon::slice module does.
@parasyte Awesome, thank you... so, kind of like this? Can it be improved further?
pub fn render<T: Hittable>(&self, world: &T) {
let bar = progress_bar(self.image_height as u64);
let mut colors =
vec![Color::new(0, 0, 0); self.image_width as usize * self.image_height as usize];
colors
.par_chunks_exact_mut(self.image_width as usize)
.enumerate()
.for_each(|(j, line)| {
for (i, color) in line.iter_mut().enumerate() {
for _ in 0..self.samples_per_pixel {
let r = self.get_ray(i as u32, j as u32);
let sample_color =
self.ray_color(r, self.max_depth, world) * self.pixel_samples_scale;
*color += sample_color;
}
}
bar.inc(1);
});
let mut out = format!("P3\n{} {}\n255\n", self.image_width, self.image_height);
for color in colors.iter() {
write_color(&mut out, *color);
}
print!("{}", out);
}
My general supposition is that a single pixel is always more fine-grained than the work for a single thread on modern hardware, which tends to have a low number of usable CPU cores compared to the number of pixels in the image.
Blender's Cycles ray-tracing renderer uses configurable tile sizes for jobs [1]. With a GPU, using fine grained jobs starts making sense. SIMD on CPU would also be per-pixel, but you are greatly limited by available SIMD lanes. Perhaps 4 or 8 pixels at a time with SIMD? And auto-vectorization might do it, implying a loop over a coarse-grained workload.
You're right, it is not worth overthinking. Rules of thumb are a useful starting point, however.
Edit: It doesn't matter if it's a CPU job or a GPU job, tiles are always available. According to 4 Easy Ways to Speed Up Cycles (Blender Guru), smaller tiles are better for CPU, and larger tiles are better for GPU. Just FWIW - Cycles is an altogether different beast.
Makes me wonder if rayon or some wrapping crate has some "dynamic chunking" system where it tries increasing or decreasing the chunk size to see what size is optimal...
Yes, rayon should perform such "auto chunking". So even if your workload is tiny, like:
// Assume v is huge vector of integers
v.par_iter_mut().for_each(|x| *x += 1);
rayon should split the workload into just a few bigger chunks, and then use a regular loop within each of these chunks (with all the optimizations that usually apply to loops, such as autovectorization). So, usually, there's no need to manually chunk rayon's input. But if you want, rayon exposes with_min_len and with_max_len to tune the chunking behaviour. I've only ever used with_max_len = 1, when I knew each operation was heavy and I had only a few elements, so I wanted rayon to always spawn a task per op.
There's one detail in OP's case though: bar.inc(1). If we relied on rayon's auto-chunking, this op would be called once per pixel, not once per row, so we'd simply be doing more work.