I have given Ray Tracing in One Weekend a go and added a few of the features from Ray Tracing: The Next Week as well. Since I don't plan to develop anything more advanced in terms of ray-tracing features at the moment, I decided to add parallelism. I implemented a thread pool as per the Rust Book, so that I can start multiple workers and send them render jobs via mpsc channels.

Since the task at hand is trivially parallelizable, I expected the speedup to be linear in the number of threads, or at the very least in the number of cores on my system. I am on a Pop!_OS 22.04 LTS build with a Ryzen 5950X, so there should be plenty of room for speedup. Unfortunately, the results are underwhelming to say the least: using all 32 threads I get only about a 5x speedup compared to the single-threaded version. I would appreciate some pointers on what the issue could be. I used perf to gather some stats; here are my observations:
- The most time-consuming function is `ray_color` in both the single- and multi-threaded release builds.
- The most time-consuming function is `rand_chacha::guts::refill_wide` in both the single- and multi-threaded debug builds.
- The multi-threaded release build has a ~30% cache miss rate vs 0.3% for the single-threaded version (this tells me there is some false sharing somewhere, but I am not sure where, or how to detect it).
- The multi-threaded debug build has the same low (0.3%) miss rate as the single-threaded version.
- Playing with the chunk size did not change any of this, which leads me to conclude that all threads are mostly saturated at all times and the workload is balanced.
I suspected that the random number generation could cause cache-coherency issues, as mentioned in this video, but I am using the `thread_local!` macro, which is supposed to alleviate that. Any help is appreciated. The code can be found here.
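For reference, the per-thread RNG setup is roughly the following pattern (a simplified sketch rather than the actual code, written against the rand 0.8-style API; the `ChaCha8Rng` choice and the helper name are just stand-ins for what the repository uses):

```rust
use std::cell::RefCell;

use rand::{Rng, SeedableRng};
use rand_chacha::ChaCha8Rng;

thread_local! {
    // One RNG instance per worker thread: no RNG state is shared between
    // threads while sampling, so the generator itself should not be a
    // source of (false) sharing.
    static RNG: RefCell<ChaCha8Rng> = RefCell::new(ChaCha8Rng::from_entropy());
}

/// Uniform f64 in [0, 1) drawn from the calling thread's own RNG.
fn random_double() -> f64 {
    RNG.with(|rng| rng.borrow_mut().gen::<f64>())
}
```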
I would suggest trying out rayon instead of a custom thread pool. rayon is the de facto standard library for efficiently executing parallel compute code in Rust. It:

- Implements work-stealing, which can improve performance by reducing the overall scheduling overhead (no communication is necessary until some thread finds that it has nothing to do) and by reducing the need to pre-define chunk sizes (tasks are split dynamically when there are too few of them to occupy the thread pool).
- Allows the parallel tasks to borrow input data, so you don't have to clone it for each task.
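As a rough illustration of both points, a per-row render loop can be written against a shared framebuffer like this (a sketch with assumed names; `shade` stands in for whatever per-pixel function your renderer already has, e.g. something that calls your `ray_color`):

```rust
use rayon::prelude::*;

/// Sketch: split the framebuffer into rows and shade each row as its own
/// rayon task. The `shade` closure only borrows whatever it captures
/// (scene, camera, ...), so nothing has to be cloned per task, and rayon's
/// work-stealing decides how the rows are distributed across the pool.
fn render_into<C, F>(framebuffer: &mut [C], width: usize, shade: F)
where
    C: Send,
    F: Fn(usize, usize) -> C + Sync,
{
    framebuffer
        .par_chunks_mut(width) // one mutable row per task
        .enumerate()
        .for_each(|(y, row)| {
            for (x, pixel) in row.iter_mut().enumerate() {
                *pixel = shade(x, y);
            }
        });
}
```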
Thanks, I will look into that. I am mostly doing it manually for learning purposes, so figuring out what the issue is would be a valuable lesson.
Edit: Doing a quick rayon test I get the same speedup of around 5x, which suggests that the bottleneck is something in the task itself rather than the thread-pool implementation.
My experience with Rayon, as well as with custom thread pools, is that if jobs are small and there are more workers running than available jobs, then for those periods all the workers contending with each other (for example, all workers listening on the same channel) can burn a noticeable amount of extra CPU. That said, Rayon has fixed or improved many such cases over its history.
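If that kind of overhead shows up with Rayon on cheap items, one knob worth knowing about is `with_min_len`, which puts a floor on how finely an indexed parallel iterator gets split. A minimal sketch (the numbers are arbitrary):

```rust
use rayon::prelude::*;

fn main() {
    // with_min_len(16) stops rayon from splitting the range into pieces
    // smaller than 16 items, which bounds per-task scheduling overhead
    // when the work per item is cheap.
    let squares: Vec<u64> = (0..1080u64)
        .into_par_iter()
        .with_min_len(16)
        .map(|y| y * y) // stand-in for real per-row work
        .collect();
    assert_eq!(squares.len(), 1080);
}
```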