Hello everyone. I'm currently designing a brush engine for which the goal is to take arbitrarily defined node graphs and execute them with parallelism in mind. Think something like
for a simple example.
While it works decently well (and outperforms some industry standard programs!) the main bottleneck seems to be coming from the overhead in the task scheduler. The canvas is organized in tiles and since many operations are "tilewise" (eg. blending two layers can be done in pairs of tiles from each) this makes the task a great fit for parallelism. Tiles usually contain 4kb of pixel data (64 x 64 x RGBA 8 bit channels) which makes every individual tilewise task tiny.
I've been trying so far to make my scheduler as lightweight as possible but I might be missing something really obvious (or maybe not so obvious) because when I try to scale past ~12 threads performance starts to decrease greatly, so much so that I lose any gains I got from multithreading by the time I get to 24 threads (how many logical cores I have and what Krita defaults to for parallelism on my machine.)
I'm assuming this has to do in some way with contention on cache lines or false sharing. I've tried to eliminate as much of it as possible, and it did work pretty well for the most part, but I'm reaching the limit of what I can figure out on my own.
I did try using standard library features at first, but I found that many of them would make my threads yield back to the OS scheduler and that was simply way too much overhead. I do use standard threads though (I don't think I know of any fitting alternative for that anyway.)
Are you referring to some specific place where I'd use standard channels?
I suggest using a profiler to narrow it down. cargo flamegraph is one easy first step. I've started using (on Linux) perf record -F 9999 --call-graph dwarf,32768 <exe_name> and then hotspot to analyze the output, as hotspot provides more views of the output, in addition to flamegraphs.
When the problem is fetching memory into CPU caches, this will show up as a simple memory access with an unusually high cost. I've found several such problems this way.
I'm not sure how an assert takes so much longer to do than iterating through an entire list. Maybe the debugging symbols got shuffled during optimizations. In the slice I took my actual pixel blending functions are taking up ~15k samples for a frame of reference.
I thought this could be a bottleneck where workers would steal back and forth from each other but I set a limit and now it's taking even longer (trying to keep the amount of samples doing actual work at 15k in my slice still)
All in all across my stuff the main thread is spending a lot of time spinning, around 15k samples in total as well, 5k of which in outer_backoff.spin() (current queue is empty, couldn't steal from other workers, but there are still tasks that remain). I figured that maybe I should try to do work first before trying to steal from other workers, so I swapped the work and steal loops, but that made things seemingly even worse and the ratio became 15k/7k.
10k backoff.spin() samples were waiting for the other threads once the main thread has finished its work.
it seems like every single other thread was at that same time spinning after having set its state back to WAIT. The maximum duration of a spin call should be 2^7 pause instructions, each apparently taking up to 64 cycles on my CPU. 128 * 64 clock cycles is nowhere near the ~1ms time the profiler is reporting for this spin call.
I think I might be reaching the limits of what sampling profilers can do for me, I might try amd uprof to get a better idea of what's going on...
Turns out my scheduler didn't have much of an issue aside from me not considering the topology of my CPU. Once I set the thread count to a number I'm computing based on info from hwlocality, it all went a lot better
Hi, I'd love it if you had a look at my library that I recently published.
It is designed for arbitrary recursive graph computations, and has baked in parallelization (optimized heavily, benchmarked) that rivals rayon strongly. The parallel executor lets you configure if the work-stealing happens per-thread or sharedly, and you also gain control over other properties of parallel execution. You can execute the same very thing sequentially or parallely, depending on your latency needs.