If I interpret it correctly, we accidentally run into depth-first scheduling, where the inner par_iter calls ‘trap’ threads, preventing the outer par_iter from generating new splitter workloads.
Looking at the code, and at cuviper’s explanation of how the deadlock happens, I see two general strategies for a way out, though I am completely unsure how feasible they are.
The first would be to make the Rayon scheduler somehow aware of the “depth” of the par_iter calls, and have it prioritize the shallowest/outermost par_iter first. Then all possible jobs would always have been queued before the locking happens.
However, I fear this would require some form of spooky action at a distance ((expensively) inspecting the stack? Some secret Rayon local storage?).
This scheduling change also wouldn’t work universally: what happens in a different program where the lock-trapping happens in the outer par_iter?
The next strategy would be to give each par_iter its own queue, plus at least one dedicated thread, so any call site is guaranteed to have at least one thread available.
In this instance, that would still trap almost all threads on the inner lock, but the one reserved outer thread would keep chugging along, eventually releasing the locks one at a time.
Deadlock avoided at the cost of parallelism.
Another downside is that such a dedicated thread is wasted whenever its own queue is empty but the others are still full.
Now that I’ve written them out, I dislike both my suggestions: they seem to add a lot of complexity to the current, elegantly simple Rayon model, and they incur costs even when they aren’t needed.
I hope these rough ideas can spark better ones.