Hi,
I have an algorithm that runs the same function on hundreds of millions of inputs (given as a Range), and I am trying to parallelise it efficiently. The runtime of the function varies strongly with the input and cannot be predicted efficiently, so any static distribution of the workload will be unbalanced as well.
I am currently using a parallel rayon iterator with static chunking, i.e. I iterate over the inputs in chunks of 10_000. This works well, but towards the end of the run the chunks that take longer keep a few processors busy longer than necessary while the others sit idle.
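For reference, a minimal version of my current setup looks roughly like this (with `expensive` standing in as a placeholder for the real function):

```rust
use rayon::prelude::*;

// Stand-in for the real function; the real runtime varies unpredictably with `i`.
fn expensive(i: u64) -> u64 {
    (0..(i % 1_000)).fold(i, |acc, x| acc.wrapping_add(x))
}

fn main() {
    let inputs = 0u64..100_000_000;
    const CHUNK: u64 = 10_000;
    let n_chunks = (inputs.end - inputs.start + CHUNK - 1) / CHUNK;

    // Static chunking: each chunk index becomes one rayon work item.
    let total: u64 = (0..n_chunks)
        .into_par_iter()
        .map(|c| {
            let start = inputs.start + c * CHUNK;
            let end = (start + CHUNK).min(inputs.end);
            (start..end).map(expensive).sum::<u64>()
        })
        .sum();

    println!("{total}");
}
```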
If I make the chunks smaller, I see a lot of red bars in htop, i.e. overhead from locking between the threads.
Now I could try to search for the "optimal" chunk size, but I want my algorithm to run on different machines and on inputs with different runtime characteristics, so a single hand-tuned chunk size will not work everywhere.
Instead, I would like an iterator (or any other abstraction) that takes a Range as input and processes it in parallel with my function. It should hand chunks of the Range to each thread and dynamically adapt the chunk size: small enough to keep the load balanced, but large enough to keep the locking overhead close to zero.
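Conceptually, the interface I have in mind looks something like this (entirely hypothetical; the body is a plain rayon stub standing in for the adaptive scheduler, the point is that the caller never picks a chunk size):

```rust
use rayon::prelude::*;
use std::ops::Range;

// Hypothetical interface I am looking for. A real implementation would hand
// out large chunks first and shrink them as the range drains, so that the
// stragglers near the end stay small.
fn adaptive_for_each<F>(range: Range<u64>, f: F)
where
    F: Fn(u64) + Sync + Send,
{
    range.into_par_iter().for_each(f);
}

fn main() {
    adaptive_for_each(0..1_000_000, |i| {
        // stand-in for the real, runtime-inhomogeneous function
        std::hint::black_box(i.wrapping_mul(2_654_435_761));
    });
}
```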
Is there a feature in rayon I have not found that does this? Or does anyone know of a different crate that does this?