From coming across WSS alot, I understand that this is a popular sheduler choice. For example in tokio. I just can't find out why this is used rather than a model where there is just one central queue, from which every thread gets a task when they are ready to do some work.
It seems to me that letting every thread have their own queue introduces the need for load balancing and work stealing which seems alot of hassle if you could just take one task at a time from a central queue.
massive amount of very small tasks, most with very similar execution time
one execution queue, prefilled with some fair (> 16) amount of tasks (maybe not all)
In this situation even in first iteration all threads will try to take first item from queue, but only one will succeed. The other 15 will fight for second task, then 14 will try to take third one, and so on. Even if fitst task is preasigned to thread, this will happen on second task (because all threads will finish its job in almost same moment). If tasks are very small its even possible, that when some threads are still trying to find something to work on (they are fighting for work on 13th item in queue), first two threads already done, and joined the fight! In practice instad of doing actual work, some threads are fighting for job to do. If tasks are small, and there are many threads, this may happen often.
Obviously it is also possible to create queue per thread, and don't steal any job. It will work pretty well, but the problem is if one thread will get jobs which are little smaller. If jobs are only a little smaller the thread will still include one queue problem (remember, that taking item for non-blocking queue still takes time), but when cumulated, there is some time, when not all calculations are yet finished, but one thread is doing nothing, because it done its job. This is starvation problem.
Stealing scheduler is trade off between those two aproach.
Ok, so yes contention needs to be dealt with, but I would imagine something like:
Threadpool has a sink for each thread in the pool.
A channel for each thread to send tasks
each thread has a stream for incoming tasks
When starting threadpool takes the tasks and sends one to each thread. Now each sink is NotReady. When thread sends back the result of the computation, the sink gets woken up and takes the next task from queue.
In this design, yes there might be some waiting time for the threads while threadpool is dispatching tasks, but at least the access of the queue is in a single thread, so it's lock free.
Would this be worse performance wise then organizing the work stealing with thread syncronisation? It seems a much simpler design than work stealing...
I suppose the limitation here is that each task is dispatched one by one?
Work stealing has some significant cache benefits:
It keeps all the children task of some subtask on the same core, with the assumption that if task A spawns task B, then they are more likely to share memory (or at least be close) than task A and some distant task Z
They generally operate on a stack model, which usually is much friendlier to the cache and child tasks than queues. Imagine recursively splitting an array with work stealing, where there is a lot of locality benefit to using a stack model with spawned tasks and best-effort keeping tasks on the same core.
This is possible for small amount of workers and unbalanced tasks size, otherwise you have all disadvantages of having single centralized queue (most of workers are doing nothing but waiting for tasks - no matter if they are actively fighting for it, or waiting for some all-time-busy-dispatching task).
Work-stealing queue scheduler is developed and optimized for traditional fork-join concurrency model, where each task can spawn zero or more subtasks, which can possibly be run in parallel. Its main benefit is that WSS pays synchronization cost only when needed, while central-queue needs to pay it for every tasks.