I'm a non-native English speaker, so if I don't make myself clear, feel free to point it out. Thank you very much.
Consider building a multi-threaded binary that requires high performance. The common idea is to set the number of threads to the number of CPU cores; on a hyper-threaded CPU we can sometimes set it to 2x. Is this the right approach in the normal case?
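What I mean is roughly this (a minimal Rust sketch; `do_work` is just a placeholder for the real work):

```rust
use std::num::NonZeroUsize;
use std::thread;

fn main() {
    // Ask how many threads the OS suggests running in parallel
    // (usually the number of logical cores, i.e. hyper-threads).
    let n = thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1);

    println!("spawning {n} worker threads");

    let handles: Vec<_> = (0..n)
        .map(|i| thread::spawn(move || do_work(i)))
        .collect();

    for h in handles {
        h.join().unwrap();
    }
}

// Placeholder for the actual compute work each thread would do.
fn do_work(id: usize) {
    println!("worker {id} running");
}
```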
But I've been thinking that while our program is running, there must be many OTHER processes running too, and some of them are multi-threaded as well. So can I assume that in every "unit" of time there are more runnable threads than CPU cores?
Under this circumstance, how should we choose the number of threads our program uses? I know the answer could still be "the number of CPU cores", but why?
I'm not very familiar with OS kernel or CPU theory, so maybe this is a stupid question.
As far as I know, on Linux at least, the threads in your program will get allocated time in which to run on a more or less equal basis with the other processes on your machine and whatever threads they have internally. So basically, if your machine has a lot of work to do elsewhere, the performance of your threads will suffer.
One can take steps to change this:
On Linux the nice command can be used to run your program with a higher priority, in which case one's threads get more processor time.
On a multi-core machine one can prevent the Linux kernel from running processes on one or more cores, and then run one's own process on those "reserved" cores. See the isolcpus kernel boot parameter. This way one's threads can have 100% usage of a core with no interference from other processes. Use the Linux taskset command to run one's processes on those reserved cores.
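taskset pins a process to cores from the outside; one can also pin individual threads from inside a Rust program, for example with the third-party core_affinity crate (a rough sketch, assuming that crate's get_core_ids/set_for_current API; substitute whatever affinity library you actually use):

```rust
use std::thread;

fn main() {
    // Ask the OS which cores this process may run on.
    let core_ids = core_affinity::get_core_ids().expect("could not query core ids");

    let handles: Vec<_> = core_ids
        .into_iter()
        .map(|core_id| {
            thread::spawn(move || {
                // Pin this worker thread to one specific core so the
                // scheduler does not migrate it between cores.
                core_affinity::set_for_current(core_id);
                do_work();
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
}

fn do_work() {
    // Placeholder for the real compute work.
}
```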
Thank you for your reply!
So, as I understand it, no matter how many threads are running at the same time, each one is allocated a limited slice of CPU time to execute before the CPU switches to the next one?
But I still have a question: how can we decide a process's thread count? For example, if I set it to 3x the number of CPU cores, what will happen? (I think it will cause a loss of performance, but why?)
Perhaps to simplify the discussion, forget about multi-core CPUs and hyper-threading. Let's think about a single-core CPU. The same situation happens when there are many processes (or threads) that need to make progress, but only one can make progress at any point in time (i.e., the one currently being executed by the CPU). That is called concurrency, which is different from parallelism. So, you can imagine that, on this single-core CPU, 3 concurrent processes would all make progress faster (on average) than 100 concurrent processes. The entity in charge of deciding which process/thread makes progress is called the scheduler, which is one part of many operating systems.
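You can see this effect directly: give every thread the same fixed amount of compute work and compare how long each one takes to finish when you spawn 3 of them versus 100 of them (a rough sketch; busy_work and its iteration count are arbitrary placeholders):

```rust
use std::thread;
use std::time::Instant;

// Arbitrary CPU-bound placeholder work; the exact function doesn't matter.
fn busy_work() -> u64 {
    let mut x: u64 = 0;
    for i in 0..50_000_000u64 {
        x = x.wrapping_add(i).rotate_left(3);
    }
    x
}

fn average_thread_time(n_threads: usize) -> f64 {
    let handles: Vec<_> = (0..n_threads)
        .map(|_| {
            thread::spawn(|| {
                let start = Instant::now();
                std::hint::black_box(busy_work());
                start.elapsed().as_secs_f64()
            })
        })
        .collect();

    let total: f64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    total / n_threads as f64
}

fn main() {
    // With far more runnable threads than cores, each thread's own
    // wall-clock time goes up, because it keeps being descheduled so
    // that the others can make progress too.
    for n in [3, 100] {
        println!("{n} threads: {:.2}s average per thread", average_thread_time(n));
    }
}
```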
There are different scenarios to consider. For example:
1) One has a few threads that do a lot of compute work and little waiting on I/O. In that case one might want to have one thread per core to get the maximum work done out of each core.
2) Perhaps those threads are compute intensive but actually spend 50% of their time waiting on input. In that case one has 50% idle time if one does as in 1). So it would be better to put 2 threads on each core.
3) Perhaps, with hyper-threaded cores, it turns out to be better to run 2 compute-intensive threads on a core to make maximum use of the core's resources.
Of course one's actual application workload and I/O demands will be somewhere on a spectrum between those simply stated scenarios.
At the end of the day I think one will have to tune one's solution to find the optimal number of reserved cores, threads per core etc. to match the processor architecture and available cores to the demands of one's application. Measure the performance of your application, tweak things, measure again.
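"Measure, tweak, measure again" can be as simple as sweeping the thread count over a fixed total workload and timing each run (a rough sketch; process_item is a stand-in for whatever your application actually does per item):

```rust
use std::thread;
use std::time::{Duration, Instant};

const TOTAL_ITEMS: usize = 10_000_000;

// Stand-in for whatever your application actually does per item.
fn process_item(i: usize) -> u64 {
    (i as u64).wrapping_mul(2_654_435_761)
}

fn run_with_threads(n_threads: usize) -> Duration {
    let start = Instant::now();
    let chunk = TOTAL_ITEMS / n_threads;

    let handles: Vec<_> = (0..n_threads)
        .map(|t| {
            thread::spawn(move || {
                let mut acc = 0u64;
                for i in t * chunk..(t + 1) * chunk {
                    acc = acc.wrapping_add(process_item(i));
                }
                std::hint::black_box(acc);
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    start.elapsed()
}

fn main() {
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);

    // Try 1x, 2x and 3x the core count and see which is actually fastest
    // for this particular workload on this particular machine.
    for n in [cores, cores * 2, cores * 3] {
        println!("{n:3} threads: {:?}", run_with_threads(n));
    }
}
```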
So the exact work each thread does can lead to different strategies. I had not considered this.
All in all, it seems that there is no "typical" answer to my question; it depends on the proportions of I/O and compute?
There is no typical answer. It all depends on the I/O and compute your application needs to do, and the workloads it actually sees in use.
One scenario I did not describe is the case where one has hundreds or thousands of threads that have little compute work to do and spend a lot of time waiting on I/O. So-called "I/O bound tasks". In that case it might be better not to use threads at all and use async tasks instead. The idea being that with little actual compute work to do the overheads of all that thread scheduling by the kernel become significant, and using an async executor to switch between async tasks is more efficient. Not to mention saving a lot of memory by not having to create lots of threads and their stacks.
How well async actually achieves this I don't know. I have heard many arguments about how it does or does not help. Sadly, evaluating async vs normal threading for one's actual application is a lot harder, as it requires major changes to the application to find out.
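For a rough idea of what the async side looks like: spawning tens of thousands of I/O-bound tasks on an executor is cheap compared to spawning the same number of OS threads. A sketch, assuming the tokio crate (with its "full" feature set); the sleep stands in for real I/O waits:

```rust
use std::time::Duration;

#[tokio::main]
async fn main() {
    // Tens of thousands of tasks like this are fine: each one is just a
    // small state machine, not an OS thread with its own stack.
    let handles: Vec<_> = (0..10_000)
        .map(|_| {
            tokio::spawn(async {
                // Stand-in for waiting on a socket, a file, a timer, etc.
                tokio::time::sleep(Duration::from_millis(100)).await;
            })
        })
        .collect();

    for h in handles {
        h.await.unwrap();
    }
    println!("all tasks finished");
}
```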
What the OS tends to do is run each thread for a short time called a "slice" or "quantum" (for the meaning "unit of quantity", not physics!) then go to the next "ready" thread - that is one that isn't waiting for something that hasn't happened yet.
Which thread specifically gets the next slice is often based on a mix of multiple strategies, for example:
Round robin: the simplest option, have an order for each thread to be run that you just loop around. The most "fair", as each thread gets an opportunity to progress, but can waste a lot of time progressing "unimportant" threads or waking a thread that just immediately finds it's blocked by something else.
Priority: threads have a priority value explicitly set by a user or program, or an implicit one set by the OS: higher priority threads run first. Good when you have actual time constraints like media playback, or ensuring that the user interface stays responsive, but it needs to be carefully designed to ensure that higher priority threads that are always ready don't just run continuously without ever giving lower priority threads a chance to run. In the worst case, a high priority thread is continuously checking to see if it can take a lock held by a low priority thread, and ends up blocking itself!
Locality: a thread that was recently running on a processor core probably still has its instructions and data cached on that core, so the OS should try to keep it on that core to some extent. A common approach to this is "work stealing": run the general approach separately for each core, spreading new threads evenly, but when one core needs more work (most obviously because everything on it is blocked), take work from some other core that has too much. Generally a really big win, but it can be hard to balance what counts as "needing more work".
There's a lot of ongoing work figuring out new strategies, ways to combine them, and tuning the existing ones to improve performance and responsiveness. The best you can do is often just pick a number of threads, often the number of cores or a small factor of it, so that you always have work ready to run. Note in particular that boosting thread priority is often a bad idea: it doesn't make the CPU work any harder, it just means other code runs less, and that other code might be your development tools.
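In Rust you usually don't implement any of that yourself: a data-parallel library like rayon maintains a user-space work-stealing thread pool, sized to the number of logical cores by default, and lets you override that if measurement says a different number works better (a sketch, assuming the rayon crate):

```rust
use rayon::prelude::*;

fn main() {
    // rayon's global pool defaults to one thread per logical core; the
    // num_threads call is only needed if you want a different, tuned value.
    rayon::ThreadPoolBuilder::new()
        .num_threads(8) // hypothetical tuned value; omit to use the default
        .build_global()
        .expect("global pool already initialised");

    // The range is split into chunks and idle worker threads steal chunks
    // from busy ones: the "work stealing" idea described above, done in
    // user space instead of the kernel.
    let sum: u64 = (0..10_000_000u64).into_par_iter().map(|i| i % 7).sum();
    println!("sum = {sum}");
}
```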