I am trying to make Rust just produce work in a thread, and count the amount of work done at a certain interval. The work that keeps Rust running without anything taking it off the CPU is essentially:
loop {
    _x += 1;
    _x -= 1;
}
This is running inside a thread, so multiple threads each have these variables in their own scope. AFAIK the variables live in CPU registers, so this code in one thread should not be influenced by the other threads.
To count the amount of work done by each thread, I use a vector of Arc<AtomicU64>.
By indexing the vector elements by thread number, I assume each element is also completely isolated from the other threads.
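A minimal sketch of the structure described above (the thread count and the per-iteration fetch_add, instead of the stepped counting discussed later, are simplifications, not the complete playground code):

use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

fn main() {
    let nthreads = 2; // assumed; the real code presumably takes this from configuration
    // One Arc<AtomicU64> per thread; each thread only touches its own element.
    let counters: Vec<Arc<AtomicU64>> =
        (0..nthreads).map(|_| Arc::new(AtomicU64::new(0))).collect();

    for nr in 0..nthreads {
        let counter = Arc::clone(&counters[nr]);
        thread::spawn(move || {
            let mut _x: u64 = 0;
            loop {
                _x += 1;
                _x -= 1;
                counter.fetch_add(1, Ordering::Relaxed);
            }
        });
    }
    // ... the main thread would periodically read and print the counters ...
}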
The question I have is whether the threads that perform the work are truly independent and do not influence each other via anything used in the code. The reason for asking is that when I measure the amount of work done on a multi-processor (AARCH64) system, there is a slight degradation of per-thread performance as I go from 1 to 2 and up to as many threads as there are cores. Obviously, with more threads than CPUs the per-thread work goes down significantly.
Playground link for the complete code: Rust Playground
First, the loop you've shown doesn't need any threads (not even 1) since it's an infinite loop with no side effects. You can't make any performance judgements on code that doesn't do anything.
And you don't need atomics if you aren't actually sharing data. This can be safely done with just thread::scope.
use std::thread;

const COUNTER_STEP: u64 = 1_000_000_000; // assumed value; define to taste

fn main() {
    let nthreads = 2; // assumed
    let mut counter_vector = vec![0u64; nthreads];
    thread::scope(|scope| {
        let mut threads = vec![];
        for (nr, counter) in counter_vector.iter_mut().enumerate() {
            threads.push(
                thread::Builder::new()
                    .name(format!("cpu-eater-w-{}", nr))
                    .spawn_scoped(scope, move || {
                        //println!("threadid: {:?}", osthread::get_raw_id());
                        let mut loop_counter: u64 = 0;
                        loop {
                            loop_counter += 1;
                            if loop_counter == COUNTER_STEP {
                                // Each thread holds a &mut to its own element,
                                // so no atomics are needed.
                                *counter += 1;
                                loop_counter = 0;
                            }
                        }
                    })
                    .expect("spawning thread failed"),
            );
        }
        for thread in threads {
            if thread.join().is_err() {
                println!("thread panicked");
            }
        }
    });
}
Rust Playground
Thank you, this is very useful.
What I am trying to do is run "something" in a thread that occupies the CPU and is unlikely to block, in order to investigate CPU scheduling.
That something is the _x variable being incremented and decremented. It serves no purpose other than occupying some CPU. Because it serves (or is supposed to serve, which is why I asked) no function, and also does nothing that can block, it should stay on the CPU indefinitely.
However, to quantify the amount of work, in order to see the effect of the scheduling, I decided to add a counter per thread; because of the speed of modern CPUs, it only counts once every COUNTER_STEP iterations.
And to be able to see the work per thread, the counts need to be added to something, which is the counter_vector.
Your suggestion of scoped threads works, but I am unable to make the ctrlc::set_handler() callback pick up the counts that accumulate in counter_vector.
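For reference, a sketch of one way that conflict is commonly resolved (assuming the ctrlc crate): set_handler requires a 'static + Send closure, which cannot capture the &mut borrows held by the scoped threads, so the counters go back to being shared atomics behind an Arc:

use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

fn main() {
    let nthreads = 2; // assumed
    let counters: Arc<Vec<AtomicU64>> =
        Arc::new((0..nthreads).map(|_| AtomicU64::new(0)).collect());

    // The handler must be 'static + Send, so it owns its own Arc clone.
    let handler_counters = Arc::clone(&counters);
    ctrlc::set_handler(move || {
        for (nr, counter) in handler_counters.iter().enumerate() {
            println!("thread {}: {}", nr, counter.load(Ordering::Relaxed));
        }
        std::process::exit(0);
    })
    .expect("setting ctrl-c handler failed");

    for nr in 0..nthreads {
        let counters = Arc::clone(&counters);
        thread::spawn(move || loop {
            counters[nr].fetch_add(1, Ordering::Relaxed);
        });
    }
    loop {
        thread::park(); // keep main alive; the ctrl-c handler exits the process
    }
}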
The general question I have is whether the additions to the local _x and loop_counter variables, the check against the COUNTER_STEP constant, and each thread's independent access to its own Arc<AtomicU64> vector element can directly influence each other across threads. If they cannot, then on a system with 2+ CPUs the amount of work per thread should be the same whether I run 1 thread or 2 threads.
I currently see slightly less work done for each thread when I add threads.
Your "something" is nothing. Not only is the +1/-1 equivalent to nothing, the variable is never read, so Rust will completely optimize it out. Even if it didn't, it's such a small computation that it doesn't matter.
All you're doing is measuring the counter's worst case scenario. Trying to figure out how performant your counter is without an accompanying workload is useless. Sometimes, a more expensive counter can become cheaper when paired with a real workload. You need to try your actual workload if you want to get any meaningful information.
Thank you.
Point taken: what I do is too simple and can be optimised out.
So effectively it's just increasing loop_counter and then adding one to the vector element.
All I want to accomplish is to make the thread run "something" that doesn't call anything and doesn't load anything from main memory, and therefore just consumes a consistent amount of CPU time, so that counting it gives an impression of the amount of time the thread spent on the CPU.
Would another loop counting from 0 to 1000 before incrementing the loop counter be something that consumes CPU?
Again: I am trying to force consistent work from Rust, not useful work, for the sake of being able to test the Linux scheduler while being absolutely sure what Rust does.
I actually tested on godbolt and the disassembly does seem to perform the addition and subtraction: Compiler Explorer
(10622: inc eax, 10630: dec eax)
You need to compile it with optimizations.
If you want some work, do something that can't be optimized away, like computing primes and sinking them into black_box.
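A sketch of that suggestion, using std::hint::black_box around a deliberately naive trial-division primality test (the exact workload is an illustration, not a prescription):

use std::hint::black_box;

// Deliberately naive trial division, so every call does real work.
fn is_prime(n: u64) -> bool {
    if n < 2 {
        return false;
    }
    (2..).take_while(|d| d * d <= n).all(|d| n % d != 0)
}

fn main() {
    let mut n: u64 = 0;
    loop {
        n += 1;
        // black_box marks the value as observed, so the optimizer
        // cannot eliminate the computation as dead code.
        black_box(is_prime(black_box(n)));
    }
}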
In what ways are you testing the scheduler? Threads can be preempted so it doesn't really matter what work they do (even sleeping). The objective should probably be better specified to avoid confusion and wasting time trying to understand what you actually want to accomplish.
I want the Rust program to be doing two things:
- Be running without doing anything that would make it halt or block, or would otherwise give the scheduler any reason to take it off the CPU. That is why I try to do something in a loop that lets the CPU perform work using, I am fairly confident, registers only.
- Make the work take a consistent amount of time/CPU. This way, I can see whether it performs the exact same amount of work per thread when changing the number of threads. Going from 1 to 2 threads on a multiprocessor system, the total amount of work should double.
I should say that I am using bpftrace on the operating system side, with tracepoints for moving onto and off the CPU, so that I can actually measure on-CPU time.
I am nearly there.
If I run a single thread, I can see that recent Linux kernels can let a task/thread run for up to 4 seconds uninterrupted. With two threads and many more CPUs available this is also true; however, the number of "steps" per thread gets slightly lower.
Once the number of threads is equal to or greater than the number of CPUs, the slice time seems to drop to around 10ms, roughly. It seems there can be a lot of reasons to push a user task off the CPU: despite the minimum slice time in Linux of 1.5ms, the test program also shows slice times of 4 to 12500 microseconds.
I was wrong again about the black box: what is visible on godbolt is unoptimised code. If I change the addition and subtraction to:
_x = black_box(black_box(_x) + black_box(1));
_x = black_box(black_box(_x) - black_box(1));
I notice the loop counting goes much slower, which is an indication that it had indeed been optimised out in the optimised build.
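For reference, a self-contained version of that loop (a sketch; the COUNTER_STEP value and the 10-second runtime are assumptions) that should keep doing the work even in an optimised build:

use std::hint::black_box;
use std::thread;
use std::time::Duration;

const COUNTER_STEP: u64 = 100_000_000; // assumed value

fn main() {
    thread::spawn(|| {
        let mut _x: u64 = 0;
        let mut loop_counter: u64 = 0;
        let mut steps: u64 = 0;
        loop {
            // black_box on every operand keeps the add and subtract
            // in the optimised binary.
            _x = black_box(black_box(_x) + black_box(1));
            _x = black_box(black_box(_x) - black_box(1));
            loop_counter += 1;
            if loop_counter == COUNTER_STEP {
                steps += 1;
                loop_counter = 0;
                println!("steps: {steps}");
            }
        }
    });
    thread::sleep(Duration::from_secs(10));
}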