How to benchmark a multi-threaded method? Is Criterion only useful for single-threaded code?

Dear Rustaceans,

do you have any tips on how to "reliably" benchmark a multi-threaded method in Rust?
I'd like to benchmark something like the following (the actual logic is not important here; I only want to illustrate the "multi-threadedness" to give you an idea):

Link to playground

use rayon::prelude::*; // 1.5.0

// this function should be benchmarked
fn sum_parallel(nums: impl ParallelIterator<Item=u64>) -> u64 {
    nums.reduce(|| 0u64, |a, b| a + b)
}

fn main() {
    let sum = sum_parallel((0..10_000u64).into_par_iter());
    println!("{}", sum);
}

I've already tried Criterion, which works really well, but here comes the catch:
When I try to benchmark my method with Criterion, it reports a significantly higher time (~1.5×) for the measured method than when I benchmark it "manually" in my own code (with std::time::Instant::now() and .elapsed()).
My assumption is the following: my machine has two physical cores (and four logical cores). When I benchmark with Criterion, Criterion itself occupies at least one physical core, so the method under test effectively runs on a single physical core, which basically turns it into a single-threaded method.
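For reference, this is roughly what the "manual" timing looks like. This is a minimal sketch using only std::time::Instant; `sum_sequential` and `time_once` are hypothetical helper names (a sequential stand-in is used here so the snippet compiles without rayon), and a single measurement like this lacks the warm-up and repeated sampling that Criterion performs:

```rust
use std::time::{Duration, Instant};

// Sequential stand-in for the method under test (hypothetical helper).
fn sum_sequential(nums: impl Iterator<Item = u64>) -> u64 {
    nums.fold(0u64, |a, b| a + b)
}

// Time a single run. To approximate what Criterion does, you would need
// many iterations, a warm-up phase, and statistical aggregation on top.
fn time_once() -> (u64, Duration) {
    let start = Instant::now();
    let sum = sum_sequential(0..10_000u64);
    (sum, start.elapsed())
}
```

A single `Instant` measurement is very sensitive to caches, frequency scaling, and background load, which is exactly why it can disagree with Criterion's numbers.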

Can I maybe configure Criterion for this scenario? Or is it simply not the right tool for this (I know it is intended for microbenchmarks)?

Do you have any suggestions for other tools or approaches?

Thank you! :heart:

It's very easy to get misleading results from bare Instant::now(), so Criterion is more likely to give you the true result here.

There are many situations in which multi-threaded code may really be slower than single-threaded code:

  • rayon's management of its work-stealing queues is not free. If an individual task is very small/quick, the cost of dividing the work between threads can be significant compared to the work itself. Use multithreading only for large datasets/big jobs, and use with_min_len() in rayon to set a minimum chunk size per task.

  • if your individual task is basic arithmetic like a + b, then splitting the work into individual items makes you miss out on SIMD: it's faster for 1 core to perform one 16-lane SIMD operation than for 16 cores to each perform 1 scalar operation. Having loops inside your rayon tasks can give you the best of both.

  • False sharing. If multiple threads read and write nearby addresses in memory (addresses that share a cache line), the CPU has to synchronize that cache line between cores, making memory accesses very slow; the code then behaves almost like single-threaded code. Make sure the work is separated into large chunks, and ideally the chunks should be cache-line-aligned.
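To make the false-sharing point concrete, here is a stdlib-only sketch (no rayon; `Packed`, `Padded`, and the `bump_*` helpers are made-up names for illustration). Both versions compute the same result, but on a multi-core machine timing them typically shows the cache-line-padded version running noticeably faster:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// Two counters packed next to each other: they likely share a cache line,
// so two threads hammering them contend even though the data is "separate".
struct Packed {
    a: AtomicU64,
    b: AtomicU64,
}

// Aligning each counter to its own 64-byte cache line avoids false sharing.
#[repr(align(64))]
struct Padded(AtomicU64);

fn bump_packed(n: u64) -> (u64, u64) {
    let p = Packed { a: AtomicU64::new(0), b: AtomicU64::new(0) };
    thread::scope(|s| {
        s.spawn(|| for _ in 0..n { p.a.fetch_add(1, Ordering::Relaxed); });
        s.spawn(|| for _ in 0..n { p.b.fetch_add(1, Ordering::Relaxed); });
    });
    (p.a.load(Ordering::Relaxed), p.b.load(Ordering::Relaxed))
}

fn bump_padded(n: u64) -> (u64, u64) {
    let a = Padded(AtomicU64::new(0));
    let b = Padded(AtomicU64::new(0));
    thread::scope(|s| {
        s.spawn(|| for _ in 0..n { a.0.fetch_add(1, Ordering::Relaxed); });
        s.spawn(|| for _ in 0..n { b.0.fetch_add(1, Ordering::Relaxed); });
    });
    (a.0.load(Ordering::Relaxed), b.0.load(Ordering::Relaxed))
}
```

(64 bytes is a common x86 cache-line size; other architectures may differ. `std::thread::scope` requires Rust 1.63+.)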


Hey @kornel,
sorry for my late reply.
Thank you for your detailed answer. I've tested it again using 100_000_000 elements (!) in the collection (and mapping each element with |x| x * 2), and it shows a 40% speedup for the parallel version over the non-parallel version when measured with Criterion.
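The "large enough chunks" idea can be sketched without rayon using `std::thread::scope`. This is only an illustration under assumptions: the function name and the `n_threads` parameter are made up, not rayon API, but the principle is the same one `with_min_len()` enforces — each thread gets one big contiguous chunk, so per-thread overhead is amortized over many elements:

```rust
use std::thread;

// Splits the input into one large chunk per thread, doubles each element,
// and sums the partial results. Large chunks keep the per-thread
// bookkeeping small relative to the actual work (the same idea as
// rayon's with_min_len()).
fn sum_parallel_chunks(data: &[u64], n_threads: usize) -> u64 {
    // Ceiling division so all elements are covered.
    let chunk = (data.len() + n_threads - 1) / n_threads;
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk.max(1))
            .map(|c| s.spawn(move || c.iter().map(|&x| x * 2).sum::<u64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}
```

The inner `c.iter().map(...).sum()` is a plain sequential loop, which also gives the compiler a chance to auto-vectorize, touching on the SIMD point above.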

So you are right: Criterion gives the correct results even for multi-threaded code.

It was probably my mistake not having large enough chunks.

I didn't know that multiple threads reading and writing nearby addresses in memory cause a slowdown due to synchronization overhead. Interesting... :open_mouth:

Thank you for all the info. I've learned a lot! :heart:

Edit on 30.03.2021:
Regarding False Sharing (I didn't know this was an actual term), I've found this great Reddit answer.

It links to a video, at the exact timestamp where False Sharing is explained:

@kornel FYI

