How to benchmark a multi-threaded method? Is Criterion only useful for single-threaded code?

Dear Rustaceans,

do you have any tips on how to "reliably" benchmark a multi-threaded method in Rust?
I'd like to benchmark something like the following (the actual logic is not important here; I only want to illustrate the "multi-threadedness" to give you an idea):

Link to playground

use rayon::prelude::*; // 1.5.0

// this function should be benchmarked
fn sum_parallel(nums: impl ParallelIterator<Item=u64>) -> u64 {
    nums.reduce(|| 0u64, |a, b| a + b)
}

fn main() {
    let sum = sum_parallel((0..10_000u64).into_par_iter());
    println!("{}", sum);
}

I've already tried Criterion, which works really well, but here comes the catch:
When I try to benchmark my method with Criterion, it reports a significantly higher time (~1.5×) for the measured method than when I benchmark it "manually" in my own code (with std::time::Instant::now() and .elapsed()).
My assumption is the following: my machine has two physical cores (and four logical cores). When I benchmark with Criterion, Criterion itself occupies at least one physical core, so the method under test effectively runs on a single physical core, which basically turns it into a single-threaded method.
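For reference, this is roughly what the "manual" timing looks like. This is a minimal sketch using only std::time::Instant; `sum_sequential` and `time_once` are hypothetical helper names (a sequential stand-in is used here so the snippet compiles without rayon), and a single measurement like this lacks the warm-up and repeated sampling that Criterion performs:

```rust
use std::time::{Duration, Instant};

// Sequential stand-in for the method under test (hypothetical helper).
fn sum_sequential(nums: impl Iterator<Item = u64>) -> u64 {
    nums.fold(0u64, |a, b| a + b)
}

// Time a single run. To approximate what Criterion does, you would need
// many iterations, a warm-up phase, and statistical aggregation on top.
fn time_once() -> (u64, Duration) {
    let start = Instant::now();
    let sum = sum_sequential(0..10_000u64);
    (sum, start.elapsed())
}
```

A single `Instant` measurement is very sensitive to caches, frequency scaling, and background load, which is exactly why it can disagree with Criterion's numbers.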

Can I maybe configure Criterion for this scenario? Or is it simply not the right tool for this (I know it is intended for microbenchmarks)?

Do you have any suggestions for other tools or approaches?

Thank you! :heart:

It's very easy to get misleading results from bare Instant::now(), so Criterion is more likely to give you the true result here.

There are many situations in which multi-threaded code may really be slower than single-threaded code:

  • rayon's management of its work-stealing queues is not free. If an individual task is very small/quick, the cost of dividing the work between threads can be significant compared to the work itself. Use multithreading only for large datasets/big jobs, and use with_min_len() in rayon to set a minimum chunk size per task.

  • if your individual task is basic arithmetic like a + b, then splitting the work into individual items makes you miss out on SIMD: it's faster for 1 core to perform one 16-lane SIMD operation than for 16 cores to each perform 1 scalar operation. Having loops inside your rayon tasks can give you the best of both.

  • False sharing. If multiple threads read and write nearby addresses in memory (addresses that share a cache line), the CPU has to synchronize that cache line between cores, making memory accesses very slow; the code then behaves almost like single-threaded code. Make sure the work is separated into large chunks, and ideally the chunks should be cache-line-aligned.
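To make the false-sharing point concrete, here is a stdlib-only sketch (no rayon; `Packed`, `Padded`, and the `bump_*` helpers are made-up names for illustration). Both versions compute the same result, but on a multi-core machine timing them typically shows the cache-line-padded version running noticeably faster:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// Two counters packed next to each other: they likely share a cache line,
// so two threads hammering them contend even though the data is "separate".
struct Packed {
    a: AtomicU64,
    b: AtomicU64,
}

// Aligning each counter to its own 64-byte cache line avoids false sharing.
#[repr(align(64))]
struct Padded(AtomicU64);

fn bump_packed(n: u64) -> (u64, u64) {
    let p = Packed { a: AtomicU64::new(0), b: AtomicU64::new(0) };
    thread::scope(|s| {
        s.spawn(|| for _ in 0..n { p.a.fetch_add(1, Ordering::Relaxed); });
        s.spawn(|| for _ in 0..n { p.b.fetch_add(1, Ordering::Relaxed); });
    });
    (p.a.load(Ordering::Relaxed), p.b.load(Ordering::Relaxed))
}

fn bump_padded(n: u64) -> (u64, u64) {
    let a = Padded(AtomicU64::new(0));
    let b = Padded(AtomicU64::new(0));
    thread::scope(|s| {
        s.spawn(|| for _ in 0..n { a.0.fetch_add(1, Ordering::Relaxed); });
        s.spawn(|| for _ in 0..n { b.0.fetch_add(1, Ordering::Relaxed); });
    });
    (a.0.load(Ordering::Relaxed), b.0.load(Ordering::Relaxed))
}
```

(64 bytes is a common x86 cache-line size; other architectures may differ. `std::thread::scope` requires Rust 1.63+.)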


Hey @kornel,
sorry for my late reply.
Thank you for your detailed answer. I've tested it again using 100_000_000 elements (!) in the collection (and mapping each element with |x| x * 2), and it shows a 40% speedup for the parallel version over the non-parallel version when measured with Criterion.
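The "large enough chunks" idea can be sketched without rayon using `std::thread::scope`. This is only an illustration under assumptions: the function name and the `n_threads` parameter are made up, not rayon API, but the principle is the same one `with_min_len()` enforces — each thread gets one big contiguous chunk, so per-thread overhead is amortized over many elements:

```rust
use std::thread;

// Splits the input into one large chunk per thread, doubles each element,
// and sums the partial results. Large chunks keep the per-thread
// bookkeeping small relative to the actual work (the same idea as
// rayon's with_min_len()).
fn sum_parallel_chunks(data: &[u64], n_threads: usize) -> u64 {
    // Ceiling division so all elements are covered.
    let chunk = (data.len() + n_threads - 1) / n_threads;
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk.max(1))
            .map(|c| s.spawn(move || c.iter().map(|&x| x * 2).sum::<u64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}
```

The inner `c.iter().map(...).sum()` is a plain sequential loop, which also gives the compiler a chance to auto-vectorize, touching on the SIMD point above.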

So you are right: Criterion gives the correct results even for multi-threaded code.

It was probably my mistake not having large enough chunks.

I didn't know that multiple threads reading and writing nearby addresses in memory cause a slowdown due to synchronization overhead. Interesting... :open_mouth:

Thank you for all the info. I've learned a lot! :heart:

Edit on 30.03.2021:
Regarding False Sharing (I didn't know this was an actual term), I've found this great Reddit answer.

It links to a video, at the exact timestamp where False Sharing is explained:

@kornel FYI

