Understanding performance loss in a simple copy kernel

Hello Rustaceans,

I am currently writing an HPC benchmarking suite in Rust with simple kernels like copy, update, triad, daxpy, sum, etc. The main goal of this suite is to saturate the memory bandwidth and see how much of it each kernel can use. The original benchmarking kernels were written in C with OpenMP for parallelism.

For example, the Copy kernel in C with OpenMP (static scheduling) is written as:

double copy(
        double * restrict a,
        double * restrict b,
        int N
        )
{
    double S, E;

    S = getTimeStamp();
#pragma omp parallel for schedule(static)
    for (int i=0; i<N; i++) {
        a[i] = b[i];
    }
    E = getTimeStamp();

    return E-S;
}

And I am trying to migrate such benchmarks to Rust with Rayon. Below is the best-performing kernel I could write in Rust with Rayon.

use rayon::prelude::*;
use std::thread::available_parallelism;
use std::time::Instant;

pub fn copy(c: &mut [f64], a: &[f64], n: usize) -> f64 {
    let c_iter = c.par_chunks_mut(n);

    let s = Instant::now();

    // Parallel version
    c_iter.for_each(|c_slice| {
        c_slice
            .iter_mut()
            .enumerate()
            .for_each(|(i, val)| *val = a[i])
    });

    s.elapsed().as_secs_f64()
}

pub fn main() {
    let threads = available_parallelism().unwrap().get();

    let vec_a: Vec<f64> = (0..120_000_000).into_par_iter().map(|_| 2.0).collect();
    let mut vec_c: Vec<f64> = (0..120_000_000).into_par_iter().map(|_| 0.5).collect();

    let t1 = copy(&mut vec_c, &vec_a, 120_000_000 / threads);

    println!("Copy 1 took : {t1} sec");
}

The main point of getting the thread count and dividing the problem size by it is to give each thread an equal, contiguous amount of work: thread #0 gets the first 0..x elements, thread #1 gets the next contiguous x..y elements, and so on. This is how OpenMP static scheduling works.
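One detail worth noting: this mapping assumes Rayon's global pool has exactly that many workers. That is Rayon's default, but the pool can also be pinned explicitly. A minimal sketch (this setup is not part of my kernel above):

use std::thread::available_parallelism;

// Pin Rayon's global pool to the detected core count so that
// "one chunk per thread" really maps one task to one worker.
fn init_pool() {
    let threads = available_parallelism().unwrap().get();
    rayon::ThreadPoolBuilder::new()
        .num_threads(threads)
        .build_global()
        .unwrap();
}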

I used these flags in .cargo/config.toml:

[target.x86_64-unknown-linux-gnu]
rustflags = [
    "-C",
    "target-cpu=native",
    "-C",
    "llvm-args=-ffast-math",
    "-C",
    "opt-level=3",
]

My problem is that I am not able to extract the same performance as the C version.
I am using a dual-socket Intel Sapphire Rapids (Xeon Platinum 8470) system with 104 cores and a theoretical peak memory bandwidth of 600 GB/s.
The C version with OpenMP takes 0.0052 seconds on average:

Function      Rate(MB/s)  Rate(MFlop/s)  Avg time     Min time     Max time
----------------------------------------------------------------------------
Copy:         370203.49    -             0.0052       0.0052       0.0053
----------------------------------------------------------------------------

But the Rust version with Rayon takes 0.0083 seconds on average:

Function      Rate(MB/s)  Rate(MFlop/s)  Avg time     Min time     Max time
----------------------------------------------------------------------------
Copy:         247803.58    -             0.0083       0.0077       0.0086 
----------------------------------------------------------------------------

Of course, it does not seem fair to compare C + OpenMP with Rust + Rayon.
But I want to understand where I am losing performance. My belief is that if C can achieve this performance, so can Rust.

I have also tried multiple manual versions: vectorizing the code with AVX-512 intrinsics in unsafe Rust, manual loop unrolling, etc. But nothing gets past this barrier of 0.0077 seconds in Rust.
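For illustration, the manual AVX-512 attempts looked roughly like this (a sketch of the approach, not my exact code; it assumes a toolchain where the std::arch AVX-512 intrinsics are available and a length divisible by 8):

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx512f")]
unsafe fn copy_avx512(dst: &mut [f64], src: &[f64]) {
    use core::arch::x86_64::{_mm512_loadu_pd, _mm512_storeu_pd};
    assert_eq!(dst.len(), src.len());
    assert_eq!(src.len() % 8, 0);
    for i in (0..src.len()).step_by(8) {
        let v = _mm512_loadu_pd(src.as_ptr().add(i)); // load 8 doubles
        _mm512_storeu_pd(dst.as_mut_ptr().add(i), v); // store 8 doubles
    }
}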

My questions are:

  1. Is this performance overhead from Rayon? From flamegraphs, I could not see much overhead from Rayon functions.
  2. Am I missing something while writing the parallelized kernel that could stop me from extracting the same performance as C?

Thank you in advance to the Rust community. =)


Quick sanity check: have you tried just using copy_from_slice?


Yes. =)
Below is the parallel version I had tried 2 days ago:

use rayon::prelude::*;
use std::time::Instant;

pub fn copy(c: &mut [f64], a: &[f64], n: usize) -> f64 {
    let c_iter = c.par_chunks_mut(n);
    let a_iter = a.par_chunks(n);

    let s = Instant::now();

    // Parallel version
    c_iter.zip(a_iter).for_each(|(c_slice, a_slice)| {
        c_slice.copy_from_slice(a_slice);
    });

    s.elapsed().as_secs_f64()
}

Performance for this kernel is lower than that of the kernel in the first post.

----------------------------------------------------------------------------------------------------------
Function        | Rate(MB/s)      | Rate(MFlop/s)   | Avg time       | Min time        | Max time        |
----------------------------------------------------------------------------------------------------------
Copy:           | 174110.97       | -               | 0.0116         | 0.0110          | 0.0122          |
----------------------------------------------------------------------------------------------------------

I'm not too sure if this would make the difference, but your original C + OpenMP version uses a single for loop to index and copy values from one array to the other, whereas in your Rust + Rayon program the iter + enumerate actually maintains two counters (one for the iterator and another for the enumerate). You might want to try creating a special iterator that internally keeps a single index for both slices, like a specialized zip.
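Something along these lines, perhaps (an untested sketch of the single-counter idea for the inner loop):

fn copy_chunk(dst: &mut [f64], src: &[f64]) {
    // zip advances one fused iterator, so only a single counter is kept.
    dst.iter_mut().zip(src).for_each(|(d, &s)| *d = s);
}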

Or just hand write it like you are doing in your C version.


Give this a try. I am on my phone, so I can't test it right now.

use rayon::prelude::*;
use std::thread::available_parallelism;
use std::time::Instant;

pub fn copy(c: &mut [f64], a: &[f64], n: usize) -> f64 {
    let c_iter = c.par_chunks_mut(n);

    let s = Instant::now();

    // Parallel version

    c_iter.for_each(|c_slice| {
        for i in 0..n {
            c_slice[i] = a[i];
        }
    });

    s.elapsed().as_secs_f64()
}

pub fn main() {
    let threads = available_parallelism().unwrap().get();

    let vec_a: Vec<f64> = (0..120_000_000).into_par_iter().map(|_| 2.0).collect();
    let mut vec_c: Vec<f64> = (0..120_000_000).into_par_iter().map(|_| 0.5).collect();

    let t1 = copy(&mut vec_c, &vec_a, 120_000_000 / threads);

    println!("Copy 1 took : {t1} sec");
}

Indexing with [i] in Rust is often pretty bad and tanks performance. It is not the same operator as in C/C++.

This is because every use of indexing has a mandatory bounds check, which has a panicking branch on failure. LLVM is not always able to eliminate the checks, especially if the length comes from outside of the function, like your n. When the check + panic is left in, it prevents autovectorization and other loop simplifications, because for LLVM it is a very important side effect that cannot be ignored or even moved. So even though branches aren't that expensive on modern CPUs, having a panicking branch in the code is super expensive because it destroys optimizations around it.

Anything other than [i] will almost always be faster. Use iterators and built-in functions where available.

If you're forced to use [i], then first do let slice = &slice[..n] just before the loop for every slice indexed with 0..n. This will usually move bounds checks from inside the loop to before the loop.
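For example (a sketch of that re-slicing pattern):

fn copy_indexed(c: &mut [f64], a: &[f64], n: usize) {
    // One up-front bounds check per slice (each panics at most once here)...
    let c = &mut c[..n];
    let a = &a[..n];
    for i in 0..n {
        // ...lets LLVM prove i < n and drop the per-iteration checks.
        c[i] = a[i];
    }
}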

And instead of arr[i], it's less terrible to have arr.get(i).copied().unwrap_or_default(), because if there is a bounds check, it behaves predictably without global side effects, and LLVM knows how to optimize it.


Try running the Rust and C programs under perf stat and perf record -e cycles, then compare the CPU metrics and the hottest assembly, respectively. Maybe it's having issues with the cache hierarchy, or is more branchy, or the C version vectorizes differently.

    c_iter.for_each(|c_slice| {
        c_slice
            .iter_mut()
            .enumerate()
            .for_each(|(i, val)| *val = a[i])
    });

Try using zip here to iterate over both c and a with a single iterator. zip is designed to eliminate bounds checks for those cases.


Thanks, this is good to know.

Did you mean let slice = &slice[..n] before indexing with i where i < n?

Yes

You are right. Enumerate did add a little bit of overhead, but it was negligible. I checked the flamegraphs and made sure that performance is not affected by the overhead from the enumerate() call.

Thank you for the inputs. I have tried the let x = &mut x[..n]; approach for bounds-check elimination before, but it made no difference in my case: the performance was the same with and without it. Let me show you in my next comment that it still vectorizes without this syntax.

Thanks for your input. I tried a bit of a different approach and found that the C code is vectorized and my Rust code is not. I used the LIKWID tool to see the instruction counts for both versions. LIKWID can directly read the hardware performance counters and count the number of events.

For C + OpenMP, the LIKWID output with instruction count was:

+-----------------------------------------------+----------+---------------+------------+-------------+--------------+
|                     Event                     |  Counter |      Sum      |     Min    |     Max     |      Avg     |
+-----------------------------------------------+----------+---------------+------------+-------------+--------------+
|             INSTR_RETIRED_ANY STAT            |   FIXC0  |  199122767147 |  775646944 |  2905685112 | 1.914642e+09 |
|           CPU_CLK_UNHALTED_CORE STAT          |   FIXC1  |  295831515073 | 1021681044 |  3379342350 | 2.844534e+09 |
|           CPU_CLK_UNHALTED_REF STAT           |   FIXC2  |  215895827280 |  721617120 |  2348666480 | 2.075921e+09 |
|               TOPDOWN_SLOTS STAT              |   FIXC3  | 1774989090438 | 6130086264 | 20276054100 | 1.706720e+10 |
|              PWR_PKG_ENERGY STAT              |   PWR0   |      800.6302 |          0 |    406.2085 |       7.6984 |
|              PWR_DRAM_ENERGY STAT             |   PWR3   |       30.7188 |          0 |     16.0652 |       0.2954 |
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE STAT |   PMC0   |          1044 |          1 |          15 |      10.0385 |
|    FP_ARITH_INST_RETIRED_SCALAR_DOUBLE STAT   |   PMC1   |         18228 |         30 |         814 |     175.2692 |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE STAT |   PMC2   |    3120000000 |    4617002 |   148846200 |     30000000 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE STAT |   PMC3   |             0 |          0 |           0 |            0 |
+-----------------------------------------------+----------+---------------+------------+-------------+--------------+

We can see that FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE is not 0. Hence the code was vectorized.

For Rust + Rayon, the LIKWID output with instruction count was:

+-----------------------------------------------+----------+--------------+------------+-------------+--------------+
|                     Event                     |  Counter |      Sum     |     Min    |     Max     |      Avg     |
+-----------------------------------------------+----------+--------------+------------+-------------+--------------+
|             INSTR_RETIRED_ANY STAT            |   FIXC0  |  47805474138 |  355067083 |  3080404377 | 4.596680e+08 |
|           CPU_CLK_UNHALTED_CORE STAT          |   FIXC1  | 126004217733 |  811286174 |  2468771365 | 1.211579e+09 |
|           CPU_CLK_UNHALTED_REF STAT           |   FIXC2  | 100640987520 |  815106560 |  1600544880 | 9.677018e+08 |
|               TOPDOWN_SLOTS STAT              |   FIXC3  | 756025306398 | 4867717044 | 14812628190 | 7.269474e+09 |
|              PWR_PKG_ENERGY STAT              |   PWR0   |     525.2459 |          0 |    269.9788 |       5.0504 |
|              PWR_DRAM_ENERGY STAT             |   PWR3   |      28.1280 |          0 |     14.5840 |       0.2705 |
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE STAT |   PMC0   |   4199369331 |   32590588 |    43707804 | 4.037855e+07 |
|    FP_ARITH_INST_RETIRED_SCALAR_DOUBLE STAT   |   PMC1   |   4091168761 |   29580696 |   514584836 | 3.933816e+07 |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE STAT |   PMC2   |            0 |          0 |           0 |            0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE STAT |   PMC3   |            0 |          0 |           0 |            0 |
+-----------------------------------------------+----------+--------------+------------+-------------+--------------+

We can see that FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE is 0, which means it wasn't vectorized at all.

It is very funny that if I compile the same kernel on godbolt, it vectorizes the kernel.

But the compiler/LLVM does not vectorize the kernel on my machine even though I have used:

[target.x86_64-unknown-linux-gnu]
rustflags = [
    "-C",
    "target-cpu=native",
    "-C",
    "target-feature=+fma,+avx2,+avx",
    "-C",
    "llvm-args=-ffast-math",
    "-C",
    "opt-level=3",
]

Btw, I did mention here that zipping is degrading performance in my case.

I will have to figure out a way to vectorize the code, or at least find the reason why it isn't vectorizing on my machine when it does on the godbolt website.

You're using target-cpu=native, which will differ from machine to machine.

Sidenote: using llvm-args=-ffast-math makes your entire program unsound, since it lets LLVM assume that no NaNs or infinities occur anywhere, including in safe code.

I meant for the inner loop, not the outer one. But anyway, the inner loop looks like a memcpy, so it could already be vectorized if LLVM is calling into your libc's memcpy. Ah, right, your libc can also matter.
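For reference, copy_from_slice on a &[f64] boils down to roughly this (a sketch, not the exact standard-library source):

use std::ptr;

// A single ptr::copy_nonoverlapping, which LLVM lowers to llvm.memcpy
// and, for large lengths, typically to a call into libc's memcpy.
fn copy_raw(dst: &mut [f64], src: &[f64]) {
    assert_eq!(dst.len(), src.len());
    unsafe {
        ptr::copy_nonoverlapping(src.as_ptr(), dst.as_mut_ptr(), src.len());
    }
}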

You could also try lowering codegen-units to 1 and/or enabling LTO (codegen-units = 1 and lto = true under [profile.release] in Cargo.toml).

I have tried using target-cpu=sapphirerapids, but the performance is the same.

However, I might have been wrong before about LLVM not vectorizing the code. It may also be that the LIKWID data was not correct for the Rust binary.

Using this command below:
RUSTFLAGS="-C target-cpu=native -O -C llvm-args=--pass-remarks=vectorize -C llvm-args=--pass-remarks-analysis=vectorize -C llvm-args=-ffast-math" cargo b --release

I was able to get a vectorization report from LLVM, and it seems that LLVM may be vectorizing the code after all. I am not able to see this in the assembly on my machine because of the Rayon plumbing function calls; after inlining, LLVM might be vectorizing it, as seen in the godbolt example.

I need to dig deeper now. But at least for now, thanks to everyone's help, I am heading in the right direction. =) Thanks a lot!

What OpenMP does and what Rayon does is quite different. OpenMP in your code balances work statically, before execution begins. Rayon uses a "work stealing" approach, which balances work during execution. This approach has more moving parts and can introduce overhead, but it can adapt better to the actual available parallelism on the machine. More importantly, work stealing needs work to steal, and "one chunk per thread" could be just too coarse for this approach to work well.
I recommend increasing the number of chunks; see the sketch below. It can also be helpful to measure a baseline: how the code performs without chunks, with just a par iter working on one element at a time.
Also, if you haven't already, read the FAQ page for Rayon: rayon/FAQ.md at main · rayon-rs/rayon · GitHub
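For instance (a sketch; the chunk size here is an arbitrary starting point to tune, not a recommendation):

use rayon::prelude::*;

// Many small chunks give the work-stealing scheduler something to steal.
pub fn copy_fine_grained(c: &mut [f64], a: &[f64]) {
    let chunk = 1 << 16; // 65_536 elements per chunk; tune this
    c.par_chunks_mut(chunk)
        .zip(a.par_chunks(chunk))
        .for_each(|(dst, src)| dst.copy_from_slice(src));
}

// Baseline: no explicit chunks, one element at a time; Rayon picks the split.
pub fn copy_baseline(c: &mut [f64], a: &[f64]) {
    c.par_iter_mut().zip(a.par_iter()).for_each(|(d, &s)| *d = s);
}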


Also, the Rust code looks incorrect: it uses indices from the dest chunks into the src array, resulting in N copies of the first chunk of the src array. This could explain the performance difference with the correct zip approach, since the cache gets thrashed less (the first chunk of src stays hot).
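For reference, a sketch of the fix that keeps the original kernel's shape (the copy_from_slice + zip version posted earlier is already correct):

use rayon::prelude::*;

// Offset each chunk's reads by its position in c, so task k copies
// a[k*n .. k*n + chunk_len] instead of re-reading a[0..chunk_len].
pub fn copy_fixed(c: &mut [f64], a: &[f64], n: usize) {
    c.par_chunks_mut(n).enumerate().for_each(|(k, c_slice)| {
        let base = k * n;
        c_slice
            .iter_mut()
            .zip(&a[base..base + c_slice.len()])
            .for_each(|(dst, &src)| *dst = src);
    });
}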


Ahh yes!
How could I miss this? Even though I have a validator function at the end that validates the overall results, initializing every element with the same value didn't make the validator fail.
I was wondering, though, why some kernels' performance reached the theoretical peak bandwidth, when effective bandwidth is usually 10-20% less than theoretical.
Thank you for pointing it out. I have changed all the kernels and I can now see believable results.

Also, I tried changing the chunk size, ranging from 2048 up to problem_size/threads with a step size of 256. I plotted the results in a graph to see if it made any difference, but the performance remains the same across the kernels: a flat line with no increase.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.