Bad performance with rayon?

testcasefails · September 15, 2022, 7:20am

Hello!

I am trying some stuff with Rayon and I have found this particular occurrence that I do not understand:

   let arr: [i32; 50 000] = [1; 50 000];
   let it_s = arr.iter(); 

   let now = Instant::now();

    let sum: i32 = it_s
    .sum(); 

    println!("sequential iterator sum {}", now.elapsed().as_nanos() / 10000);


    let it_p = arr2.into_par_iter(); 
    now = Instant::now();

    let sum: i32 = it_p
    .sum(); 

    println!("parallel iterator sum: {}", now.elapsed().as_nanos() / 10000);

sequential iterator sum 193
parallel iterator sum: 1756

why is the performance of my sequential program over 8 times faster than my parallelized version of it? I have this problem for other examples I have played around with an initially thought that the cost lies in creating the parallel iterator however this doesnt make any sense and since it is also more performant compared to a normal iterator as opposed to just a for loop I ruled out that the iterator itself causes the slowdown

erelde · September 15, 2022, 7:39am

Parallel execution has a lot of overhead, summing a small list of numbers isn't a big enough task to pass the threshold where that overhead is smaller than the task itself.

testcasefails · September 15, 2022, 7:57am

thanks, I understand the overhead, however I thought that 50k iterations is not small anymore. with OpenMP on a C program I can achieve better performance as opposed to a slow down. so the sum on parallel iterator is then probably better used with other functions chained before then where more heavier computation takes place as opposed to just cranking up the iterations?

erelde · September 15, 2022, 8:00am

50k complex structs with a complex operation to do could be a large number (depending on the kind of data and operation to do), but 50k integers, by today's standard I think would not be considered large. Even a real large number of integers, if it's just summing up, the compiler may even just optimize that to the final result without any loops.

Hyeonu · September 15, 2022, 8:07am

Have you compiled it with release mode? Since the number you've posted seems weird. On my M1 macbook pro sequential sum takes about 3.8 microseconds. On my AWS lightsail linux instance it's 7.8 microseconds. But you said it takes 1930 microseconds - which is a lot.

Cerber-Ursi · September 15, 2022, 8:10am

It does that indeed - just checked on the playground, sequential sum compiles down to just movl $50000, %eax, while parallel isn't optimized out (why should it?) and actually creates the whole multithreading calculation machinery.

hk1009 · September 15, 2022, 8:17am

Another point that is worth noting is, that the compiler may be able to use simd instructions if it realizes that it is iterating over sequential numbers and doing operations on them, for which vectorization is possible.
In this way, by using rayon you gain one type of parallelism (with its tradeoffs) but are losing another kind of parallelism.
Here are the instructions for the simple sum version on godbolt

maraist · September 15, 2022, 8:59pm

Note, the FIRST execution of rayon requires the launching of threads, which is going to be MANY millis. If you'd run this maybe 5 times sequentially it may provide more comparable results on subsequent runs.

But in general, this type of algorithm is memory bound; not CPU bound. On a modern i1x900K intel; you can run something like 5Ghz single CPU, but only 3.2GHz multi CPU (when the DRAM bus is only 3.2ghz).. So if you're loading 4 4 byte words each clock tick (2x 64bit bus) from DRAM, and the CPU can process 8 adds per clock (but at 1.5x the BUS clock rate), then multi-CPU isn't going to buy you anything. Note rust does SSE or AVX vectorized loop unrolling in single-CPU mode (so you're doing 4..8 adds per clock depending on your compiler settings).

What would perform better is an fft per array cell entry. Something like Vec<[ [ f32 ; 4 ] ; 4 ]> and performing some accumulating fpu matrix norm operation.. This is many hundreds of clock ticks per vector element, and thus could make excellent use of rayon. (small single cache line loads per core with a long delay between loads). Even with the thread launch overhead, this would probably still favor better.

system · December 14, 2022, 8:59pm

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.

Topic		Replies	Views
Optimizations in Rayon help	6	161	February 22, 2025
Rayon par_iter_mut slower than serial help	4	1396	November 23, 2020
Conditionally Switch Between Rayon Parallel and Sequential Iteration Based on Collection Size?	14	308	June 4, 2025
[SOLVED] Use rayon is 7x slower help	6	2144	January 12, 2023
Iterators faster than for loops in parallel help	3	702	August 9, 2020

Bad performance with rayon?

Related topics