Bad performance with rayon?

Hello!

I am trying some stuff with Rayon and I have found this particular occurrence that I do not understand:

   let arr: [i32; 50 000] = [1; 50 000];
   let it_s = arr.iter(); 

   let now = Instant::now();

    let sum: i32 = it_s
    .sum(); 

    println!("sequential iterator sum {}", now.elapsed().as_nanos() / 10000);


    let it_p = arr2.into_par_iter(); 
    now = Instant::now();

    let sum: i32 = it_p
    .sum(); 

    println!("parallel iterator sum: {}", now.elapsed().as_nanos() / 10000);

sequential iterator sum 193
parallel iterator sum: 1756

why is the performance of my sequential program over 8 times faster than my parallelized version of it? I have this problem for other examples I have played around with an initially thought that the cost lies in creating the parallel iterator however this doesnt make any sense and since it is also more performant compared to a normal iterator as opposed to just a for loop I ruled out that the iterator itself causes the slowdown

Parallel execution has a lot of overhead, summing a small list of numbers isn't a big enough task to pass the threshold where that overhead is smaller than the task itself.

7 Likes

thanks, I understand the overhead, however I thought that 50k iterations is not small anymore. with OpenMP on a C program I can achieve better performance as opposed to a slow down. so the sum on parallel iterator is then probably better used with other functions chained before then where more heavier computation takes place as opposed to just cranking up the iterations?

50k complex structs with a complex operation to do could be a large number (depending on the kind of data and operation to do), but 50k integers, by today's standard I think would not be considered large. Even a real large number of integers, if it's just summing up, the compiler may even just optimize that to the final result without any loops.

Have you compiled it with release mode? Since the number you've posted seems weird. On my M1 macbook pro sequential sum takes about 3.8 microseconds. On my AWS lightsail linux instance it's 7.8 microseconds. But you said it takes 1930 microseconds - which is a lot.

4 Likes

It does that indeed - just checked on the playground, sequential sum compiles down to just movl $50000, %eax, while parallel isn't optimized out (why should it?) and actually creates the whole multithreading calculation machinery.

6 Likes

Another point that is worth noting is, that the compiler may be able to use simd instructions if it realizes that it is iterating over sequential numbers and doing operations on them, for which vectorization is possible.
In this way, by using rayon you gain one type of parallelism (with its tradeoffs) but are losing another kind of parallelism.
Here are the instructions for the simple sum version on godbolt

2 Likes

Note, the FIRST execution of rayon requires the launching of threads, which is going to be MANY millis. If you'd run this maybe 5 times sequentially it may provide more comparable results on subsequent runs.

But in general, this type of algorithm is memory bound; not CPU bound. On a modern i1x900K intel; you can run something like 5Ghz single CPU, but only 3.2GHz multi CPU (when the DRAM bus is only 3.2ghz).. So if you're loading 4 4 byte words each clock tick (2x 64bit bus) from DRAM, and the CPU can process 8 adds per clock (but at 1.5x the BUS clock rate), then multi-CPU isn't going to buy you anything. Note rust does SSE or AVX vectorized loop unrolling in single-CPU mode (so you're doing 4..8 adds per clock depending on your compiler settings).

What would perform better is an fft per array cell entry. Something like Vec<[ [ f32 ; 4 ] ; 4 ]> and performing some accumulating fpu matrix norm operation.. This is many hundreds of clock ticks per vector element, and thus could make excellent use of rayon. (small single cache line loads per core with a long delay between loads). Even with the thread launch overhead, this would probably still favor better.