Is there anything obviously wrong with this benchmark?

zeroexcuses · September 2, 2022, 5:35am

use rayon::iter::{IntoParallelRefMutIterator, ParallelIterator};

#[derive(Clone, Copy)]
pub struct Atom {
    x: u64,
    y: u64,
    z: u64,
    aux: u64,
}

impl Atom {
    pub fn new() -> Atom {
        Atom {
            x: 0,
            y: 0,
            z: 0,
            aux: 0,
        }
    }
}

#[test]
fn it_works() {
    let mut t = vec![Atom::new(); 1_000_000_000];

    let start = std::time::Instant::now();
    t.iter_mut().for_each(|p| {
        p.x = p.x + 1;
        p.y = p.y + 1;
        p.z = p.z + 1;
    });
    let mid = std::time::Instant::now();
    t.par_iter_mut().for_each(|p| {
        p.x = p.x + 1;
        p.y = p.y + 1;
        p.z = p.z + 1;
    });
    let end = std::time::Instant::now();

    println!("1 thread: {:?}", mid - start);
    println!("many thread: {:?}", end - mid);
}

1 thread: 7.25635044s
many thread: 3.939349026s

This is on an old-ish dual-CPU server: 6 cores per CPU, so 12 cores and 24 threads in total.

It seems like we're getting about 4 GB/s on a single thread and only about 8 GB/s with many threads. One possible explanation is that the machine gets 4 GB/s of memory bandwidth per CPU.
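For reference, the GB/s figures follow from the element size and the timings printed above. A small sketch of that arithmetic, assuming 1 GB = 10^9 bytes and counting each 32-byte `Atom` once per pass (the helper names `bytes_per_pass` and `gb_per_sec` are made up for this illustration):

```rust
use std::time::Duration;

/// Bytes covered by one pass over `len` elements of `elem_size` bytes each.
fn bytes_per_pass(len: usize, elem_size: usize) -> u64 {
    len as u64 * elem_size as u64
}

/// Throughput in GB/s (1 GB = 10^9 bytes) for `bytes` processed in `elapsed`.
fn gb_per_sec(bytes: u64, elapsed: Duration) -> f64 {
    bytes as f64 / 1e9 / elapsed.as_secs_f64()
}

fn main() {
    // Atom has four u64 fields, so 32 bytes per element.
    let bytes = bytes_per_pass(1_000_000_000, 4 * std::mem::size_of::<u64>());

    // Timings copied from the benchmark output above.
    let single = gb_per_sec(bytes, Duration::from_secs_f64(7.25635044));
    let multi = gb_per_sec(bytes, Duration::from_secs_f64(3.939349026));

    println!("1 thread: {single:.2} GB/s");
    println!("many thread: {multi:.2} GB/s");
}
```

Since each pass both reads and writes every element, the actual memory traffic is probably closer to twice these numbers.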

Is anything obviously wrong with this benchmark?

KillTheMule · September 2, 2022, 8:03am

It seems to me that the unit of work you assign to each thread is very small, which might result in too much thread management overhead. I'd try to slice up the vec into something like 24 mutable slices, and have each thread work on those.
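A minimal sketch of that idea, using only the standard library so that it is self-contained: the slice is cut into one contiguous chunk per thread, and each scoped thread runs a plain loop over its own chunk. The function name `bump_chunked` and the reduced vector size are assumptions for illustration, not part of the original benchmark.

```rust
use std::thread;

#[derive(Clone, Copy, Debug, PartialEq)]
struct Atom {
    x: u64,
    y: u64,
    z: u64,
    aux: u64,
}

/// Increment x, y and z of every atom, giving each of (at most)
/// `n_threads` threads one contiguous chunk of the slice.
fn bump_chunked(atoms: &mut [Atom], n_threads: usize) {
    let n_threads = n_threads.max(1);
    // Round up so that no more than `n_threads` chunks are produced;
    // `max(1)` keeps `chunks_mut` from panicking on an empty slice.
    let chunk_len = ((atoms.len() + n_threads - 1) / n_threads).max(1);

    thread::scope(|s| {
        for chunk in atoms.chunks_mut(chunk_len) {
            // Each thread owns its chunk exclusively, so no synchronization
            // is needed until the scope joins all threads at the end.
            s.spawn(move || {
                for p in chunk {
                    p.x += 1;
                    p.y += 1;
                    p.z += 1;
                }
            });
        }
    });
}

fn main() {
    // Much smaller than the 1_000_000_000 elements in the original post.
    let mut t = vec![Atom { x: 0, y: 0, z: 0, aux: 0 }; 1_000_000];
    let n = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    bump_chunked(&mut t, n);
    assert!(t.iter().all(|p| p.x == 1 && p.y == 1 && p.z == 1 && p.aux == 0));
}
```

Within rayon itself, `par_chunks_mut` on the slice or `with_min_len` on the parallel iterator should give a similar coarse-grained split without managing threads by hand.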

zeroexcuses · September 2, 2022, 8:14am

Does rayon 'group' the work or not? I was under the impression that rayon does "group the work", but I could not find evidence supporting that.

kpreid · September 2, 2022, 2:11pm

The principle of rayon is that each work unit can be subdivided into smaller units, and if any threads are idle, they'll grab some part (half, I'd imagine) of what the non-idle thread is currently doing. That's cheaper than queueing every single iterator item for a random thread to pick up, but it's more expensive than a loop that isn't prepared to give up part of what it's doing. The plain loop

    t.iter_mut().for_each(|p| {
        p.x = p.x + 1;
        p.y = p.y + 1;
        p.z = p.z + 1;
    });

can compile down to just the arithmetic, a counter increment, and a test of whether the counter has reached the constant 1_000_000_000. The parallel version has to, at a minimum, have synchronization operations to check if somebody grabbed the second half of its work.
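To make the "subdivide into halves" idea concrete, here is a rough standard-library sketch. It is not how rayon is implemented: rayon splits adaptively on a fixed thread pool, only when another thread is idle, whereas this version always splits down to a fixed size and spawns a scoped thread for every right half. The name `bump_split` and the `min_len` cutoff are assumptions for illustration.

```rust
use std::thread;

/// Recursively halve `data` until a piece is at most `min_len` long,
/// then process that piece with a plain sequential loop.
fn bump_split(data: &mut [u64], min_len: usize) {
    if data.len() <= min_len.max(1) {
        // Leaf: an ordinary loop with no coordination between threads.
        for v in data.iter_mut() {
            *v += 1;
        }
        return;
    }

    let mid = data.len() / 2;
    let (left, right) = data.split_at_mut(mid);

    thread::scope(|s| {
        // Hand the right half to another thread...
        s.spawn(move || bump_split(right, min_len));
        // ...and keep working on the left half on the current one.
        bump_split(left, min_len);
    });
    // Leaving the scope joins the spawned thread; this is the kind of
    // synchronization point the plain loop never has to pay for.
}

fn main() {
    let mut v = vec![0u64; 1_000_000];
    bump_split(&mut v, 250_000);
    assert!(v.iter().all(|&x| x == 1));
}
```

The smaller `min_len` is, the more often the split-and-join path runs relative to the arithmetic, which is the overhead trade-off described above.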
