Is there anything obviously wrong with this benchmark?

use rayon::iter::{IntoParallelRefMutIterator, ParallelIterator};

#[derive(Clone, Copy)]
pub struct Atom {
    x: u64,
    y: u64,
    z: u64,
    aux: u64,
}

impl Atom {
    pub fn new() -> Atom {
        Atom {
            x: 0,
            y: 0,
            z: 0,
            aux: 0,
        }
    }
}

fn it_works() {
    let mut t = vec![Atom::new(); 1_000_000_000];

    // Sequential pass over all atoms.
    let start = std::time::Instant::now();
    t.iter_mut().for_each(|p| {
        p.x = p.x + 1;
        p.y = p.y + 1;
        p.z = p.z + 1;
    });
    let mid = std::time::Instant::now();
    // Same work through rayon's parallel iterator.
    t.par_iter_mut().for_each(|p| {
        p.x = p.x + 1;
        p.y = p.y + 1;
        p.z = p.z + 1;
    });
    let end = std::time::Instant::now();

    println!("1 thread: {:?}", mid - start);
    println!("many thread: {:?}", end - mid);
}

1 thread: 7.25635044s
many thread: 3.939349026s

This is on an old-ish dual-CPU server: 6 cores per CPU, so 12 cores and 24 threads in total.

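Each pass touches 1,000,000,000 atoms of 32 bytes each, i.e. 32 GB, so it seems like we're getting roughly 4 GB/s on a single thread and only about 8 GB/s multi-threaded. A possible explanation is that the machine gets about 4 GB/s of memory bandwidth per CPU.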
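For reference, the arithmetic behind those figures can be sketched as below. The helper names `bytes_per_pass` and `gb_per_sec` are made up for illustration, and the calculation counts each byte once; since every atom is both read and written, the real memory traffic is probably closer to twice this.

```rust
use std::mem::size_of;

// Same layout as the Atom in the benchmark: four u64 fields, 32 bytes.
#[allow(dead_code)]
struct Atom {
    x: u64,
    y: u64,
    z: u64,
    aux: u64,
}

/// Bytes covered by one pass over `n` atoms (hypothetical helper).
fn bytes_per_pass(n: u64) -> u64 {
    n * size_of::<Atom>() as u64
}

/// Throughput in GB/s (10^9 bytes per second) for one pass taking `seconds`.
fn gb_per_sec(n: u64, seconds: f64) -> f64 {
    bytes_per_pass(n) as f64 / seconds / 1e9
}
```

Plugging in the measured times, `gb_per_sec(1_000_000_000, 7.25635044)` comes out at about 4.4 and `gb_per_sec(1_000_000_000, 3.939349026)` at about 8.1.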

Is there anything obviously wrong with this benchmark?

It seems to me that the unit of work you assign to each thread is very small, which might result in too much thread management overhead. I'd try to slice up the vec into something like 24 mutable slices, and have each thread work on those.
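A minimal sketch of that suggestion, using only the standard library's scoped threads rather than rayon so the chunking is explicit. The function name `bump_chunked` and the `num_chunks` parameter are assumptions for illustration; whether this actually beats `par_iter_mut` on the machine above is something only a measurement can tell.

```rust
use std::thread;

#[derive(Clone, Copy)]
pub struct Atom {
    pub x: u64,
    pub y: u64,
    pub z: u64,
    pub aux: u64,
}

/// Increment x, y and z of every atom, one scoped thread per chunk.
/// `num_chunks` is a hypothetical tuning knob (24 in the suggestion above).
pub fn bump_chunked(atoms: &mut [Atom], num_chunks: usize) {
    if atoms.is_empty() {
        return;
    }
    let n = num_chunks.max(1);
    // Round up so that at most `n` chunks are produced.
    let chunk_len = (atoms.len() + n - 1) / n;
    thread::scope(|s| {
        for chunk in atoms.chunks_mut(chunk_len) {
            // Each thread owns a disjoint mutable slice, so no locking is needed.
            s.spawn(move || {
                for p in chunk.iter_mut() {
                    p.x += 1;
                    p.y += 1;
                    p.z += 1;
                }
            });
        }
        // All spawned threads are joined when the scope ends.
    });
}
```

Calling `bump_chunked(&mut t, 24)` would replace the `par_iter_mut` pass in the benchmark. Within rayon itself, `par_chunks_mut` on the slice expresses a similar idea.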

Does rayon 'group' the work or not? I was under the impression rayon does "group the work", but I could not find evidence supporting my claim.

The principle of rayon is that each work unit can be subdivided into smaller units, and if any threads are idle, they'll grab some part (half, I'd imagine) of what the non-idle thread is currently doing. That's cheaper than queueing every single iterator item for a random thread to pick up, but it's more expensive than a loop that isn't prepared to give up part of what it's doing. The plain loop

    t.iter_mut().for_each(|p| {
        p.x = p.x + 1;
        p.y = p.y + 1;
        p.z = p.z + 1;
    });

can compile down to the arithmetic, an increment of a counter, and a test of whether the counter has reached the constant 1_000_000_000. The parallel version must, at a minimum, perform synchronization operations to check whether another thread has grabbed the second half of its work.
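To make the halving idea concrete, here is a rough standard-library sketch. It is not rayon's implementation: rayon hands off work only when another thread is idle and runs on a fixed thread pool, whereas this version splits eagerly and spawns an OS thread per split. The names `bump_split` and `min_len` are made up for illustration.

```rust
use std::thread;

#[derive(Clone, Copy)]
pub struct Atom {
    pub x: u64,
    pub y: u64,
    pub z: u64,
    pub aux: u64,
}

/// The plain sequential loop: nothing but the arithmetic and a counter.
fn bump(atoms: &mut [Atom]) {
    for p in atoms.iter_mut() {
        p.x += 1;
        p.y += 1;
        p.z += 1;
    }
}

/// Eager divide-and-conquer: keep halving until a piece has at most
/// `min_len` atoms, then run the sequential loop on it.
pub fn bump_split(atoms: &mut [Atom], min_len: usize) {
    if atoms.len() <= min_len.max(1) {
        bump(atoms);
        return;
    }
    let mid = atoms.len() / 2;
    let (left, right) = atoms.split_at_mut(mid);
    thread::scope(|s| {
        // Hand one half to another thread and keep the other half here.
        s.spawn(move || bump_split(left, min_len));
        bump_split(right, min_len);
    });
}
```

A larger `min_len` means fewer splits and less coordination per atom, which is the trade-off described above: the leaves run the cheap sequential loop, and the splitting is where the extra cost lives.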