Hi everyone! I have simplified my problem into the following code example.
Specifically, due to the existence of an index (mapping), I need an extra loop to aggregate the results from the parallel iterator.
Can we avoid the aggregation loop with rayon or some other method?

use rayon::prelude::*;
use std::collections::HashMap;

fn main() {
    let index = HashMap::from([(0, 0), (1, 1), (2, 0), (3, 1)]);
    let results: Vec<(usize, usize)> = (0..4)
        .into_par_iter()
        .map(|i| (index[&i], i * i))
        .collect();
    let mut boxed = vec![0; 2];
    for (index, delta) in results {
        boxed[index] += delta;
    }
    println!("{:?}", boxed);
}

Not without some kind of synchronization of the stored results (e.g. using a Mutex or an Atomic type). In that case you should benchmark with your real-world data to see whether it actually pays off: adding this synchronization may well make things slower than your current solution.

Here's an example using Atomics to store the results.

use rayon::prelude::*;
use std::collections::HashMap;
use std::sync::atomic::{AtomicU32, Ordering};

fn main() {
    let index = HashMap::from([(0, 0), (1, 1), (2, 0), (3, 1)]);
    let boxed = vec![AtomicU32::new(0), AtomicU32::new(0)];
    (0..4).into_par_iter().for_each(|i| {
        let index = index[&i];
        boxed[index].fetch_add(i * i, Ordering::SeqCst);
    });
    println!("{:?}", boxed);
}

It looks like you should be able to aggregate parts of the sequence using .fold(…), and then combine the partial results into a single one with .reduce(…) afterwards; both steps stay parallelized.