Long title: Parallel execution of a pairwise combination (sum, difference, product, ...) of each element of two vectors / arrays of sizes n and m, producing an output of size n x m, collected into a HashMap.
Hi Rust community!
I have a pair of nested loops in which elements of two vectors are combined, and I want to execute that in parallel. Then I want to build some sort of histogram with a HashMap, counting the occurrences of each combination. I decided to use a HashMap because having the keys available is practical for me.
A combination here is the absolute value of the difference of each possible pair (in the real case it will be triples and a number will be attached).
The whole code can also be found on the playground: Rust Playground
In v1 everything is done in one step; in v2 the combinations are calculated first and then their occurrences are counted. From the point of view of functional programming and reusability, v2 appeals more to me, but v1 probably needs less memory.
//=============================
// alternative variant v1:
let mut grouped = HashMap::new();
for v1 in points1.iter() {
    for v2 in points2.iter() {
        let diff: i32 = v1 - v2;
        let norm = diff.abs();
        let counter = grouped.entry(norm).or_insert(0);
        *counter += 1;
    }
}
//=============================
// variant v2:
// preparation of the array that holds the combinations
// an uninitialized vector/array is used because each element will be overwritten
let n = points1.len();
let m = points2.len();
let size = n * m;
let mut combinations: Vec<i32> = Vec::with_capacity(size);
unsafe {
    combinations.set_len(size);
}
let mut combinations = Array1::from(combinations);
//let mut combinations = Vec::new();
for (i, v1) in points1.iter().enumerate() {
    for (j, v2) in points2.iter().enumerate() {
        // i counts the n rows of length m, so the row stride must be m
        let index = i * m + j;
        let diff: i32 = v1 - v2;
        let norm = diff.abs();
        combinations[index] = norm;
        //combinations.push(norm);
    }
}
let mut grouped = HashMap::new();
combinations.iter().for_each(|comb| {
    let counter = grouped.entry(*comb).or_insert(0);
    *counter += 1;
});
In the following part I attempt to parallelize the loop from v2 that computes the combinations, either a) by changing the elements of a predefined array by index, or b) by collecting into a new vector.
For a) I can use a mutex to borrow the array mutably (but the benchmark of the optimized version below shows that this performs worse than serial execution). Furthermore, one needs to get the data back out of the mutex.
In b) I have a problem collecting into a Vec, which gives an error.
//=============================
a)
points1.into_par_iter().enumerate().for_each(|(j, v1)| {
    points2.par_iter().enumerate().for_each(|(i, v2)| {
        // j counts the rows of length m, so the row stride must be m
        let index = j * m + i;
        //println!("{:?}", index);
        let diff: i32 = v1 - v2;
        let norm = diff.abs();
        let mut arr = combinations.lock().unwrap();
        arr[index] = norm;
        //combinations[index] = norm;
    })
});
// get the data back out of the mutex
let combinations = combinations.into_inner().unwrap();
//=============================
b)
let test: Vec<i32> = points1
    .par_iter()
    .map(|v1| {
        points2
            .par_iter()
            .map(|v2| {
                let diff: i32 = v1 - v2;
                let norm = diff.abs();
                norm
            })
            // -> "sum" is added here to make the code compilable <-
            //.sum()
    })
    .collect();
error[E0277]: the trait bound `Vec<i32>: FromParallelIterator<rayon::iter::Map<rayon::slice::Iter<'_, i32>, [closure@src/main.rs:162:22: 166:18]>>` is not satisfied
   --> src/main.rs:170:10
    |
170 |     .collect();
    |      ^^^^^^^ the trait `FromParallelIterator<rayon::iter::Map<rayon::slice::Iter<'_, i32>, [closure@src/main.rs:162:22: 166:18]>>` is not implemented for `Vec<i32>`
    |
    = help: the following implementations were found:
              <Vec<T> as FromParallelIterator<T>>
How to resolve it?
I think parallelizing the counting of occurrences with the hashmap might be a difficult task, but the benchmarks show that it is a crucial step.
combinations.iter().for_each(|comb| {
    let counter = grouped.entry(*comb).or_insert(0);
    *counter += 1;
});
What might be a better way? Sorting the array beforehand and only then adding the keys, without checking whether an element already exists (e.g. by means of a manual index)? Or using an existing crate and transforming the result into a HashMap afterwards?
But, for example, in my case my computer already runs out of memory for <100_100 elements (40_000_000_000_000 bytes) and the execution fails. So direct collection seems attractive.
One snapshot of the execution times:
creating points
points created: 56.292µs
START direct collection v1
direct collection: 1.532850622s
START combination seperate v2
combination seperate: 140.865953ms
START grouping v2
grouped seperate: 1.580918659s
START alternative a)
alternative a): 5.819199738s
I hope this is a suitable place to discuss this problem.
Opinions and help for improvement are very welcome, also regarding whether there could be a significantly faster design in general. Thanks!