I would not expect to see much (if any) benefit from parallelizing this, since it is doing no real computation, and seems likely to be entirely memory-bound. But one approach would be:
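Something like the following sketch could capture that approach (assuming the rayon crate and elements that are cheap to Copy; the function name is just illustrative). Each output row, i.e. each original column, is built as its own parallel task:

use rayon::prelude::*;

// Transpose a [K, N] Vec<Vec<T>> into [N, K]: each output row (one per
// original column) is produced by a separate Rayon task.
fn transpose<T: Copy + Send + Sync>(v: &[Vec<T>]) -> Vec<Vec<T>> {
    let cols = v.first().map_or(0, |row| row.len());
    (0..cols)
        .into_par_iter()
        .map(|j| v.iter().map(|row| row[j]).collect::<Vec<T>>())
        .collect()
}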
If you need this to work on elements that you can't copy or clone cheaply, then it would get more complicated. I'm not sure, but it might require unsafe code to make a parallel by-value iterator that goes by columns.
I would not expect to see much (if any) benefit from parallelizing this, since it is doing no real computation
I want to transpose a Vec of shape [K, N] to [N, K] so I can sum over the Ks for each N. I'm not sure if this will be much faster than doing it sequentially either.
I am actually translating a piece of parallelized Julia code. At this point I'm basically using thread pools and channels to send chunks of Vecs and return others. Do you recommend any linear algebra or array crate that would help with this (and that is reasonably easy to use with Rayon, or has built-in parallel primitives)?
You should avoid the double level of indirection of Vec<Vec<T>>. An alternative is to define a type with a single linear memory allocation and do the indexing math yourself, like so:
use std::ops::{Index, IndexMut};

#[derive(Debug, Default)]
pub struct Matrix<T> {
    /// (rows, cols)
    size: (usize, usize),
    /// Row-major storage: element (i, j) lives at data[i * cols + j].
    data: Vec<T>,
}

impl<T: Default + Copy> Matrix<T> {
    fn zeros(rows: usize, cols: usize) -> Self {
        Self {
            size: (rows, cols),
            data: vec![T::default(); rows * cols],
        }
    }

    pub fn new(rows: usize, cols: usize, data: Vec<T>) -> Self {
        // The flat buffer must match the declared shape.
        assert_eq!(data.len(), rows * cols);
        Self { size: (rows, cols), data }
    }

    pub fn rows(&self) -> usize {
        self.size.0
    }

    pub fn cols(&self) -> usize {
        self.size.1
    }

    pub fn transpose(&self) -> Self {
        // The result has its dimensions swapped.
        let mut result = Self::zeros(self.cols(), self.rows());
        for i in 0..self.rows() {
            for j in 0..self.cols() {
                result[(j, i)] = self[(i, j)];
            }
        }
        result
    }
}

impl<T> Index<(usize, usize)> for Matrix<T> {
    type Output = T;

    fn index(&self, (i, j): (usize, usize)) -> &Self::Output {
        let (_, cols) = self.size;
        &self.data[i * cols + j]
    }
}

impl<T> IndexMut<(usize, usize)> for Matrix<T> {
    fn index_mut(&mut self, (i, j): (usize, usize)) -> &mut Self::Output {
        let (_, cols) = self.size;
        &mut self.data[i * cols + j]
    }
}
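For example (hypothetical usage, values made up), summing over K for each N with this type would look like:

fn main() {
    // K = 2 rows, N = 3 columns, stored row-major.
    let m = Matrix::new(2, 3, vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0]);
    let t = m.transpose(); // shape [N, K]

    // Sum over K for each N by walking the rows of the transpose.
    let sums: Vec<f64> = (0..t.rows())
        .map(|n| (0..t.cols()).map(|k| t[(n, k)]).sum())
        .collect();
    assert_eq!(sums, vec![5.0, 7.0, 9.0]);
}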
Those tests are small enough to fit in L1 cache. You'll see a wide variety of effects as you increase the size of the matrix. In one of my tests, I was able to infer the size of each level of cache and the TLB. You might want to consider a block transpose if your problems are large.
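A rough sketch of a blocked transpose on top of the Matrix type above (the block size is illustrative and would need tuning for your element size and cache):

impl<T: Default + Copy> Matrix<T> {
    // Transpose tile by tile so reads and writes stay within
    // cache-sized blocks instead of striding across the whole matrix.
    pub fn transpose_blocked(&self) -> Self {
        const BLOCK: usize = 32;
        let mut result = Self::zeros(self.cols(), self.rows());
        for bi in (0..self.rows()).step_by(BLOCK) {
            for bj in (0..self.cols()).step_by(BLOCK) {
                for i in bi..(bi + BLOCK).min(self.rows()) {
                    for j in bj..(bj + BLOCK).min(self.cols()) {
                        result[(j, i)] = self[(i, j)];
                    }
                }
            }
        }
        result
    }
}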
My point was to avoid double indirection in general. There are well-established libraries like nalgebra and ndarray, but if you want to do everything yourself, go ahead!
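For instance, with ndarray (a rough sketch; the shapes and values are made up), the sum over K for each N is just a reduction along an axis, with no explicit transpose:

use ndarray::{Array2, Axis};

fn main() {
    // K = 2 rows, N = 3 columns, row-major data.
    let a = Array2::from_shape_vec((2, 3), vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0]).unwrap();
    // Sum over K (axis 0) for each N.
    let sums = a.sum_axis(Axis(0));
    assert_eq!(sums, ndarray::arr1(&[5.0, 7.0, 9.0]));
}

If I remember right, ndarray also has an optional rayon feature if you do need parallel iteration on top of that.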