Rayon transpose of Vec<Vec<T>>

Hi folks

I have seen this implementation of an iterator based transpose operation for Vec<Vec> and I wasn't able to use Rayon to speed things up:

fn transpose<T>(v: Vec<Vec<T>>) -> Vec<Vec<T>> {
    assert!(!v.is_empty());
    let len = v[0].len();
    let mut iters: Vec<_> = v.into_iter().map(|n| n.into_iter()).collect();
    (0..len)
        .map(|_| {
            iters
                .iter_mut()
                .map(|n| n.next().unwrap())
                .collect::<Vec<T>>()
        })
        .collect()
}

In my use case, v's second dimension (aka len) is much bigger than the first. Any ideas on what I could do here?

I would not expect to see much (if any) benefit from parallelizing this, since it is doing no real computation, and seems likely to be entirely memory-bound. But one approach would be:

pub fn transpose<T: Send + Sync + Copy>(v: Vec<Vec<T>>) -> Vec<Vec<T>> {
    assert!(!v.is_empty());
    let len = v[0].len();
    (0..len).into_par_iter()
        .map(|i| v.iter().map(|row| row[i]).collect())
        .collect()
}

(Playground)

If you need this to work on elements that you can't copy or clone cheaply, then it would get more complicated. I'm not sure, but it might require unsafe code to make a parallel by-value iterator that goes by columns.

2 Likes

I'd use an actual 2d array type for this, there are several crates that provide those.

1 Like

I would not expect to see much (if any) benefit from parallelizing this, since it is doing no real computation

I want to transpose a vec of shape [K,N] to [N,K] so I can sum over the Ks for each N. I'm not sure if this will be much faster than doing sequentially either.

I am actually translating a piece of parallelized Julia code. At this point I'm basically using thread pools and channels to send chunks of Vecs and returning others. Do you recommend any linear algebra or array crate that would help on this (and somewhat easy to use with Rayon or having builtin parallel primitives)?

You should avoid doing double level of indirection Vec<Vec<>> an alternative is define a type with a simple linear memory allocation and do the math for indexing elements, like so:

use std::ops::{Index, IndexMut};

#[derive(Debug, Default)]
pub struct Matrix<T> {
    size: (usize, usize),
    data: Vec<T>,
}

impl<T: Default + Copy> Matrix<T> {
    fn zeros(x: usize, y: usize) -> Self {
        Self {
            size: (x, y),
            data: vec![T::default(); x * y],
        }
    }

    pub fn new(x: usize, y: usize, data: Vec<T>) -> Self {
        Self { size: (x, y), data }
    }

    fn rows(&self) -> usize {
        self.size.0
    }

    fn cols(&self) -> usize {
        self.size.1
    }

    pub fn transpose(&self) -> Self {
        let mut result = Self::zeros(self.rows(), self.cols());
        for i in 0..self.rows() {
            for j in 0..self.cols() {
                result[(i, j)] = self[(j, i)];
            }
        }
        result
    }
}

impl<T> Index<(usize, usize)> for Matrix<T> {
    type Output = T;
    fn index(&self, (x, y): (usize, usize)) -> &Self::Output {
        let (w, _) = self.size;
        &self.data[y * w + x]
    }
}

impl<T> IndexMut<(usize, usize)> for Matrix<T> {
    fn index_mut(&mut self, (x, y): (usize, usize)) -> &mut Self::Output {
        let (w, _) = self.size;
        &mut self.data[y * w + x]
    }
}

i have a 3x speed-up with that change

here is the benchs:

2 Likes

Those tests are small enough to fit in L1 cache. You'll see a wide variety of affects as you increase the size of the matrix. In one of my tests, I was able to infer the size of each level of cache and the TLB. You might want to consider a block transpose if your problems are large.

1 Like

The size N will certainly not be small, eg., 128, 256, 1k, 2k, 16k and beyond.

This transpose is related to another post I did. Perhaps replacing Vec<Vec<f64x4>> with the Matrix and block transpose might give enough boost.

my point was to avoid double indirection in general. Now there are well established libraries like nalgebra and ndarray but if you want to do everything yourself go ahead !!!

2 Likes

Ndarray has a rayon feature, maybe that fits your usecase?

I tried it, but the performance was actually worse :confused: