Parallel Gaussian elimination

I am trying to parallelize an implementation of Gaussian elimination. I am aware that crates such as rayon might make this easier, but my goal is an mpsc::channel-based implementation, so I chose to use a threadpool.

Here is an overview of the code base. Some of the features are redundant, but this code was ported from C, so I tried to maintain the structure.

The data structure passed around contains the size of the matrix (redundant), the matrix, four verification vectors (used to determine correctness), and the number of threads:

pub struct Data {
    pub nsize: usize,
    pub matrix: Vec<Vec<f64>>,
    pub b: Vec<f64>,
    pub c: Vec<f64>,
    pub v: Vec<f64>,
    pub swap: Vec<u64>,
    pub num_threads: usize,
}

The sequential algorithm:

pub fn compute_gauss(data: &mut Data) {
    for i in 0..data.nsize {
        pivot(data, i);

        let mut pivot_val;

        for j in i + 1..data.nsize {
            pivot_val = data.matrix[j][i];
            data.matrix[j][i] = 0.0;
            for k in i + 1..data.nsize {
                data.matrix[j][k] -= pivot_val * data.matrix[i][k];
            }
            data.b[j] -= pivot_val * data.b[i];
        }
    }
}

The pivot(data, index) function mutates the matrix inside data.

My current implementation of parallel gauss:

pub fn compute_gauss_p(data: Data) -> Data {
    let num_threads = data.num_threads;
    let pool = ThreadPool::new(data.num_threads);
    let nsize = data.nsize;
    let sdata = Arc::new(data);

    for i in 0..nsize {
        pivot(&mut *sdata, i); //compile error, as Arc does not implement DerefMut
        for n in 0..num_threads {
            let sdatacpy = Arc::clone(&sdata);
            pool.execute(move || {
                do_calc(sdatacpy, i + n);
            });
        }
        pool.join();
    }

    Arc::try_unwrap(sdata).unwrap()
}
fn do_calc(data: Arc<Data>, i: usize) { // compile error: cannot dereference mut from Arc
    let mut pivot_val;
    let num_threads = data.num_threads; 
    for j in (i + 1..data.matrix.len()).step_by(num_threads) {
        pivot_val = data.matrix[j][i];
        data.matrix[j][i] = 0.0;
        for k in i + 1..data.matrix.len() {
            data.matrix[j][k] -= pivot_val * data.matrix[i][k];
        }
        data.b[j] -= pivot_val * data.b[i];
    }
}

I have thought about using a single Mutex to guard the matrix, but that would serialize the work.
My next thought is to treat the matrix as a vector of mutexes, each containing a row. This would allow me to divide up the work. The matrix would become:

Vec<Mutex<Vec<f64>>>

However, I am wondering whether this is the correct path. I would like to avoid unsafe pointer manipulation (otherwise I could just use my C version) while continuing to use a paradigm built around mpsc::channels (no rayon or other library that doesn't use channels).

My rough (untested) translation of this for rayon would look something like this:

pub fn compute_gauss(data: &mut Data) {
    for i in 0..data.nsize {
        pivot(data, i);

        let (row_i, rows) = data.matrix[i..data.nsize].split_first_mut().unwrap();

        rows.par_iter_mut().for_each(|row_j| {
            let pivot_val = row_j[i];
            row_j[i] = 0.0;
            for k in i + 1..data.nsize {
                row_j[k] -= pivot_val * row_i[k];
            }
            data.b[j] -= pivot_val * data.b[i];
        });
    }
}

You probably need to pull out references to the other data members before the parallel loop too, to avoid borrowing conflicts in the closure. Anyway, the idea is to split off the mutable part so each thread can work on it independently, which you can also do with other threadpool implementations.

edit: I missed that data.b is also modified -- definitely need to fix that, but you can split it the same way as the matrix and use a parallel zip.

I would try to split the data that the different threads have to access in contiguous blocks, then use split_at_mut and chunks_mut to actually obtain them and send them to the different threads.

This, however, requires ThreadPool::execute to accept non-'static closures, which doesn't seem to be the case. You need something like rayon::scope or crossbeam::scope for this, but at that point you may as well use rayon's ParallelIterator.