Rayon-based parallel algorithm slower than non-parallel algorithm

Hello everyone,

I'm trying to write a function that implements the Doolittle algorithm to perform LU decomposition of a matrix, so I created this function:

pub fn dense(a: &Dense) -> (LowerTriangular, UpperTriangular)
{
    /*
    Implements LU decomposition for dense matrices
    using the Doolittle algorithm.
    */

    if a.n == a.m 
    {
        let n = a.n;

        let mut l = LowerTriangular::eye(n);
        let mut u = UpperTriangular::zeros(n); 

        let mut sum = 0f64;

        for i in 0..(n-1)
        {
            for j in i..n 
            {
                // Sum calculus 
                sum = 0f64;

                for k in 0..i
                {
                    sum = sum + l.get(i,k)*u.get(k,j);
                }


                u.set(i,j, a.get(i,j) - sum);
            }

            for j in (i+1)..n 
            {
                // Sum calculus
                sum = 0f64;

                for k in 0..i
                {
                    sum = sum + l.get(j,k)*u.get(k,i);
                }

                l.set(j, i, (a.get(j,i) - sum) / u.get(i,i));
            }
        }

        // Sum calculus 
        sum = 0f64; 

        for k in 0..n
        {
            sum = sum + l.get(n-1,k)*u.get(k,n-1);
        }

        u.set(n-1, n-1, a.get(n-1,n-1) - sum);

        return (l,u);
    }
    else
    {
        panic!("Impossible LU factorization");
    }
}

This works fine. Next, I want a parallel function that does the same thing (I use https://docs.rs/rayon/1.3.0/rayon/ ):

pub fn par_dense2(a: &Dense) -> (LowerTriangular, UpperTriangular)
{
    /*
    Implement LU decomposition for dense matrices
    This function use Doolitle Algorithm 
    */ 

    if a.n == a.m 
    {
        let n = a.n;

        let l = Arc::new(RwLock::new(LowerTriangular::eye(n)));
        let u = Arc::new(RwLock::new(UpperTriangular::zeros(n)));

        // Variables for parallel operations 
        let mut iterator: Vec<usize>;

        for i in 0..(n-1)
        {
            iterator = (i..n).collect();

            iterator.par_iter().for_each(|j|
            {
                // Sum calculus 
                let mut sum = 0f64;

                for k in 0..i
                {
                    sum = sum + (l.read().unwrap()).get(i,k)*(u.read().unwrap()).get(k,*j);
                }


               u.write().unwrap().set(i,*j, a.get(i,*j) - sum);
            });

            iterator = ((i+1)..n).collect();

            iterator.par_iter().for_each(|j|
            {
                // Sum calculus
                let mut sum = 0f64;

                for k in 0..i
                {
                    sum = sum + (l.read().unwrap()).get(*j,k)*(u.read().unwrap()).get(k,i);
                }

                l.write().unwrap().set(*j,i, (1f64/(u.read().unwrap()).get(i,i)*(a.get(*j,i)-sum)));
            });
        }

        // Sum calculus 
        let mut sum = 0f64; 

        for k in 0..n
        {
            sum = sum + (l.read().unwrap()).get(n-1,k)*(u.read().unwrap()).get(k,n-1);
        }

        u.write().unwrap().set(n-1, n-1, a.get(n-1,n-1) - sum);

        return (
            Arc::try_unwrap(l).unwrap().into_inner().unwrap(),
            Arc::try_unwrap(u).unwrap().into_inner().unwrap(),
        );
    }
    else
    {
        panic!("Impossible LU factorization");
    }
}

I'm sorry for the length of the code, but with a 1000x1000 matrix, the first function needs 3 seconds while the second needs more than a minute. I tried Mutex instead of RwLock and it's (surprisingly) faster, but still around a minute. I don't understand why the parallel version is so slow, and I would like a solution to optimize my parallel function.

Thank you !

You're paying a rather large cost by making the threads wait for each other with locks. Since you do so much locking and unlocking (one per call to read or write), the cost is very large.

Can you post the definition of those types, so I can try it out for myself?


That's why I chose RwLock instead of Mutex; in my mind, RwLock doesn't block a thread when it only reads the value.

I would like to post the entire source code, but I don't know how to do that on the forum ^^'

edit: I created a GitHub repo

RwLock still does locking when reading; in fact, the locking operation itself is more expensive than a mutex's, afaik. It's just that it allows multiple readers, so if the read is much more work than the locking, it can be more efficient.

Try to post a minimal snippet on the playground which you can share with the share button. Try to just include the relevant parts.


The full code is here : https://github.com/Anthomedal/rustnum

(in the 7z archive)

Functions are in ./rustnum/src/linear_algebra/operations/decomposition/lu.rs

Locks are expensive, and wrapping the entire matrix in a lock loses all the benefits of parallel computing, because every time you call RwLock::write, no other thread can do anything. This basically forces rayon to work in serial rather than in parallel. This setup also has high contention on the locks, due to all of the threads hammering the same lock, which can slow it down further.

One way to make this faster would be to use atomics (though that's hard to do with floats), or to wrap each element in its own lock to reduce contention.


Okay, I see the problem

What do you mean by "element"?

Every element of the matrix


Oh okay, it could be done by creating two other matrices used only for reading and updating them after each iteration of the outer loop (over i), but it's not clean in my opinion...

You would probably want to write the result to a separate buffer, then copy it into the matrix. Something like this:

// Reuse the memory allocation.
let mut buffer = vec![0f64; n];

for i in 0..(n-1) {
    (&mut buffer[i..n]).par_iter_mut().enumerate().for_each(|(j, ptr)| {
        // adjust the index (enumerate starts at zero)
        let j = i + j;
        // Sum calculus 
        let mut sum = 0f64;

        for k in 0..i {
            sum = sum + l.get(i,k)*u.get(k,j);
        }

        *ptr = a.get(i,j) - sum;
    });
    for j in i..n {
        u.set(i,j, buffer[j]);
    }

    (&mut buffer[(i+1)..n]).par_iter_mut().enumerate().for_each(|(j, ptr)| {
        let j = i+1 + j;
        // Sum calculus
        let mut sum = 0f64;

        for k in 0..i {
            sum = sum + l.get(j,k)*u.get(k,i);
        }

        *ptr = 1f64/u.get(i,i)*(a.get(j,i)-sum);
    });
    for j in (i+1)..n {
        l.set(j,i, buffer[j]);
    }
}

You gave me an idea :

let mut l = LowerTriangular::eye(n);
let mut u = UpperTriangular::zeros(n);

// Variables for parallel operations
let mut iterator: Vec<usize>;
let v = Arc::new(Mutex::new(vec![0f64; n]));


    for i in 0..(n-1)
    {
        println!("{}", i);
        iterator = (i..n).collect();

        iterator.par_iter().for_each(|j|
        {
            // Sum calculus 
            let mut sum = 0f64;

            for k in 0..i
            {
                sum = sum + l.fast_get(i,k)*u.fast_get(k,*j);
            }


            v.lock().unwrap()[*j] = a.fast_get(i,*j) - sum;
        });

        for j in i..n 
        {
            u.fast_set(i,j, v.lock().unwrap()[j]);
        }

        iterator = ((i+1)..n).collect();

        iterator.par_iter().for_each(|j|
        {
            // Sum calculus
            let mut sum = 0f64;

            for k in 0..i
            {
                sum = sum + l.fast_get(*j,k)*u.fast_get(k,i);
            }

            v.lock().unwrap()[*j] = (1f64/u.fast_get(i,i))*(a.fast_get(*j,i)-sum);
        });

        for j in (i+1)..n 
        {
            l.fast_set(j,i, v.lock().unwrap()[j]);
        }
    }

Which works fine! Thank you!