Rayon-based parallel algorithm slower than non-parallel algorithm

Hello everyone,

I'm trying to write a function that implements the Doolittle algorithm to perform LU decomposition of a matrix, so I created this function:

pub fn dense(a: &Dense) -> (LowerTriangular, UpperTriangular)
{
    /*
    Implements LU decomposition for dense matrices
    using the Doolittle algorithm.
    */

    if a.n == a.m 
    {
        let n = a.n;

        let mut l = LowerTriangular::eye(n);
        let mut u = UpperTriangular::zeros(n); 

        let mut sum = 0f64;

        for i in 0..(n-1)
        {
            for j in i..n 
            {
                // Sum calculus 
                sum = 0f64;

                for k in 0..i
                {
                    sum = sum + l.get(i,k)*u.get(k,j);
                }


                u.set(i,j, a.get(i,j) - sum);
            }

            for j in (i+1)..n 
            {
                // Sum calculus
                sum = 0f64;

                for k in 0..i
                {
                    sum = sum + l.get(j,k)*u.get(k,i);
                }

                l.set(j, i, (a.get(j,i) - sum) / u.get(i,i));
            }
        }

        // Sum calculus 
        sum = 0f64; 

        for k in 0..n
        {
            sum = sum + l.get(n-1,k)*u.get(k,n-1);
        }

        u.set(n-1, n-1, a.get(n-1,n-1) - sum);

        return (l,u);
    }
    else
    {
        panic!("Impossible LU factorization");
    }
}

This works fine. Next, I want a parallel function that does the same thing (I use https://docs.rs/rayon/1.3.0/rayon/ ):

pub fn par_dense2(a: &Dense) -> (LowerTriangular, UpperTriangular)
{
    /*
    Implement LU decomposition for dense matrices
    This function use Doolitle Algorithm 
    */ 

    if a.n == a.m 
    {
        let n = a.n;

        let l = Arc::new(RwLock::new(LowerTriangular::eye(n)));
        let u = Arc::new(RwLock::new(UpperTriangular::zeros(n)));

        // Variables for parallel operations 
        let mut iterator: Vec<usize>;

        for i in 0..(n-1)
        {
            iterator = (i..n).collect();

            iterator.par_iter().for_each(|j|
            {
                // Sum calculus 
                let mut sum = 0f64;

                for k in 0..i
                {
                    sum = sum + (l.read().unwrap()).get(i,k)*(u.read().unwrap()).get(k,*j);
                }


               u.write().unwrap().set(i,*j, a.get(i,*j) - sum);
            });

            iterator = ((i+1)..n).collect();

            iterator.par_iter().for_each(|j|
            {
                // Sum calculus
                let mut sum = 0f64;

                for k in 0..i
                {
                    sum = sum + (l.read().unwrap()).get(*j,k)*(u.read().unwrap()).get(k,i);
                }

                l.write().unwrap().set(*j,i, (1f64/(u.read().unwrap()).get(i,i)*(a.get(*j,i)-sum)));
            });
        }

        // Sum calculus 
        let mut sum = 0f64; 

        for k in 0..n
        {
            sum = sum + (l.read().unwrap()).get(n-1,k)*(u.read().unwrap()).get(k,n-1);
        }

        u.write().unwrap().set(n-1, n-1, a.get(n-1,n-1) - sum);

        return (
            Arc::try_unwrap(l).unwrap().into_inner().unwrap(),
            Arc::try_unwrap(u).unwrap().into_inner().unwrap(),
        );
    }
    else
    {
        panic!("Impossible LU factorization");
    }
}

I'm sorry for the length of the code, but with a 1000x1000 matrix, the first function needs 3 seconds while the second needs more than a minute. I tried Mutex instead of RwLock and it's (surprisingly) faster, but still around a minute. I don't understand why the parallel version is so slow, and I would like a solution to optimize my parallel function.

Thank you !

You're paying a rather large cost by making the threads wait for each other with locks. Since you do so much locking and unlocking (one per call to read or write), the cost is very large.

Can you post the definition of those types, so I can try it out for myself?


That's why I chose RwLock instead of Mutex; in my mind, RwLock doesn't block a thread when it only reads the value.

I would like to post the entire source code, but I don't know how to do that on the forum ^^'

edit: I created a GitHub repo

RwLock still does locking when reading; in fact, the locking operation itself is more expensive than a mutex's, afaik. It's just that it allows multiple readers, so if the read is much more work than the locking, it can be more efficient.

Try to post a minimal snippet on the playground which you can share with the share button. Try to just include the relevant parts.


The full code is here : https://github.com/Anthomedal/rustnum

(in the 7z archive)

Functions are in ./rustnum/src/linear_algebra/operations/decomposition/lu.rs

Locks are expensive, and wrapping the entire matrix in a lock loses all the benefits of parallel computing, because every time you call RwLock::write, no other thread can do anything. This basically forces rayon to work in serial rather than in parallel. This setup also has high contention on the locks, due to all of the threads hammering the same lock, which can slow it down further.

One way to make this faster would be to use atomics (though that's hard to do with floats), or to wrap each element in its own lock to reduce contention.


Okay, I see the problem

What do you mean by "element"?

Every element of the matrix


Oh okay, it could be done by creating two other matrices used only for reading and updating them after each iteration of the outer loop (over i), but it's not clean in my opinion...

You would probably want to write the result to a separate buffer, then copy it into the matrix. Something like this:

// Reuse the memory allocation.
let mut buffer = vec![0f64; n];

for i in 0..(n-1) {
    (&mut buffer[i..n]).par_iter_mut().enumerate().for_each(|(j, ptr)| {
        // adjust the index (enumerate starts at zero)
        let j = i + j;
        // Sum calculus 
        let mut sum = 0f64;

        for k in 0..i {
            sum = sum + l.get(i,k)*u.get(k,j);
        }

        *ptr = a.get(i,j) - sum;
    });
    for j in i..n {
        u.set(i,j, buffer[j]);
    }

    (&mut buffer[(i+1)..n]).par_iter_mut().enumerate().for_each(|(j, ptr)| {
        let j = i+1 + j;
        // Sum calculus
        let mut sum = 0f64;

        for k in 0..i {
            sum = sum + l.get(j,k)*u.get(k,i);
        }

        *ptr = 1f64/u.get(i,i)*(a.get(j,i)-sum);
    });
    for j in (i+1)..n {
        l.set(j,i, buffer[j]);
    }
}

You gave me an idea :

let mut l = LowerTriangular::eye(n);
let mut u = UpperTriangular::zeros(n);

// Variables for parallel operations
let mut iterator: Vec<usize>;
let v = Arc::new(Mutex::new(vec![0f64; n]));


    for i in 0..(n-1)
    {
        println!("{}", i);
        iterator = (i..n).collect();

        iterator.par_iter().for_each(|j|
        {
            // Sum calculus 
            let mut sum = 0f64;

            for k in 0..i
            {
                sum = sum + l.fast_get(i,k)*u.fast_get(k,*j);
            }


            v.lock().unwrap()[*j] = a.fast_get(i,*j) - sum;
        });

        for j in i..n 
        {
            u.fast_set(i,j, v.lock().unwrap()[j]);
        }

        iterator = ((i+1)..n).collect();

        iterator.par_iter().for_each(|j|
        {
            // Sum calculus
            let mut sum = 0f64;

            for k in 0..i
            {
                sum = sum + l.fast_get(*j,k)*u.fast_get(k,i);
            }

            v.lock().unwrap()[*j] = (1f64/u.fast_get(i,i))*(a.fast_get(*j,i)-sum);
        });

        for j in (i+1)..n 
        {
            l.fast_set(j,i, v.lock().unwrap()[j]);
        }
    }

Which works fine! Thank you!