Hello everyone,

I try to create function that implement the Doolitle algorithm to perform LU decomposition of matrix, so I created this function :

```
pub fn dense(a: &Dense) -> (LowerTriangular, UpperTriangular)
{
/*
Implement LU decomposition for dense matrices
This function use Doolitle Algorithm
* /
if a.n == a.m
{
let n = a.n;
let mut l = LowerTriangular::eye(n);
let mut u = UpperTriangular::zeros(n);
let mut sum = 0f64;
for i in 0..(n-1)
{
for j in i..n
{
// Sum calculus
sum = 0f64;
for k in 0..i
{
sum = sum + l.get(i,k)*u.get(k,j);
}
u.set(i,j, a.get(i,j) - sum);
}
for j in (i+1)..n
{
// Sul calculus
sum = 0f64;
for k in 0..i
{
sum = sum + l.get(j,k)*u.get(k,i);
}
l.set(j,i, (1f64/u.get(i,i)*(a.get(j,i)-sum)));
}
}
// Sum calculus
sum = 0f64;
for k in 0..n
{
sum = sum + l.get(n-1,k)*u.get(k,n-1);
}
u.set(n-1, n-1, a.get(n-1,n-1) - sum);
return (l,u);
}
else
{
panic!("Impossible LU factorization");
}
}
```

Which works fine, finally I want to create parallel function that do the same (I use https://docs.rs/rayon/1.3.0/rayon/ ):

```
pub fn par_dense2(a: &Dense) -> (LowerTriangular, UpperTriangular)
{
/*
Implement LU decomposition for dense matrices
This function use Doolitle Algorithm
*/
if a.n == a.m
{
let n = a.n;
let mut l = Arc::new(RwLock::new(LowerTriangular::eye(n)));
let mut u = Arc::new(RwLock::new(UpperTriangular::zeros(n)));
// Variables for parallel operations
let mut iterator: Vec<usize>;
for i in 0..(n-1)
{
iterator = (i..n).collect();
iterator.par_iter().for_each(|j|
{
// Sum calculus
let mut sum = 0f64;
for k in 0..i
{
sum = sum + (l.read().unwrap()).get(i,k)*(u.read().unwrap()).get(k,*j);
}
u.write().unwrap().set(i,*j, a.get(i,*j) - sum);
});
iterator = ((i+1)..n).collect();
iterator.par_iter().for_each(|j|
{
// Sul calculus
let mut sum = 0f64;
for k in 0..i
{
sum = sum + (l.read().unwrap()).get(*j,k)*(u.read().unwrap()).get(k,i);
}
l.write().unwrap().set(*j,i, (1f64/(u.read().unwrap()).get(i,i)*(a.get(*j,i)-sum)));
});
}
// Sum calculus
let mut sum = 0f64;
for k in 0..n
{
sum = sum + (l.read().unwrap()).get(n-1,k)*(u.read().unwrap()).get(k,n-1);
}
u.write().unwrap().set(n-1, n-1, a.get(n-1,n-1) - sum);
return (Arc::try_unwrap(l).unwrap().into_inner().unwrap(),Arc::try_unwrap(u).unwrap().into_inner().unwrap());
}
else
{
panic!("Impossible LU factorization");
}
}
```

I'm sorry for the length of the code, but in fact the first function, with 1000x1000 matrix, need 3 seconds and the second function need more than 1 minute. I tried with Mutex instead of RwLock but it's (surprisingly) faster but always around 1 minute. I don't understand why parallel version is so slow, and I would like a solution to optimize my parallel function.

Thank you !