I am implementing polynomial multiplication for polynomials with large coefficients. This is done by doing the multiplication (or convolution if you will) modulo some set of 64-bit primes and then using the Chinese Remainder theorem to reconstruct the result. I split up the work using a parallel iterator from rayon over the vector of primes (the number of primes is around 32-256, and doing the work for one prime is rather expensive, since it is a convolution of a length 2^18 vector which takes a few milliseconds). So the workload should actually be enough that the overhead of rayon is neglegible. However, when benchmarking against the sequential solution using criterion, only around 70% of my CPU is used by the parallel code. I tried to increase the number of threads rayon uses even beyond the number of cores of my CPU, but that didn't help either.

Can you share any of your code? It's possible that you've accidentally introduced a sequential bottleneck or some other inefficiency, but it's hard to guess blindly.

The follwing is in the 'Poly' struct impl. Im not doing a real polynomial multiplication, but something very similar called the middle product, but that should not be relevant to the question. After some assertions, the remainders of the polynomial coefficients modulo the crt primes are computed by crt.split (x and y are Vec<Vec>, for every prime they contain the original vector mod that prime). Then, the middle product modulo those primes is computed for every prime and from these results the actual result is reconstructed by crt.reconstruct.

```
// Let n = self.len(). rhs must be a polynomial of length 2 * n, but with the highest
// coefficient set to 0. Computes a polynomial of length n, where the i-th coefficient is the
// (n - 1 + i)-th coefficient of self * rhs (the resulting polynomial consists of the n middle
// coefficients of self * rhs). This runs twice as fast as doing the same by a standard
// multiplication.
pub fn middle_product(&self, rhs: &Poly, modulus: &Integer) -> Poly {
assert!(self.len().is_power_of_two());
assert_eq!(self.len() * 2, rhs.len());
assert!(rhs.as_vec().last().unwrap().is_zero());
debug_assert!(self
.as_vec()
.iter()
.fold(true, |is_reduced, elem| is_reduced
&& &0 <= elem
&& elem <= modulus));
debug_assert!(rhs.as_vec().iter().fold(true, |is_reduced, elem| is_reduced
&& &0 <= elem
&& elem <= modulus));
let n = self.len() << 1;
let crt = Crt::new(modulus, n);
let mut x = crt.split(self.as_vec());
let y = crt.split(rhs.as_vec());
for x_i in x.iter_mut() {
x_i.resize(n, 0);
}
// here is the core loop, ntt::middle_product is sequential
let z: Vec<Vec<u64>> = crt
.moduli()
.par_iter()
.zip(x.into_par_iter().zip(y.into_par_iter()))
.map(|(p, (x_i, y_i))| ntt::middle_product(x_i, y_i, *p))
.collect();
let mut product = crt.reconstruct(&z);
product.par_iter_mut().for_each(|elem| *elem %= modulus);
Poly(product)
}
```

The crt.split and reconstruct functions are similarly parallelized:

```
// Splits the coefficients of a vector over the CRT moduli.
pub fn split(&self, x: &[Integer]) -> Vec<Vec<u64>> {
let mut result = vec![vec![0u64; x.len()]; self.primes.len()];
x.par_iter().enumerate().for_each(|(i, x_i)| {
self.split_div_and_conq(x_i.clone(), &result, i, 1, 0, self.primes.len())
});
result
}
// Reconstructs a vector from it's congruences modulo the CRT moduli.
pub fn reconstruct(&self, x: &[Vec<u64>]) -> Vec<Integer> {
assert_eq!(x.len(), self.primes.len());
(0..x[0].len())
.into_par_iter()
.map(|i| {
let r = self.reconstruct_div_and_conq(x, i, 1, 0, x.len()) % &self.tree[1];
// Perform a sanity check on the reconstruction when in debug mode.
debug_assert!(self
.primes
.iter()
.enumerate()
.all(|(j, p)| r.mod_u64(*p) == x[j][i] % p));
r
})
.collect()
}
```

The functions they call are sequential.

Does that do anything that would force inter-thread blocking? (like using a `Mutex`

)

Otherwise, nothing looks obviously bad to me. Maybe you can use a tool like `perf`

to see where you are actually spending your time, which may make it clear why that's not parallel. You could also use a debugger to attach during a low CPU-usage part of the run, and get each thread backtrace to see what they're up to and what they may be blocked on.

No, `ntt::middle_product`

is completely independent (it's in fact a pure function). Thank you for these suggestions, I'll try them out. I also tried parallelizing it manually without rayon and got similar results. This suggests that it's definitely not related to rayon.