I'm working on an encryption library, and one of the steps is to encrypt a bunch of blocks in parallel. I'm using a very simple rayon implementation for that, and overall it works pretty well -- up to a point, more threads means more performance. However, even on a machine with plenty of headroom, 32 threads beats 64, which makes me suspect we can do better. Here's the naive implementation:
use rayon::prelude::*;

// Encrypt each (plaintext, ciphertext) pair; rayon decides how to split the work.
fn encrypt(
    block_pairs: Vec<(&[u8], &mut [u8])>,
    key: &Key,
) -> Result<()> {
    block_pairs
        .into_par_iter()
        .enumerate()
        .try_for_each(|(block_num, (plaintext, ciphertext))| {
            encrypt_block(block_num, plaintext, ciphertext, key)
        })
}
Based on some perf dumps, though, past a certain thread count a decent amount of time goes to rayon's own overhead, and intuitively I'd imagine we could improve performance if each thread had better data locality, staying in a bounded memory region instead of hopping all over the buffers.
So okay, let's try something like this:
use rayon::ThreadPoolBuilder;

let pool = ThreadPoolBuilder::new()
    .num_threads(num_threads)
    .build()?;

// Ceiling division: one contiguous chunk per thread.
let chunk_size = (block_pairs.len() + num_threads - 1) / num_threads;

pool.scope(|s| {
    for (chunk_index, chunk) in block_pairs.chunks_mut(chunk_size).enumerate() {
        s.spawn(move |_| {
            chunk
                .iter_mut()
                .enumerate()
                .try_for_each(|(i, (plaintext, ciphertext))| {
                    // block_offset: first block number of this buffer (defined elsewhere)
                    let block_num = chunk_index * chunk_size + i + block_offset;
                    encrypt_block(block_num, plaintext, ciphertext, key)
                })
                .unwrap();
        });
    }
});
Now each thread processes a contiguous chunk linearly, and we only spawn one rayon task per chunk. This actually does improve performance at high thread counts, but not enough for the 64-thread case to beat the 32-thread one. I tried something similar with with_min_len, and that was similarly unimpactful.
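For reference, the with_min_len attempt looked roughly like this (the 1024 floor is just an example value; with_min_len only caps how finely rayon subdivides the iterator, it doesn't pin chunks to particular threads):

use rayon::prelude::*;

// Ask rayon not to split the work below `min_len` items per task,
// so each stolen task covers a contiguous run of blocks.
block_pairs
    .into_par_iter()
    .enumerate()
    .with_min_len(1024) // example value, needs tuning
    .try_for_each(|(block_num, (plaintext, ciphertext))| {
        encrypt_block(block_num, plaintext, ciphertext, key)
    })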
My other note from profiling is that on a machine with a lot of cores, rayon is happy to bounce work around between all of them, and my suspicion is that this isn't helping either, but I could certainly be wrong there. I'm half tempted to use actual std::thread, but that's quite a bit messier and I'd love to keep this relatively simple.
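One rayon-level tweak I might still try for the core-bouncing is pinning the pool's workers at startup, via start_handler plus the core_affinity crate. Rough sketch (the worker-index-to-core-ID mapping is an assumption on my part):

use rayon::ThreadPoolBuilder;

// Pin worker `index` to one core so the OS scheduler stops migrating it.
// Assumes the pool's worker indices map cleanly onto core IDs, which may
// not hold with SMT or unusual topologies.
let core_ids = core_affinity::get_core_ids().expect("failed to query core IDs");
let pool = ThreadPoolBuilder::new()
    .num_threads(num_threads)
    .start_handler(move |index| {
        core_affinity::set_for_current(core_ids[index % core_ids.len()]);
    })
    .build()?;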
Are there any other parallelism strategies I should be considering, or simple rayon tweaks that might improve scaling at high thread counts? I did a quick test of splitting the buffers into regions first, and that actually made things quite a bit worse, but I certainly might have messed that up.
I'm gonna test out a crossbeam
thread version as well, but I'd appreciate any thoughts!
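For the crossbeam version, what I have in mind is basically the same chunked layout as above, just with plain scoped threads and no work stealing. A sketch (same elided block_offset as before, error handling via unwrap):

// One OS thread per contiguous chunk; each thread walks its chunk
// linearly, so work never migrates between cores mid-run.
let chunk_size = (block_pairs.len() + num_threads - 1) / num_threads;
crossbeam::thread::scope(|s| {
    for (chunk_index, chunk) in block_pairs.chunks_mut(chunk_size).enumerate() {
        s.spawn(move |_| {
            for (i, (plaintext, ciphertext)) in chunk.iter_mut().enumerate() {
                let block_num = chunk_index * chunk_size + i + block_offset;
                encrypt_block(block_num, plaintext, ciphertext, key).unwrap();
            }
        });
    }
})
.unwrap();

(On Rust 1.63+, std::thread::scope would do the same job without the extra dependency.)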