Rayon and work locality over large buffers with large thread pools

I'm working on an encryption library, and one of the steps is to encrypt a bunch of blocks in parallel. I'm using a very simple rayon implementation for that, and overall it works pretty well -- up to a point, more threads means more performance. However, even on a machine with plenty of headroom, 32 threads beats 64, which makes me suspect we can do better. Here's the naive implementation:

use rayon::prelude::*;

fn encrypt(
    block_pairs: Vec<(&[u8], &mut [u8])>,
    key: &Key,
) -> Result<()> {
    block_pairs
        .into_par_iter()
        .enumerate()
        .try_for_each(|(block_num, (plaintext, ciphertext))| {
            encrypt_block(block_num, plaintext, ciphertext, key)
        })
}

Based on some perf dumps, though, a decent amount of rayon overhead shows up past a certain thread count, and intuitively I'd imagine we could improve performance here if each thread had better data locality, staying in a more bounded memory region instead of hopping all over the buffers.

So okay, let's try something like this:

use rayon::ThreadPoolBuilder;

let pool = ThreadPoolBuilder::new()
    .num_threads(num_threads)
    .build()?;

// Ceiling division: one contiguous chunk per thread.
let chunk_size = (block_pairs.len() + num_threads - 1) / num_threads;

// `scope` runs its closure on the pool's threads.
pool.scope(|s| {
    for (chunk_index, chunk) in block_pairs.chunks_mut(chunk_size).enumerate() {
        s.spawn(move |_| {
            chunk
                .iter_mut()
                .enumerate()
                .try_for_each(|(i, (plaintext, ciphertext))| {
                    // block_offset: global index of the first block (defined elsewhere).
                    let block_num = chunk_index * chunk_size + i + block_offset;
                    encrypt_block(block_num, plaintext, ciphertext, key)
                })
                .unwrap();
        });
    }
});

Now each thread works through a contiguous chunk linearly, with one spawned rayon task per chunk. This actually does improve performance at high thread counts, but not enough for the 64 thread case to beat the 32 thread case. I tried something similar with with_min_len, and that was similarly unimpactful.
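
For reference, the with_min_len attempt was just the naive version plus one adapter; sketching it from memory here (the function name is arbitrary, and min_len stands in for whatever floor I was testing):

use rayon::prelude::*;

fn encrypt_with_min_len(
    block_pairs: Vec<(&[u8], &mut [u8])>,
    key: &Key,
    min_len: usize,
) -> Result<()> {
    block_pairs
        .into_par_iter()
        // Don't let rayon split the work into pieces smaller than min_len blocks,
        // so each stolen task covers a contiguous run of the buffers.
        .with_min_len(min_len)
        .enumerate()
        .try_for_each(|(block_num, (plaintext, ciphertext))| {
            encrypt_block(block_num, plaintext, ciphertext, key)
        })
}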

My other note from profiling is that on a machine with a lot of cores, rayon is happy to bounce work around between all of them, and my suspicion is that this isn't helping either, but I could certainly be wrong there. I'm half tempted to drop down to actual std::thread, but that's quite a bit messier and I'd love to keep this relatively simple.

Are there any other parallelism strategies I should be considering, or simple rayon tweaks that might improve performance across many many threads? I did a quick test of splitting up the buffers into regions first, and that actually made it quite a bit worse, but I certainly might have messed that up.

I'm gonna test out a crossbeam thread version as well, but I'd appreciate any thoughts!

Update: the crossbeam version does quite a bit better with large thread pools -- 64 threads now approximately matches 32 rather than regressing significantly. Obviously the ideal would be for 64 threads to be twice as fast as 32, which of course isn't how it will actually shake out, but the gap is still big enough that I think I'm hitting significant contention somewhere.

crossbeam::thread::scope(|s| {
    for (chunk_index, chunk) in block_pairs.chunks_mut(chunk_size).enumerate() {
        s.spawn(move |_| {
            chunk
                .iter_mut()
                .enumerate()
                .try_for_each(|(i, (plaintext, ciphertext))| {
                    let block_num = chunk_index * chunk_size + i + block_offset;
                    encrypt_block(block_num, plaintext, ciphertext, key)
                })
                .unwrap();
        });
    }
}).unwrap();

Playing with thread pinning now, but if there's a friendlier method that'd be neat. :smile:

Amdahl's law

At some point the serial part of the algorithm outweighs the parallel part, and you can't get below that floor.
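
Concretely: if a fraction p of the work parallelizes and the rest is serial, the best you can do on N threads is a speedup of 1 / ((1 - p) + p / N), which never exceeds 1 / (1 - p) no matter how many threads you add.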

Certainly! I don't think I've actually hit that plateau yet, though, since changing the threading approach still speeds up the 64 thread case significantly, even with fairly basic tweaks. I'm definitely open to the idea that the naive rayon implementation is fairly close to optimal, but my gut says there's quite a bit of headroom, since this task parallelizes quite cleanly.

Okay, core_affinity makes a huge difference here; it nearly doubled throughput in the 64 and 96 thread cases. Gonna go ahead and call that a good enough answer for now: crossbeam plus core_affinity gets me dramatically improved performance.
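
For anyone who lands here later, the final version is roughly the crossbeam snippet above with each worker pinning itself before touching its chunk. The round-robin chunk-to-core mapping below is just what I tried, not necessarily the right policy for every topology:

let core_ids = core_affinity::get_core_ids().unwrap_or_default();

crossbeam::thread::scope(|s| {
    for (chunk_index, chunk) in block_pairs.chunks_mut(chunk_size).enumerate() {
        // Hand out cores round-robin; skip pinning if we couldn't enumerate cores.
        let core_id = if core_ids.is_empty() {
            None
        } else {
            Some(core_ids[chunk_index % core_ids.len()])
        };
        s.spawn(move |_| {
            // Pin this worker to its core before it starts on the chunk.
            if let Some(id) = core_id {
                core_affinity::set_for_current(id);
            }
            chunk
                .iter_mut()
                .enumerate()
                .try_for_each(|(i, (plaintext, ciphertext))| {
                    let block_num = chunk_index * chunk_size + i + block_offset;
                    encrypt_block(block_num, plaintext, ciphertext, key)
                })
                .unwrap();
        });
    }
}).unwrap();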
