Limiting buffer memory in parallel code

If I have serial code like:

let data: Vec<u32> = generate_data(); // About 30 megabytes of data.
let mut buffer: Vec<u32> = Vec::with_capacity(data.len());
let result = (0 .. 30)
             .map(|i| my_func(&data, i, &mut buffer))
             .max()
             .unwrap();

Where:

fn my_func(data: &[u32], i: usize, buffer: &mut Vec<u32>) -> usize {
    buffer.clear(); // Keeps capacity.
    // ...
}

I've measured a moderate speedup in a simple parallel version like this using Rayon:

let data: Vec<u32> = generate_data();
let result = (0 .. 30)
             .into_par_iter()
             .map(|i| my_func(&data, i, &mut Vec::with_capacity(data.len())))
             .max()
             .unwrap();

But that creates and allocates a new buffer 30 times (and each buffer is about 30 MB). Do you know a way to write the code so that it allocates fewer buffers, ideally only as many as there are cores (say, four or eight)? That should be faster (and probably also use less memory) than the naive Rayon version above.

I think map_init will suit you well for your buffer. It's not exactly per-thread/core, but rather per-task, since rayon splits the work up dynamically; still, it should do better than your naive version.


Thank you for your suggestion. Unfortunately I haven't seen any speedup. So I've tried to count how many jobs Rayon creates with code like this (slightly modified from my actual code, so it may not be 100% correct; also, I'm still rather ignorant about parallelism):

use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
let counter = Arc::new(AtomicUsize::new(0));

let data: Vec<u32> = generate_data();
let result = (0 .. 30)
             .into_par_iter()
             .map_init(|| { counter.fetch_add(1, Ordering::SeqCst); data.to_vec() },
                       |buf, i| my_func(&data, i, buf))
             .max()
             .unwrap();
             
println!("{}", counter.load(Ordering::SeqCst)); // prints 30

And it seems to print 30, instead of a more reasonable number such as 8. How do I reduce the number of jobs?

Make sure you've measured that allocations are a significant part of your profile, otherwise even perfect optimization of this will do you little good.

Yes, that's a consequence of rayon's adaptive work stealing -- we don't know whether the job will move to a different thread or not when we're splitting, so map_init assumes the worst. Even when two jobs do run on the same thread, they may be nested if one blocked on any rayon call and started executing the other, so we can't give them the same aliased &mut T.

You can set a lower bound using .with_min_len(N), but note that it will still split in halves, so it won't necessarily reach your minimum exactly. For example, if you have 30 items with min 10, it will first split to 15+15, and then it can't go further because that would make them less than 10.

> Make sure you've measured that allocations are a significant part of your profile, otherwise even perfect optimization of this will do you little good.

It seems memory allocations account for a significant percentage of the run-time in both the serial and parallel versions. That's why I used the buffer optimization in the serial version too.

.with_min_len(N) reduces the number of jobs. Combined with num_threads() to leave some of the virtual cores unused, the code is now faster than the naive parallel version. I think this is enough. Thank you :slight_smile:


Actually, I am surprised that it's splitting entirely into single items -- I expected some sharing here. On my 16-thread system, I have to crank it up to around 200 items before I get even minimal sharing. I will definitely look into this...
