Rayon with expensive-to-construct/combine accumulator

jacg · July 12, 2022, 6:18pm

I have a datastructure which is pretty expensive to construct.

let mut datastructure = empty_datastructure();

I have a bazillion data, which I place into this datastructure, very roughly like this:

for datum in data {
    put_in(datastructure, datum)
}

I am trying to parallelize this process with rayon, roughly like this:

data.par_iter()
    .fold(empty_datastructure, put_in)
    .reduce(empty_datastructure, combine_datastructures)

It turns out that rayon calls empty_datastructure (and, consequently, combine_datastructures) many more times than there are threads in the thread pool. As a result the process slows down enormously, because it spends a lot of time creating and combining many instances of the expensive datastructure.

Is there some way of persuading rayon to create only one accumulator per thread?

Or is rayon simply not the right tool for this kind of problem?
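
For comparison, one way to get exactly one accumulator per thread is to skip rayon's adaptive splitting and split the input by hand. The following is a minimal sketch using only std::thread::scope, not rayon. Histogram and fold_one_per_thread are hypothetical stand-ins for the real datastructure and driver, and the CONSTRUCTED counter exists only to make the number of constructions observable. It assumes the work is evenly spread, since fixed chunks give up rayon's load balancing.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Hypothetical stand-in for the expensive datastructure.
#[derive(Default)]
struct Histogram {
    counts: [u64; 16],
}

/// Counts how many accumulators have been constructed (illustration only).
static CONSTRUCTED: AtomicUsize = AtomicUsize::new(0);

fn empty_datastructure() -> Histogram {
    CONSTRUCTED.fetch_add(1, Ordering::Relaxed);
    Histogram::default()
}

/// Same shape as the fold operation in the question: (accumulator, datum) -> accumulator.
fn put_in(mut acc: Histogram, datum: &u32) -> Histogram {
    acc.counts[(*datum % 16) as usize] += 1;
    acc
}

fn combine_datastructures(mut a: Histogram, b: Histogram) -> Histogram {
    for (x, y) in a.counts.iter_mut().zip(b.counts.iter()) {
        *x += *y;
    }
    a
}

/// Splits `data` into at most `num_threads` contiguous chunks and folds each
/// chunk on its own scoped thread, so each thread builds one accumulator.
fn fold_one_per_thread(data: &[u32], num_threads: usize) -> Histogram {
    let num_threads = num_threads.max(1);
    // Ceiling division, clamped to 1 because `chunks(0)` panics.
    let chunk_len = ((data.len() + num_threads - 1) / num_threads).max(1);
    thread::scope(|s| {
        // Spawn every worker before joining any of them.
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().fold(empty_datastructure(), put_in)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .reduce(combine_datastructures)
            // Only reached for empty input, where no worker was spawned.
            .unwrap_or_else(empty_datastructure)
    })
}
```

With 1000 items and num_threads = 4, this constructs four accumulators and performs three combines, regardless of how long each chunk takes to process.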

cuviper · July 12, 2022, 7:02pm

Rayon tries to be "adaptive" in its job splitting, in case the workload would not be balanced by a perfect per-thread split, but that does tend to make it more eager about splitting up the work. I haven't figured out a good way to let it keep using the same accumulator when the second part of a split doesn't get stolen by another thread, but that would be ideal...

I have rayon#857 trying to dial back the splitting a little, but that could use some real-world benchmarking to give it confidence. You can also try with_min_len to put your own limit on how small it's allowed to split.

kornel · July 12, 2022, 8:27pm

map_init was too slow for me, so I've used thread-local storage instead:

github.com/ImageOptim/libimagequant/blob/953f17794a1b51586552f18a650b5534e2903f28/src/kmeans.rs#L62-L67

let tls = ThreadLocal::new();
let total = hist.total_perceptual_weight;

// chunk size is a trade-off between parallelization and overhead
hist.items.par_chunks_mut(256).for_each(|batch| {
    let kmeans = tls.get_or(move || RefCell::new(Kmeans::new(len)));

github.com/ImageOptim/libimagequant/blob/260e963d542c6effeaa25d0d156f6f6bfe2a33e2/src/remap.rs#L38-L60

let tls = ThreadLocal::new();
let per_thread_buffers = move || -> Result<_, Error> {
    Ok(RefCell::new((Kmeans::new(palette_len)?, temp_buf(width)?, temp_buf(width)?, temp_buf(width)?)))
};

let tls_tmp1 = tls.get_or_try(per_thread_buffers)?;
let mut tls_tmp = tls_tmp1.borrow_mut();

let input_rows = image.px.rows_iter(&mut tls_tmp.1)?;
let (background, transparent_index) = image.background.as_mut().map(|background| {
    (Some(background), n.search(&f_pixel::default(), 0).0)
})
.filter(|&(_, transparent_index)| colors[usize::from(transparent_index)].a < MIN_OPAQUE_A)
.unwrap_or((None, 0));
let background = background.map(|bg| bg.px.rows_iter(&mut tls_tmp.1)).transpose()?;

if background.is_some() {
    tls_tmp.0.update_color(f_pixel::default(), 1., transparent_index);
}

drop(tls_tmp);
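
The excerpts above keep one piece of expensive state per thread and reuse it across many small batches. The same shape can be sketched with only the standard library. This is not libimagequant's code: Accumulator, process_in_batches, and the shared next_batch counter are hypothetical names, and plain scoped threads stand in for rayon's pool. Each worker builds its accumulator once, then keeps claiming fixed-size batches until none are left, so the batch size trades load balancing against per-batch overhead while the number of accumulators stays equal to the number of workers.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Hypothetical stand-in for expensive per-thread state.
struct Accumulator {
    sum: u64,
    batches_seen: usize,
}

impl Accumulator {
    fn new() -> Self {
        Accumulator { sum: 0, batches_seen: 0 }
    }

    fn absorb(&mut self, batch: &[u64]) {
        self.sum += batch.iter().sum::<u64>();
        self.batches_seen += 1;
    }
}

/// Runs `num_threads` workers. Each worker owns exactly one accumulator and
/// repeatedly claims the next unprocessed batch of `batch_len` items.
/// Returns the per-worker accumulators so the caller can combine them.
fn process_in_batches(items: &[u64], batch_len: usize, num_threads: usize) -> Vec<Accumulator> {
    let batch_len = batch_len.max(1);
    let num_threads = num_threads.max(1);
    // Index of the next batch nobody has claimed yet.
    let next_batch = AtomicUsize::new(0);
    thread::scope(|s| {
        let mut handles = Vec::with_capacity(num_threads);
        for _ in 0..num_threads {
            handles.push(s.spawn(|| {
                // Built once per worker, reused for every batch it claims.
                let mut acc = Accumulator::new();
                loop {
                    let start = next_batch.fetch_add(1, Ordering::Relaxed) * batch_len;
                    if start >= items.len() {
                        break;
                    }
                    let end = (start + batch_len).min(items.len());
                    acc.absorb(&items[start..end]);
                }
                acc
            }));
        }
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}
```

Which worker handles which batch is not deterministic, but every batch is processed exactly once, and the caller ends up with num_threads accumulators to combine however it likes.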

jacg · July 12, 2022, 9:40pm

I'm sure that lots more performance could be squeezed out, but simply using with_min_len(data.len() / num_threads) (where num_threads is currently picked by the user) already improves the performance enough for the bottleneck of the whole program to have moved elsewhere.

For now, this will do.

Thanks!

system · October 10, 2022, 9:41pm

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.
