Help with increasing performance in multi_cartesian_product like functions

Ups, my bad.
I managed to get time down to 3.3k ns (from 5k ns) with this:

pub fn sequence_idiomatic<T>(v: &[Vec<T>]) -> Vec<Vec<&T>> {
    if let Some((item, items)) = v.split_first() {
        let state = sequence_idiomatic(items);
        let mut result = Vec::with_capacity(item.len() * state.len());
        for x in item {
            for xs in &state {
                let mut v = Vec::with_capacity(xs.len() + 1);
                v.push(x);
                v.extend(xs);
                result.push(v)
            }
        }
        result
    } else {
        vec![vec![]]
    }
}
1 Like

Multithreading is incredibly difficult to get right and it only becomes more difficult the smaller the task to parallelize is.

The problem with splitting arrays like that for multithreading is, that CPU cores must have exclusive access to cache lines, not just the individual elements, to be able to properly parallelize the code execution. The most common cache line size is 64 bytes. On a 64 bit processor Vec has a size of 24 bytes. Arrays of Vecs have the problem, that the Vecs will span across 2 cache lines, occasionally. There's several solutions to that problem:

  • Only split the array in multiples of 192 bits (3 cache lines (3 * 64 bits) & 8 vectors (8 * 24 bits)).
  • Wrap the vector in a struct with an alignment of 32 bytes; the size of the wrapped vector will be 32 bits, i.e. 2 vectors will fit exactly into 1 cache line. Then you can split in multiples of 2 (vectors).
  • Wrap a pair of vectors in a struct with alignment of 64 bytes, the second vector wrapped in an Option; the struct will be 64 bytes large and fit exactly into a single cache line.

The first approach doesn't require any structural changes, but also enforces no correct handling on the user side. The slice also has to be properly aligned.
The second approach enforces vectors to not be split across multiple cache lines, but still doesn't enforce correct splitting of multiple vectors.
The last approach enforces, that you can only split between cache lines, but adds additional complexity to the implementation.

EDIT: Try

let left_len = out.len() / 8 * 4;

In summary here are my new timings:

test tests::multi_cartesian_product_test ... bench: 225,028,655 ns/iter (+/- 16,593,190)
test tests::sequence2_test               ... bench: 469,458,936 ns/iter (+/- 40,877,917)
test tests::sequence_by_column_test      ... bench: 291,496,908 ns/iter (+/- 18,464,653)
test tests::sequence_by_rc_test          ... bench: 237,987,335 ns/iter (+/- 17,133,112)
test tests::sequence_by_row_reyon_test   ... bench: 249,872,672 ns/iter (+/- 18,108,560)
test tests::sequence_by_row_test         ... bench: 212,651,089 ns/iter (+/- 11,376,876)
test tests::sequence_idiomatic_test      ... bench: 268,314,107 ns/iter (+/- 25,136,008)
test tests::sequence_test                ... bench: 422,507,075 ns/iter (+/- 26,002,879)

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.