Help with increasing performance in multi_cartesian_product-like functions

Hi, this is an extension of my earlier question asking for help creating a Rust version of the applicative sequence function. While I have been able to replicate the function (with some help), I've yet to get decent performance, and I would like to understand why.

I've created a little repo with some alternative functions and one that uses multi_cartesian_product from Itertools:

Here is the repo: repo
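For reference, the baseline presumably looks something like this (a minimal sketch assuming the repo's test wraps Itertools' multi_cartesian_product; the exact code is in the repo):

use itertools::Itertools;

// Sketch of the multi_cartesian_product baseline (assumed, not copied
// from the repo). Each inner Vec contributes one element per output
// row, yielding every combination.
pub fn sequence_mcp<T>(v: &[Vec<T>]) -> Vec<Vec<&T>> {
    v.iter()
        .map(|vi| vi.iter())
        .multi_cartesian_product()
        .collect()
}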

These are the performance results I get:

test tests::multi_cartesian_product_test ... bench:       9,441 ns/iter (+/- 933)
test tests::sequence2_test               ... bench:      20,458 ns/iter (+/- 1,710)
test tests::sequence_idiomatic_test      ... bench:      20,919 ns/iter (+/- 2,039)
test tests::sequence_test                ... bench:      20,754 ns/iter (+/- 2,069)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out

With an F# version I'm getting quite different results:

|        Method |     Mean |     Error |    StdDev | Ratio | RatioSD |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------- |---------:|----------:|----------:|------:|--------:|-------:|------:|------:|----------:|
|      Sequence | 3.768 us | 0.0433 us | 0.0384 us |  1.00 |    0.00 | 2.8992 |     - |     - |    8.9 KB |
| Sequence_Comp | 4.417 us | 0.0500 us | 0.0467 us |  1.17 |    0.02 | 2.3117 |     - |     - |   7.09 KB |

You can see the F# results are quite different, roughly 3x faster than multi_cartesian_product_test. I would like to understand why the Rust version is slower and what I can do to speed it up :slight_smile:


For completeness and easy viewing, here are the three functions:

pub fn sequence_idiomatic<T>(v: &[Vec<T>]) -> Vec<Vec<&T>> {
    if let Some((item, items)) = v.split_first() {
        // Build the product of the remaining vectors recursively, then
        // prepend each element of `item` to every partial row.
        let state = sequence_idiomatic(items);
        let mut result = vec![];
        for x in item {
            for xs in &state {
                let mut v = vec![x];
                v.extend(xs);
                result.push(v)
            }
        }
        result
    } else {
        vec![vec![]]
    }
}

pub fn sequence2<T>(v: &[Vec<T>]) -> Vec<Vec<&T>> {
    v.iter().rfold(vec![vec![]], |state, item| {
        item.iter()
            .flat_map(|x| {
                state.iter().map(move |xs| {
                    let mut v = vec![x];
                    v.extend(xs);
                    v
                })
            })
            .collect()
    })
}

pub fn sequence<T>(v: &[Vec<T>]) -> Vec<Vec<&T>> {
    v.iter().rfold(vec![vec![]], |state, item| {
        let mut result = vec![];
        for x in item {
            for xs in &state {
                let mut v = vec![x];
                v.extend(xs);
                result.push(v)
            }
        }
        result
    })
}

I came up with two different approaches, by_column and by_row, that have performance comparable to multi_cartesian_product_test. On the original tests I get:

running 6 tests
test tests::multi_cartesian_product_test ... bench:       2,597 ns/iter (+/- 125)
test tests::sequence2_test               ... bench:       4,797 ns/iter (+/- 321)
test tests::sequence_by_column_test      ... bench:       2,405 ns/iter (+/- 150)
test tests::sequence_by_row_test         ... bench:       2,793 ns/iter (+/- 78)
test tests::sequence_idiomatic_test      ... bench:       4,414 ns/iter (+/- 87)
test tests::sequence_test                ... bench:       4,305 ns/iter (+/- 118)

You can view the code here:

By row
pub fn sequence_by_row<T>(v: &[Vec<T>]) -> Vec<Vec<&T>> {
    let count: usize = v.iter().map(|v| v.len()).product();

    let mut res = Vec::with_capacity(count);
    for _ in 0..count {
        res.push(Vec::with_capacity(v.len()));
    }

    // One index per input vector, advanced like a mixed-radix odometer.
    let mut counter = vec![0; v.len()];
    for vec in &mut res {
        for (vi, idx) in v.iter().zip(&counter) {
            vec.push(&vi[*idx]);
        }
        incr_counter(&mut counter, v);
    }

    res
}

// Advance the mixed-radix counter: bump the last digit and carry to the
// left whenever a digit reaches the length of its vector.
#[inline]
fn incr_counter<T>(counter: &mut [usize], v: &[Vec<T>]) {
    for (max, cntr) in v.iter().map(|vi| vi.len()).zip(counter).rev() {
        *cntr += 1;
        if *cntr == max {
            *cntr = 0;
        } else {
            break;
        }
    }
}
By column
pub fn sequence_by_column<T>(v: &[Vec<T>]) -> Vec<Vec<&T>> {
    let count: usize = v.iter().map(|v| v.len()).product();

    let mut res = Vec::with_capacity(count);
    for _ in 0..count {
        res.push(Vec::with_capacity(v.len()));
    }

    // `per_row` is how many consecutive rows share the same element of
    // `vi`: the product of the lengths of the vectors after it.
    let mut per_row = count;
    for vi in v {
        per_row /= vi.len();
        let mut vidx = 0;
        let mut row_cnt = 0;

        for vec in &mut res {
            vec.push(&vi[vidx]);
            row_cnt += 1;
            if row_cnt == per_row {
                row_cnt = 0;
                vidx += 1;
                if vidx == vi.len() {
                    vidx = 0;
                }
            }
        }
    }

    res
}
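As a quick sanity check (my own toy example, not from the repo; it assumes sequence_by_row and sequence_by_column are in scope), both functions enumerate the index tuples in lexicographic order:

fn main() {
    let input = vec![vec![1, 2], vec![10, 20]];
    // Odometer order of the counter: [0,0], [0,1], [1,0], [1,1]
    assert_eq!(
        sequence_by_row(&input),
        vec![vec![&1, &10], vec![&1, &20], vec![&2, &10], vec![&2, &20]]
    );
    assert_eq!(sequence_by_row(&input), sequence_by_column(&input));
}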

The main danger in your solution is the excessive number of allocations. Every level of recursion in your sequence_idiomatic function allocates its own set of vectors, and in the end you throw away all but the last one. (For n vectors of length k, the recursion allocates on the order of k + k^2 + ... + k^n row vectors, of which only the final k^n survive.) Allocations are rather expensive, and I assume this is why your solutions are slower.

Additionally, this answers why F# is faster: the expression x::xs creates a singly linked list with x as the head and xs as the tail. Linked lists are often not very fast, because every item requires a separate allocation, but in this case each xs is reused many times, and by virtue of shared ownership F# avoids the large amount of copying that our solutions are forced to perform.

We can simulate this using Rc to share the tails of the linked lists. The results are below:

running 7 tests
test tests::multi_cartesian_product_test ... bench:       2,596 ns/iter (+/- 12)
test tests::sequence2_test               ... bench:       4,654 ns/iter (+/- 161)
test tests::sequence_by_column_test      ... bench:       2,518 ns/iter (+/- 19)
test tests::sequence_by_rc_test          ... bench:       1,659 ns/iter (+/- 70)
test tests::sequence_by_row_test         ... bench:       2,834 ns/iter (+/- 136)
test tests::sequence_idiomatic_test      ... bench:       3,091 ns/iter (+/- 224)
test tests::sequence_test                ... bench:       3,175 ns/iter (+/- 108)
Implementation of the Rc test
use std::rc::Rc;
pub enum RcChain<'a, T> {
    Cons {
        value: &'a T,
        tail: Rc<RcChain<'a, T>>,
    },
    Nil,
}

fn folder<'a, T>(state: Vec<Rc<RcChain<'a, T>>>, item: &'a Vec<T>) -> Vec<Rc<RcChain<'a, T>>> {
    let mut res = Vec::with_capacity(item.len() * state.len());
    for x in item {
        for xs in &state {
            res.push(Rc::new(RcChain::Cons {
                value: x,
                tail: Rc::clone(xs),
            }));
        }
    }
    res
}

pub fn sequence_by_rc<T>(v: &[Vec<T>]) -> Vec<Rc<RcChain<T>>> {
    v.iter().rfold(vec![Rc::new(RcChain::Nil)], folder)
}

// Convenience Debug impl that prints as list
use std::fmt;
impl<'a, T: fmt::Debug> fmt::Debug for RcChain<'a, T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let mut list = f.debug_list();

        let mut node = self;
        while let RcChain::Cons { value, tail } = node {
            list.entry(value);
            node = &*tail;
        }

        list.finish()
    }
}

By reusing the same memory for the shared parts of different lists, the Rc solution can vastly reduce the amount of memory it needs to fill up.
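If a consumer later needs plain vectors, the chains can be expanded on demand. Here is a minimal sketch (my addition, not part of the benchmarked code; expand is a hypothetical helper) that walks each chain from head to tail:

use std::rc::Rc;

// Expand the shared chains back into plain row vectors. Note that this
// redoes the copying the Rc version avoided, so it only pays off if you
// expand just a subset of the rows.
pub fn expand<'a, T>(rows: &[Rc<RcChain<'a, T>>]) -> Vec<Vec<&'a T>> {
    rows.iter()
        .map(|row| {
            let mut out = Vec::new();
            let mut node = &**row;
            while let RcChain::Cons { value, tail } = node {
                out.push(*value);
                node = &**tail;
            }
            out
        })
        .collect()
}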


Thank you for the detailed answer. sequence_idiomatic was an answer in my other question asking for an idiomatic Rust version. The others were just variations I tried to compose using vectors rather than linked lists; I know the two are vastly different, but I was trying to keep the implementation simple.

The function sequence_idiomatic is a quite reasonable implementation if you consider the constraint of using vectors and the specific algorithm used by the original F# code. My two suggestions use somewhat different algorithms from the F# one.

I get the following with your new functions:

test tests::sequence_test                ... bench:      20,610 ns/iter (+/- 2,303)
test tests::sequence_idiomatic_test      ... bench:      20,327 ns/iter (+/- 2,320)
test tests::sequence2_test               ... bench:      20,189 ns/iter (+/- 4,280)
test tests::sequence_by_rc_test          ... bench:      11,329 ns/iter (+/- 1,458)
test tests::multi_cartesian_product_test ... bench:       9,027 ns/iter (+/- 2,125)
test tests::sequence_by_row_test         ... bench:       9,021 ns/iter (+/- 1,337)
test tests::sequence_by_column_test      ... bench:       8,616 ns/iter (+/- 1,020)

The difference from the .NET versions is interesting.

I think it's quite interesting how the Rc implementation lands in a quite different position on your machine compared to my laptop, where it's clearly the best.

Yeah I saw that too :slight_smile:

I've looked at sequence_idiomatic. Your original code was ~5k ns; now it's ~2.3k ns:

pub fn sequence_idiomatic<T>(v: &[Vec<T>]) -> Vec<Vec<&T>> {
    if let Some((item, items)) = v.split_first() {
        let state = sequence_idiomatic(items);
        item.iter().map(|x| {
            state.iter().map(|xs| {
                let mut v = Vec::with_capacity(xs.len() + 1);
                v.push(x);
                v.extend(xs);
                v
            }).flat_map(|f| f.into_iter()).collect::<Vec<_>>()
        }).collect()
    } else {
        vec![vec![]]
    }
}

Edit: surprisingly, adding rayon and par_iter makes it a lot slower!

test tests::multi_cartesian_product_test ... bench:       8,918 ns/iter (+/- 1,187)
test tests::sequence2_test               ... bench:      20,094 ns/iter (+/- 2,436)
test tests::sequence_by_column_test      ... bench:       8,220 ns/iter (+/- 873)
test tests::sequence_by_rc_test          ... bench:      11,290 ns/iter (+/- 923)
test tests::sequence_by_row_test         ... bench:       8,791 ns/iter (+/- 752)
test tests::sequence_idiomatic_test      ... bench:       8,280 ns/iter (+/- 1,773)
test tests::sequence_test                ... bench:      20,610 ns/iter (+/- 1,954)

That's almost the fastest now @lwojtow

I actually had an extra toArray on the end of the F# version, so it's actually a little faster:

|        Method |     Mean |     Error |    StdDev |   Median | Ratio | RatioSD |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------- |---------:|----------:|----------:|---------:|------:|--------:|-------:|------:|------:|----------:|
|      Sequence | 3.431 us | 0.0686 us | 0.1401 us | 3.369 us |  1.00 |    0.00 | 2.7313 |     - |     - |   8.38 KB |
| Sequence_Comp | 4.004 us | 0.0789 us | 0.1361 us | 3.954 us |  1.16 |    0.06 | 2.1439 |     - |     - |   6.57 KB |

For anyone interested, here's a repo with the F# code.

This is indeed the fastest with larger Vec inputs by a large margin.

If you could use any data structure to model this, would it alter your solution?

If vecs are larger, then it may be a good idea to look at using rayon again.

I mean, the trick that makes the Rc solution faster is that it represents the data using a smaller data structure, reusing the same memory for duplicated parts of the matrix. This makes constructing the object faster, but it makes accessing it slower than a bare 2D matrix, since every row is now a linked list.

For example, you could imagine lazily constructing the solution as you access the parts of it you need, which would give you a constant-time constructor but even larger access times.
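A minimal sketch of that idea (my own illustration, not code from the thread; LazyProduct is a hypothetical type): decode the row index on demand as a mixed-radix number, so nothing is materialized up front.

// Hypothetical lazy view over the product: row `r` is computed on
// access by decoding `r` digit by digit, one digit per input vector.
pub struct LazyProduct<'a, T> {
    v: &'a [Vec<T>],
}

impl<'a, T> LazyProduct<'a, T> {
    // Constant-time "constructor": nothing is precomputed.
    pub fn new(v: &'a [Vec<T>]) -> Self {
        LazyProduct { v }
    }

    // Each access costs O(v.len()) arithmetic rather than a lookup into
    // precomputed storage: construction time traded for access time.
    pub fn row(&self, mut r: usize) -> Vec<&'a T> {
        let mut out = Vec::with_capacity(self.v.len());
        for vi in self.v.iter().rev() {
            out.push(&vi[r % vi.len()]);
            r /= vi.len();
        }
        out.reverse();
        out
    }
}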

As for multithreadedness, the by-row solution I posted can rather easily be extended to a multithreaded version.

Once you start using a bigger input, 10 vectors of 4 elements each, the different functions diverge more in performance:

test tests::multi_cartesian_product_test ... bench: 221,290,169 ns/iter (+/- 9,931,776)
test tests::sequence2_test               ... bench: 464,918,146 ns/iter (+/- 30,365,915)
test tests::sequence_by_column_test      ... bench: 289,621,849 ns/iter (+/- 18,620,580)
test tests::sequence_by_rc_test          ... bench: 238,063,918 ns/iter (+/- 20,960,759)
test tests::sequence_by_row_test         ... bench: 228,112,214 ns/iter (+/- 14,249,567)
test tests::sequence_idiomatic_test      ... bench:  22,008,643 ns/iter (+/- 4,881,970)
test tests::sequence_test                ... bench: 460,572,286 ns/iter (+/- 32,439,668)

That's because they're no longer doing the same thing!

The other implementations produce:
[[1, 10, 100], [1, 10, 200], [1, 10, 300], [1, 20, 100], [1, 20, 200], [1, 20, 300], [1, 30, 100], [1, 30, 200], [1, 30, 300], [2, 10, 100], [2, 10, 200], [2, 10, 300], [2, 20, 100], [2, 20, 200], [2, 20, 300], [2, 30, 100], [2, 30, 200], [2, 30, 300], [3, 10, 100], [3, 10, 200], [3, 10, 300], [3, 20, 100], [3, 20, 200], [3, 20, 300], [3, 30, 100], [3, 30, 200], [3, 30, 300]]

The current implementation of sequence_idiomatic produces:
[[1, 10, 100, 10, 200, 10, 300, 1, 20, 100, 20, 200, 20, 300, 1, 30, 100, 30, 200, 30, 300], [2, 10, 100, 10, 200, 10, 300, 2, 20, 100, 20, 200, 20, 300, 2, 30, 100, 30, 200, 30, 300], [3, 10, 100, 10, 200, 10, 300, 3, 20, 100, 20, 200, 20, 300, 3, 30, 100, 30, 200, 30, 300]]

There's no way you're going to beat the by-row solution by a factor of ten if you're constructing the same object as it is.


Oh wow, I'll revert the code. I never checked; I assumed the implementer hadn't altered the output :slight_smile:

I wrote a multi-threaded version of by-row, which you can find here:

Summary
#[inline]
fn set_counter_to<T>(counter: &mut [usize], v: &[Vec<T>], mut val: usize) {
    // Decode `val` as a mixed-radix number, one digit per input vector,
    // starting from the least significant (rightmost) digit.
    for (max, cntr) in v.iter().map(|vi| vi.len()).zip(counter).rev() {
        *cntr = val % max;
        val /= max;
    }
}

// Fill `out` with consecutive rows of the product starting at row `off`.
fn sequence_by_row_task<'a, 'b, T>(v: &'a [Vec<T>], out: &'b mut [Vec<&'a T>], off: usize) {
    let mut counter = vec![0; v.len()];
    set_counter_to(&mut counter, v, off);
    for vec in out {
        for (vi, idx) in v.iter().zip(&counter) {
            vec.push(&vi[*idx]);
        }
        incr_counter(&mut counter, v);
    }
}

// Recursively split `out` in half, dividing the task budget `num`
// between the halves, until each task fills its slice sequentially.
fn sequence_by_row_join<'a, 'b, T: Send + Sync>(v: &'a [Vec<T>], out: &'b mut [Vec<&'a T>], off: usize, num: usize) {
    if num == 1 {
        return sequence_by_row_task(v, out, off);
    }
    let left_len = out.len() / 2;
    let (left, right) = out.split_at_mut(left_len);
    rayon::join(
        || sequence_by_row_join(v, left, off, num / 2),
        || sequence_by_row_join(v, right, off + left_len, num - num / 2),
    );
}

pub fn sequence_by_row_rayon<T: Send + Sync>(v: &[Vec<T>]) -> Vec<Vec<&T>> {
    let count: usize = v.iter().map(|v| v.len()).product();

    let mut res = Vec::with_capacity(count);
    for _ in 0..count {
        res.push(Vec::with_capacity(v.len()));
    }

    sequence_by_row_join(v, &mut res, 0, rayon::current_num_threads());

    res
}

It is, however, still slower than the single-threaded one, even for rather large test cases on the order of 7 vectors of length 7. The code above has no interaction between the different threads; even the memory allocation is done before the work is handed out.

Based on this, I am going to conclude that the bottleneck is the speed at which data can be transferred between the CPU and RAM.
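For a rough sense of scale (my own back-of-the-envelope, assuming the 10-vectors-of-4 input and 64-bit references): the result has 4^10 = 1,048,576 rows of 10 references each, about 84 MB of pointer data alone, plus a Vec header and a separate heap allocation per row, so each benchmark iteration writes well over 100 MB. At that volume it is plausible that memory bandwidth, not compute, dominates.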
