Help with increasing performance in multi_cartesian_product-like functions

Hi, this is an extension of my earlier question asking for help creating a Rust version of the applicative sequence function. While I have been able to replicate the function (with some help), I've yet to get decent performance, and I would like to understand why.

I've created a little repo with some alternative functions and one that uses multi_cartesian_product from Itertools:

Here is the repo: repo
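For reference, the baseline presumably looks something like this (a minimal sketch assuming the repo's test wraps Itertools' multi_cartesian_product; the exact code is in the repo):

use itertools::Itertools;

// Sketch of the multi_cartesian_product baseline (assumed, not copied
// from the repo). Each inner Vec contributes one element per output
// row, yielding every combination.
pub fn sequence_mcp<T>(v: &[Vec<T>]) -> Vec<Vec<&T>> {
    v.iter()
        .map(|vi| vi.iter())
        .multi_cartesian_product()
        .collect()
}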

These are the performance results I get:

test tests::multi_cartesian_product_test ... bench:       9,441 ns/iter (+/- 933)
test tests::sequence2_test               ... bench:      20,458 ns/iter (+/- 1,710)
test tests::sequence_idiomatic_test      ... bench:      20,919 ns/iter (+/- 2,039)
test tests::sequence_test                ... bench:      20,754 ns/iter (+/- 2,069)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out

With an F# version I'm getting quite different results:

|        Method |     Mean |     Error |    StdDev | Ratio | RatioSD |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------- |---------:|----------:|----------:|------:|--------:|-------:|------:|------:|----------:|
|      Sequence | 3.768 us | 0.0433 us | 0.0384 us |  1.00 |    0.00 | 2.8992 |     - |     - |    8.9 KB |
| Sequence_Comp | 4.417 us | 0.0500 us | 0.0467 us |  1.17 |    0.02 | 2.3117 |     - |     - |   7.09 KB |

You can see the F# results are quite different, roughly 3x faster than multi_cartesian_product_test. I would like to understand why the Rust version is slower and what I can do to speed it up :slight_smile:


For completeness and easy viewing, here are the three functions:

pub fn sequence_idiomatic<T>(v: &[Vec<T>]) -> Vec<Vec<&T>> {
    if let Some((item, items)) = v.split_first() {
        // Build the product of the remaining vectors recursively, then
        // prepend each element of `item` to every partial row.
        let state = sequence_idiomatic(items);
        let mut result = vec![];
        for x in item {
            for xs in &state {
                let mut v = vec![x];
                v.extend(xs);
                result.push(v)
            }
        }
        result
    } else {
        vec![vec![]]
    }
}

pub fn sequence2<T>(v: &[Vec<T>]) -> Vec<Vec<&T>> {
    v.iter().rfold(vec![vec![]], |state, item| {
        item.iter()
            .flat_map(|x| {
                state.iter().map(move |xs| {
                    let mut v = vec![x];
                    v.extend(xs);
                    v
                })
            })
            .collect()
    })
}

pub fn sequence<T>(v: &[Vec<T>]) -> Vec<Vec<&T>> {
    v.iter().rfold(vec![vec![]], |state, item| {
        let mut result = vec![];
        for x in item {
            for xs in &state {
                let mut v = vec![x];
                v.extend(xs);
                result.push(v)
            }
        }
        result
    })
}

I came up with two different approaches, by_column and by_row, that have performance comparable to multi_cartesian_product_test. On the original tests I get:

running 6 tests
test tests::multi_cartesian_product_test ... bench:       2,597 ns/iter (+/- 125)
test tests::sequence2_test               ... bench:       4,797 ns/iter (+/- 321)
test tests::sequence_by_column_test      ... bench:       2,405 ns/iter (+/- 150)
test tests::sequence_by_row_test         ... bench:       2,793 ns/iter (+/- 78)
test tests::sequence_idiomatic_test      ... bench:       4,414 ns/iter (+/- 87)
test tests::sequence_test                ... bench:       4,305 ns/iter (+/- 118)

You can view the code here:

By row
pub fn sequence_by_row<T>(v: &[Vec<T>]) -> Vec<Vec<&T>> {
    let count: usize = v.iter().map(|v| v.len()).product();

    let mut res = Vec::with_capacity(count);
    for _ in 0..count {
        res.push(Vec::with_capacity(v.len()));
    }

    // One index per input vector, advanced like a mixed-radix odometer.
    let mut counter = vec![0; v.len()];
    for vec in &mut res {
        for (vi, idx) in v.iter().zip(&counter) {
            vec.push(&vi[*idx]);
        }
        incr_counter(&mut counter, v);
    }

    res
}

// Advance the mixed-radix counter: bump the last digit and carry to the
// left whenever a digit reaches the length of its vector.
#[inline]
fn incr_counter<T>(counter: &mut [usize], v: &[Vec<T>]) {
    for (max, cntr) in v.iter().map(|vi| vi.len()).zip(counter).rev() {
        *cntr += 1;
        if *cntr == max {
            *cntr = 0;
        } else {
            break;
        }
    }
}
By column
pub fn sequence_by_column<T>(v: &[Vec<T>]) -> Vec<Vec<&T>> {
    let count: usize = v.iter().map(|v| v.len()).product();

    let mut res = Vec::with_capacity(count);
    for _ in 0..count {
        res.push(Vec::with_capacity(v.len()));
    }

    // `per_row` is how many consecutive rows share the same element of
    // `vi`: the product of the lengths of the vectors after it.
    let mut per_row = count;
    for vi in v {
        per_row /= vi.len();
        let mut vidx = 0;
        let mut row_cnt = 0;

        for vec in &mut res {
            vec.push(&vi[vidx]);
            row_cnt += 1;
            if row_cnt == per_row {
                row_cnt = 0;
                vidx += 1;
                if vidx == vi.len() {
                    vidx = 0;
                }
            }
        }
    }

    res
}
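As a quick sanity check (my own toy example, not from the repo; it assumes sequence_by_row and sequence_by_column are in scope), both functions enumerate the index tuples in lexicographic order:

fn main() {
    let input = vec![vec![1, 2], vec![10, 20]];
    // Odometer order of the counter: [0,0], [0,1], [1,0], [1,1]
    assert_eq!(
        sequence_by_row(&input),
        vec![vec![&1, &10], vec![&1, &20], vec![&2, &10], vec![&2, &20]]
    );
    assert_eq!(sequence_by_row(&input), sequence_by_column(&input));
}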

The main danger in your solution is the excessive number of allocations. Every level of recursion in your sequence_idiomatic function allocates its own set of vectors, and in the end you throw away all but the last one. (For n vectors of length k, the recursion allocates on the order of k + k^2 + ... + k^n row vectors, of which only the final k^n survive.) Allocations are rather expensive, and I assume this is why your solutions are slower.

Additionally, this answers why F# is faster: the expression x::xs creates a singly linked list with x as the head and xs as the tail. Linked lists are often not very fast, because every item requires a separate allocation, but in this case each xs is reused many times, and by virtue of shared ownership F# avoids the large amount of copying that our solutions are forced to perform.

We can simulate this using Rc to share the tails of the linked lists. The results are below:

running 7 tests
test tests::multi_cartesian_product_test ... bench:       2,596 ns/iter (+/- 12)
test tests::sequence2_test               ... bench:       4,654 ns/iter (+/- 161)
test tests::sequence_by_column_test      ... bench:       2,518 ns/iter (+/- 19)
test tests::sequence_by_rc_test          ... bench:       1,659 ns/iter (+/- 70)
test tests::sequence_by_row_test         ... bench:       2,834 ns/iter (+/- 136)
test tests::sequence_idiomatic_test      ... bench:       3,091 ns/iter (+/- 224)
test tests::sequence_test                ... bench:       3,175 ns/iter (+/- 108)
Implementation of the Rc test
use std::rc::Rc;
pub enum RcChain<'a, T> {
    Cons {
        value: &'a T,
        tail: Rc<RcChain<'a, T>>,
    },
    Nil,
}

fn folder<'a, T>(state: Vec<Rc<RcChain<'a, T>>>, item: &'a Vec<T>) -> Vec<Rc<RcChain<'a, T>>> {
    let mut res = Vec::with_capacity(item.len() * state.len());
    for x in item {
        for xs in &state {
            res.push(Rc::new(RcChain::Cons {
                value: x,
                tail: Rc::clone(xs),
            }));
        }
    }
    res
}

pub fn sequence_by_rc<T>(v: &[Vec<T>]) -> Vec<Rc<RcChain<T>>> {
    v.iter().rfold(vec![Rc::new(RcChain::Nil)], folder)
}

// Convenience Debug impl that prints as list
use std::fmt;
impl<'a, T: fmt::Debug> fmt::Debug for RcChain<'a, T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let mut list = f.debug_list();

        let mut node = self;
        while let RcChain::Cons { value, tail } = node {
            list.entry(value);
            node = &*tail;
        }

        list.finish()
    }
}

By reusing the same memory for the shared parts of different lists, the Rc solution can vastly reduce the amount of memory it needs to fill up.
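If a consumer later needs plain vectors, the chains can be expanded on demand. Here is a minimal sketch (my addition, not part of the benchmarked code; expand is a hypothetical helper) that walks each chain from head to tail:

use std::rc::Rc;

// Expand the shared chains back into plain row vectors. Note that this
// redoes the copying the Rc version avoided, so it only pays off if you
// expand just a subset of the rows.
pub fn expand<'a, T>(rows: &[Rc<RcChain<'a, T>>]) -> Vec<Vec<&'a T>> {
    rows.iter()
        .map(|row| {
            let mut out = Vec::new();
            let mut node = &**row;
            while let RcChain::Cons { value, tail } = node {
                out.push(*value);
                node = &**tail;
            }
            out
        })
        .collect()
}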


Thank you for the detailed answer. sequence_idiomatic was an answer in my other question asking for an idiomatic Rust version. The others were just variations I tried to compose using vectors rather than linked lists; I know the two are vastly different, but I was trying to keep the implementation simple.

The function sequence_idiomatic is a quite reasonable implementation if you consider the constraint of using vectors and the specific algorithm used by the original F# code. My two suggestions use somewhat different algorithms from the F# one.

I get the following with your new functions:

test tests::sequence_test                ... bench:      20,610 ns/iter (+/- 2,303)
test tests::sequence_idiomatic_test      ... bench:      20,327 ns/iter (+/- 2,320)
test tests::sequence2_test               ... bench:      20,189 ns/iter (+/- 4,280)
test tests::sequence_by_rc_test          ... bench:      11,329 ns/iter (+/- 1,458)
test tests::multi_cartesian_product_test ... bench:       9,027 ns/iter (+/- 2,125)
test tests::sequence_by_row_test         ... bench:       9,021 ns/iter (+/- 1,337)
test tests::sequence_by_column_test      ... bench:       8,616 ns/iter (+/- 1,020)

The difference from the .NET versions is interesting.

I think it's quite interesting how the Rc implementation lands in a quite different position on your machine compared to my laptop, where it's clearly the best.

Yeah I saw that too :slight_smile:

I've looked at sequence_idiomatic. Your original code was ~5k ns; now it's ~2.3k ns:

pub fn sequence_idiomatic<T>(v: &[Vec<T>]) -> Vec<Vec<&T>> {
    if let Some((item, items)) = v.split_first() {
        let state = sequence_idiomatic(items);
        item.iter().map(|x| {
            state.iter().map(|xs| {
                let mut v = Vec::with_capacity(xs.len() + 1);
                v.push(x);
                v.extend(xs);
                v
            }).flat_map(|f| f.into_iter()).collect::<Vec<_>>()
        }).collect()
    } else {
        vec![vec![]]
    }
}

Edit: surprisingly, adding rayon and par_iter makes it a lot slower!

test tests::multi_cartesian_product_test ... bench:       8,918 ns/iter (+/- 1,187)
test tests::sequence2_test               ... bench:      20,094 ns/iter (+/- 2,436)
test tests::sequence_by_column_test      ... bench:       8,220 ns/iter (+/- 873)
test tests::sequence_by_rc_test          ... bench:      11,290 ns/iter (+/- 923)
test tests::sequence_by_row_test         ... bench:       8,791 ns/iter (+/- 752)
test tests::sequence_idiomatic_test      ... bench:       8,280 ns/iter (+/- 1,773)
test tests::sequence_test                ... bench:      20,610 ns/iter (+/- 1,954)

That's almost the fastest now @lwojtow

I actually had an extra toArray on the end of the F# version, so it's actually a little faster:

|        Method |     Mean |     Error |    StdDev |   Median | Ratio | RatioSD |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------- |---------:|----------:|----------:|---------:|------:|--------:|-------:|------:|------:|----------:|
|      Sequence | 3.431 us | 0.0686 us | 0.1401 us | 3.369 us |  1.00 |    0.00 | 2.7313 |     - |     - |   8.38 KB |
| Sequence_Comp | 4.004 us | 0.0789 us | 0.1361 us | 3.954 us |  1.16 |    0.06 | 2.1439 |     - |     - |   6.57 KB |

For anyone interested, here's a repo with the F# code.

This is indeed the fastest with larger Vec inputs by a large margin.

If you could use any data structure to model this, would it alter your solution?

If vecs are larger, then it may be a good idea to look at using rayon again.

I mean, the trick that makes the Rc solution faster is that it represents the data using a smaller data structure, reusing the same memory for duplicated parts of the matrix. This makes constructing the object faster, but it makes accessing it slower than a bare 2D matrix, since every row is now a linked list.

For example, you could imagine lazily constructing the solution as you access the parts of it you need, which would give you a constant-time constructor but even larger access times.
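A minimal sketch of that idea (my own illustration, not code from the thread; LazyProduct is a hypothetical type): decode the row index on demand as a mixed-radix number, so nothing is materialized up front.

// Hypothetical lazy view over the product: row `r` is computed on
// access by decoding `r` digit by digit, one digit per input vector.
pub struct LazyProduct<'a, T> {
    v: &'a [Vec<T>],
}

impl<'a, T> LazyProduct<'a, T> {
    // Constant-time "constructor": nothing is precomputed.
    pub fn new(v: &'a [Vec<T>]) -> Self {
        LazyProduct { v }
    }

    // Each access costs O(v.len()) arithmetic rather than a lookup into
    // precomputed storage: construction time traded for access time.
    pub fn row(&self, mut r: usize) -> Vec<&'a T> {
        let mut out = Vec::with_capacity(self.v.len());
        for vi in self.v.iter().rev() {
            out.push(&vi[r % vi.len()]);
            r /= vi.len();
        }
        out.reverse();
        out
    }
}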

As for multithreadedness, the by-row solution I posted can rather easily be extended to a multithreaded version.

Once you start using a bigger input, 10 vectors of 4 elements each, the different functions diverge more in performance:

test tests::multi_cartesian_product_test ... bench: 221,290,169 ns/iter (+/- 9,931,776)
test tests::sequence2_test               ... bench: 464,918,146 ns/iter (+/- 30,365,915)
test tests::sequence_by_column_test      ... bench: 289,621,849 ns/iter (+/- 18,620,580)
test tests::sequence_by_rc_test          ... bench: 238,063,918 ns/iter (+/- 20,960,759)
test tests::sequence_by_row_test         ... bench: 228,112,214 ns/iter (+/- 14,249,567)
test tests::sequence_idiomatic_test      ... bench:  22,008,643 ns/iter (+/- 4,881,970)
test tests::sequence_test                ... bench: 460,572,286 ns/iter (+/- 32,439,668)

That's because they're no longer doing the same thing!

The other implementations produce:
[[1, 10, 100], [1, 10, 200], [1, 10, 300], [1, 20, 100], [1, 20, 200], [1, 20, 300], [1, 30, 100], [1, 30, 200], [1, 30, 300], [2, 10, 100], [2, 10, 200], [2, 10, 300], [2, 20, 100], [2, 20, 200], [2, 20, 300], [2, 30, 100], [2, 30, 200], [2, 30, 300], [3, 10, 100], [3, 10, 200], [3, 10, 300], [3, 20, 100], [3, 20, 200], [3, 20, 300], [3, 30, 100], [3, 30, 200], [3, 30, 300]]

The current implementation of sequence_idiomatic produces:
[[1, 10, 100, 10, 200, 10, 300, 1, 20, 100, 20, 200, 20, 300, 1, 30, 100, 30, 200, 30, 300], [2, 10, 100, 10, 200, 10, 300, 2, 20, 100, 20, 200, 20, 300, 2, 30, 100, 30, 200, 30, 300], [3, 10, 100, 10, 200, 10, 300, 3, 20, 100, 20, 200, 20, 300, 3, 30, 100, 30, 200, 30, 300]]

There's no way you're going to beat the by-row solution by a factor of ten if you're constructing the same object as it is.


Oh wow, I'll revert the code. I never checked; I assumed the implementer hadn't altered the output :slight_smile:

I wrote a multi-threaded version of by-row, which you can find here:

Summary
#[inline]
fn set_counter_to<T>(counter: &mut [usize], v: &[Vec<T>], mut val: usize) {
    // Decode `val` as a mixed-radix number, one digit per input vector,
    // starting from the least significant (rightmost) digit.
    for (max, cntr) in v.iter().map(|vi| vi.len()).zip(counter).rev() {
        *cntr = val % max;
        val /= max;
    }
}

// Fill `out` with consecutive rows of the product starting at row `off`.
fn sequence_by_row_task<'a, 'b, T>(v: &'a [Vec<T>], out: &'b mut [Vec<&'a T>], off: usize) {
    let mut counter = vec![0; v.len()];
    set_counter_to(&mut counter, v, off);
    for vec in out {
        for (vi, idx) in v.iter().zip(&counter) {
            vec.push(&vi[*idx]);
        }
        incr_counter(&mut counter, v);
    }
}

// Recursively split `out` in half, dividing the task budget `num`
// between the halves, until each task fills its slice sequentially.
fn sequence_by_row_join<'a, 'b, T: Send + Sync>(v: &'a [Vec<T>], out: &'b mut [Vec<&'a T>], off: usize, num: usize) {
    if num == 1 {
        return sequence_by_row_task(v, out, off);
    }
    let left_len = out.len() / 2;
    let (left, right) = out.split_at_mut(left_len);
    rayon::join(
        || sequence_by_row_join(v, left, off, num / 2),
        || sequence_by_row_join(v, right, off + left_len, num - num / 2),
    );
}

pub fn sequence_by_row_rayon<T: Send + Sync>(v: &[Vec<T>]) -> Vec<Vec<&T>> {
    let count: usize = v.iter().map(|v| v.len()).product();

    let mut res = Vec::with_capacity(count);
    for _ in 0..count {
        res.push(Vec::with_capacity(v.len()));
    }

    sequence_by_row_join(v, &mut res, 0, rayon::current_num_threads());

    res
}

It is, however, still slower than the single-threaded one, even for rather large test cases on the order of 7 vectors of length 7. The code above has no interaction between the different threads; even the memory allocation is done before the work is handed out.

Based on this, I am going to conclude that the bottleneck is the speed at which data can be transferred between the CPU and RAM.
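For a rough sense of scale (my own back-of-the-envelope, assuming the 10-vectors-of-4 input and 64-bit references): the result has 4^10 = 1,048,576 rows of 10 references each, about 84 MB of pointer data alone, plus a Vec header and a separate heap allocation per row, so each benchmark iteration writes well over 100 MB. At that volume it is plausible that memory bandwidth, not compute, dominates.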
