Performance issue with C-array-like computation (2 times worse than naive Java)

@matklad That’s amazing! Thank you so much!
On my computer (Intel® Core™ i7-6820HQ CPU @ 2.70GHz):

Java code: 21s (to answer your question, it pre-allocates a double[][])

Rust vectorized (with -C target-cpu=native): 32s

Rust jump (d01, d10, d11): 16s
Rust jump (d01, d11, d10): 14s
Rust jump (d10, d01, d11): 17s
Rust jump (d10, d11, d01): 14s
Rust jump (d11, d01, d10): 15s
Rust jump (d11, d10, d01): 14s

Of course, none of the Rust jump versions is inherently better than another, as this is data dependent.
But overall, they win over Java by around 20%, so pretty good!

It’s a bit weird to see “better” code run slower… this is actually scary! I mean, the jump version only takes 50% of the time of the vectorized one! With perf stat we can clearly see that the number of instructions per cycle drops, even though there is basically no branch misprediction.

Is that an issue that should be reported to the Rust project?

I wouldn’t call it vectorized, it just happens to use some vector instructions. I think one should call it just branchless. Anyway, there’s another way you can rewrite the min3 – basically by using two min2 (godbolt):

pub fn min3_two_steps(x: f64, y: f64, z: f64) -> f64 {
    // two chained two-way mins; compiles down to just two minsd instructions
    let xy = if x < y { x } else { y };
    if z < xy { z } else { xy }
}

This version generates the cleanest assembly – it’s basically just two minsd – but I think I’ve tested it and the program slowed down. It’s quite crazy! (Actually, maybe not so crazy: this version has two min instructions that must execute serially, while your versions can do the comparisons in parallel and then just use a few simple instructions to get the final result.)


I can definitely recommend rayon. Basically, the first step is to change the loop to use iterators – something.iter().map(...).sum(). (Maybe even parallelize the nested loop too by adding a flat_map inside the chain.) After you’ve rewritten the loop in iterator style, plugging in rayon should be just a matter of changing iter to par_iter. It’s also worth setting an appropriate weight, because by default rayon assumes that a single operation is cheap.
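For illustration, here is a minimal sketch of that iterator-style rewrite (the dtw function, total_cost and the surrounding types are hypothetical stand-ins, not the actual code from this thread):

use rayon::prelude::*;

// Hypothetical stand-in for the real DTW distance between two series.
fn dtw(a: &[f64], b: &[f64]) -> f64 {
    // placeholder body; the actual dynamic-programming computation goes here
    (a[0] - b[0]).abs()
}

// Sequential version: others.iter().map(|other| dtw(query, other)).sum()
// Parallel version: the same chain with par_iter instead of iter.
fn total_cost(query: &[f64], others: &[Vec<f64>]) -> f64 {
    others
        .par_iter()
        .map(|other| dtw(query, other))
        .sum()
}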

3 Likes

@krdln Thank you!
I’ll try rayon, being able to parallelize easily is crucial.
Trying with min3_two_steps: going back to 30s…

I’m happy I gave rust a shot, and I’ll probably keep using it, at least for my own testing!

That’s a much better term indeed, thanks!

I think we can say it’s branchless but it also carries a lot more data dependencies – the vector instructions all seem to have inputs that were computed right before, i.e. quite a bit of serial execution (and IPC suffers). The branchy code has more total instructions and branches, but it ends up being well predicted and has some independent data paths.

So one can say the vector version will have better worst-case and average-case behaviour when fed unpredictable inputs, but it obviously loses the best case when inputs are predictable.

This is maybe where PGO could help, although I don’t know if LLVM will generate anything different for the highly predicted case. And of course PGO will require feeding representative input, which isn’t always feasible.

I’ve looked a little bit more into this curious min3 behaviour and I think I’ve realized what’s going on!

The first hypothesis was that one of the branches was really being taken significantly more often than the others, but it seems that d11, d10 and d01 “win” about the same number of times. The pattern might still be predictable, although I haven’t tested it.

So what’s happening is basically what @vitalyd is saying about independent data paths – notice that when d11 or d10 wins, the whole code is essentially parallel! The serial dependency occurs only when d01 wins (about one third of the cases). So I pushed the idea further and tried to separate the parallel path from the serial path (full code):

let d = {
    let d11 = prev[idx_col - 1];
    let d01 = curr[idx_col - 1];
    let d10 = prev[idx_col];
    // the two values from the previous row can be compared without
    // waiting for d01, the value just written in the current row
    let mmm = min2(d11, d10);
    if d01 < mmm {
        d01 + self.square_dist(idx_line, idx_col, other)
    } else {
        mmm + self.square_dist(idx_line, idx_col, other)
    }
};

So this is basically the fastest version on my CPU! (The min2 is implemented as if x < y { x } else { y } and compiles to a single minsd instruction.)

Also, an interesting observation: when instead of min3(d11, d01, d10) you do min2(min2(d11, d10), d01), you also get a nice speedup (note the slight change in argument order). I presume that’s because you don’t have to wait on d01 when computing the first min2.
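Spelled out, with min2 exactly as described above (the min3_prev_first name is just mine, for illustration):

#[inline]
fn min2(x: f64, y: f64) -> f64 {
    // compiles to a single minsd
    if x < y { x } else { y }
}

// Takes the min of the two previous-row values first, so the CPU doesn't have
// to wait for d01, which the current row has only just produced.
fn min3_prev_first(d11: f64, d10: f64, d01: f64) -> f64 {
    min2(min2(d11, d10), d01)
}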

The times I get on my machine – with native (and without):

  • min3(d11, d01, d10) (branchless) – 21s (35s),
  • min3(d11, d10, d01) (branchless) – 18s (27s),
  • min2(min2(d11, d10), d01) (also branchless) – 15s (25s),
  • min3(d11, d01, d10) (match version) – 14s (24s),
  • this post’s version – 9s (30s‽).

Anyway, bounds checks in this and @matklad’s version are not elided, but the CPU doesn’t seem to care. I’ve tried to get rid of them by rewriting the loop to use iterators; the bounds checks were gone, but with no speedup (or with a speedup < 10%). (That doesn’t mean it’s not worth eliminating bounds checks in general – it’s really important if you expect the compiler to vectorize the code.)
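For reference, a rough sketch of the kind of iterator rewrite meant here (my own illustration with assumed names, and a precomputed costs slice standing in for square_dist – not the exact code from the thread):

// Fills curr[1..] from prev without any indexing in the hot loop,
// so there are no bounds checks left to elide.
fn fill_row(prev: &[f64], curr: &mut [f64], costs: &[f64]) {
    let mut left = curr[0]; // curr[0] is assumed to be initialized by the caller
    for ((cell, win), cost) in curr[1..].iter_mut().zip(prev.windows(2)).zip(&costs[1..]) {
        // win = [prev[j - 1], prev[j]], left = curr[j - 1]
        *cell = win[0].min(win[1]).min(left) + cost;
        left = *cell;
    }
}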

1 Like

In my current code, using the ‘jmp’ version min3_jmp(d01, d11, d10), I achieve 12.5s (note: removing bounds checking by using get_unchecked actually allowed me to come down from 16s). When testing your last version, it goes up to 17 seconds. So I guess we are reaching CPU variability…

How can you tell that, in the branchy version, the code is parallel if d11 or d10 wins?

1 Like

Yeah, or probably llvm doing weird things. I’ve updated my times to include the non-native compilation and my last version went from 9s to 30s! (what’s more fun, the generated assembly is almost identical!)

I’m also curious what time you get on min2(min2(d11, d10), d01) (note the order)?

If “d01 wins” has its own branch, and the CPU is predicting the “d11 or d10” branch to be taken, the code the CPU sees is basically:

next[i] = min(prev[i-1], prev[i]) + self.square_dist(…);
next[i+1] = min(prev[i], prev[i+1]) + self.square_dist(…);
...

These lines can be computed in parallel, if the CPU has enough lookahead and spare FPU units.

If you’re talking about the branchy vs. branchless performance, I guess not. Maybe it’s worth reporting to LLVM, but it’s a really specific case. But the nightly regression is definitely worth reporting! You should minimize the test case first, though, I guess.

1 Like

I don’t know, if anyone mentioned it, but the password for the dataset zip is “attempttoclassify”.

Yes, you are absolutely correct! In my analysis I assumed that d11 wins significantly more often, but I had not actually measured it (because global variables are hard in Rust), and of course that assumption was wrong! d11 wins only about 15% more often than d01 or d10.

And indeed, taking min2 of two prev numbers and then looking at curr (min2(min2(d11, d10), d01)) brings the time for the benchmark from my previous post from 5400 ms to 4100 ms! The bad order of min2(min2(d11, d01), d10) takes 6100 ms.

I think the new (>0.4.3) versions of rayon do better without the call to weight.

The trick with min impressed me greatly so I’ve even blogged about it: https://matklad.github.io/2017/03/12/min-of-three.html :slight_smile:

10 Likes

It was a good read!

Imagine for a second that instead of vminsd %xmm0,%xmm1,%xmm0 instruction in the preceding assembly there is just vmovsd %xmm1,%xmm0. That is, we don’t use xmm0 from the previous iteration at all!

I’m really curious whether the CPU really performs such desugaring and speculative execution or… is it just that the CPU can run the vminsd (%rax,%rsi,8),%xmm1,%xmm1 from the next iteration simultaneously with vminsd %xmm0,%xmm1,%xmm0 from the current iteration?

A few posts above I tried to force the compiler to compile the second min2 to branchy code, thus forcing the CPU to do speculation, and the code got even faster (unfortunately only on my machine).

It’s interesting that we can make the computation truly parallel if we update the cells diagonally:

Whoa! It’s so simple and elegant that I feel stupid for not thinking about it before! It should also vectorize very well. I wouldn’t be surprised if it provided a 4× or even 8× perf improvement.

I’ve implemented diagonal update (Code on Rust playground) and the resulting code is not faster, even with explicit rayon parallelism.

Maybe I’m missing something obvious and it could be much faster, but my mind is completely exhausted by a billion off-by-one errors in this diagonal scheme :smile:

Though there is a possible explanation for why the diagonal version is slower:

If you fill the matrix row by row, the inner loop processes only three arrays: the current and previous rows of the matrix and one of the sequences.

With diagonals, you need three diagonals and both sequences.
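For reference, the anti-diagonal traversal itself looks roughly like this (a full-matrix sketch of my own for clarity – the playground code uses rolling diagonal buffers instead – with cost(i, j) standing in for square_dist):

// Every cell on the anti-diagonal i + j = diag depends only on diagonals
// diag - 1 and diag - 2, so the iterations of the inner loop are independent.
fn fill_by_diagonals(n: usize, m: usize, cost: impl Fn(usize, usize) -> f64) -> f64 {
    let inf = f64::INFINITY;
    let mut dp = vec![vec![inf; m]; n];
    for diag in 0..(n + m - 1) {
        let i_lo = diag.saturating_sub(m - 1);
        let i_hi = diag.min(n - 1);
        for i in i_lo..=i_hi {
            let j = diag - i;
            let up_left = if i > 0 && j > 0 { dp[i - 1][j - 1] } else { inf };
            let up = if i > 0 { dp[i - 1][j] } else { inf };
            let left = if j > 0 { dp[i][j - 1] } else { inf };
            let best = if i == 0 && j == 0 { 0.0 } else { up_left.min(up).min(left) };
            dp[i][j] = best + cost(i, j);
        }
    }
    dp[n - 1][m - 1]
}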

What’s the matrix size here? I think that if it’s about 270, as in the original example, then it’s a bit too small for rayon. Maybe not, though?

I’ve tried to diagonalize too; here’s a gist. It’s quite fast for me (9s vs 15s for the non-diagonalized version), but I don’t know why it doesn’t vectorize.

@matklad I’ve managed to vectorize it! Check out the gist, it should be updated now. It seems that llvm gets confused by the rotation of diagonal buffers – yet another bug to file. Adding another buffer seemed to help it. Now I’ve got it down to 3 seconds, so more than a 4× improvement over the non-vectorized one!

Edit: Moving the iteration to a separate function instead of creating another buffer also helps, improving from 3.6s to 2.5s.

1 Like

@matklad, @krdln Hi guys, I’m glad that you are really interested in this! The blog is a good read, and if you want to go further on the parallel part, you can take a look at the article “Parallelizing Dynamic Time Warping Algorithm Using Prefix Computations on GPU” (I don’t want to do article advertisement again…). The diagonal approach is somewhat known in the dynamic programming field, but it does not allow the GPU to be occupied efficiently. As the article is behind a paywall, here is the idea: you can actually do most of the work in parallel while using only the previous row. Then, to take care of the “line dependency”, you can use prefix computations, also on the GPU: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
Of course this is only worth it on big time series. Even a length of 270 may be too small to show a real benefit… (not tested yet).

It’s impressive to see how you guys went down from around 50s to around 3s!

Thanks for linking the article!

Hah! I had a feeling that you could use prefix sum for that although for some reason I wasn’t 100% convinced. Now I know :slight_smile:

I’m not sure I understand this. I assume the algorithm described in the article has to do a kernel call for each row. So basically the main problem with diagonals is that you’re running the kernel about twice as many times compared to once per row (for short diagonals most of the overhead comes from launching the kernel; the overall amount of computation is the same). But then they try to overcome that issue by doing a prefix sum for each row, which costs, let’s assume, 2-5 additional kernel calls? I’d like to see a comparison between the paper’s approach and using diagonals on the GPU.

Overall, this problem seems to be memory bound – the GPUs are doing just a few instructions per memory read/write (in the computation itself; the prefix sum is slightly more compute-heavy). I feel that to really squeeze out the GPU performance one should perform the algorithm in a diagonal style but process several rows at a time – that would mean one GPU block would have to perform additional calculations to avoid synchronization, but it would greatly reduce the memory traffic.

I’m not a GPU expert, but if I understand correctly, the problem with the diagonal is that at the first step you only have one computation to do, then two, etc. So you spend a lot of time before being able to “saturate” the GPU. The paper does not compare against a diagonal calculation though… I’ll have to check that, but unfortunately I must focus on other matters right now…

I still have 2 quick Rust things: I switched to an implementation with only two rows instead of the whole matrix. I’m using a toolbox that provides the buffers:

struct TimeSerieWS {
    /// DTW working space: we need two slices
    dtw_slice1: Vec<f64>,
    dtw_slice2: Vec<f64>,
}

impl TimeSerieWS {
    fn new(ts: &TimeSerie) -> TimeSerieWS {
        let dtw1 = vec![0f64; ts.len()];
        let dtw2 = vec![0f64; ts.len()];
        TimeSerieWS { dtw_slice1: dtw1, dtw_slice2: dtw2 }
    }
}

And in my DTW code I have:

fn compute_dtw<'a>(&self, other: &Self, ws: &'a mut TimeSerieWS) -> f64 {
    let mut curr: &mut [f64] = &mut ws.dtw_slice1;
    let mut prev: &mut [f64] = &mut ws.dtw_slice2;
    // ...

  • If I don’t put the explicit &mut [f64], the code takes 17s instead of 12s!! The code uses unsafe and get_unchecked methods. Is that normal?
  • I wanted to put slices directly into my working space struct, but I can’t achieve it:
struct TimeSerieWS<'a> {
    dtw_slice1: &'a mut [f64],
    dtw_slice2: &'a mut [f64],
}

fn new(ts: &TimeSerie) -> TimeSerieWS<'a> {
    let dtw1 = vec![0f64; ts.len()];
    let dtw2 = vec![0f64; ts.len()];
    TimeSerieWS { dtw_slice1: &mut dtw1, dtw_slice2: &mut dtw2 }
}

That surely won’t work; I understand that I should rather ‘move’ things into the structure than pass a borrow, but I’m struggling to find out how to do that…
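One way to express that ‘move’ (a sketch of my own, not from the thread, reusing the field names above but with a hypothetical new(len) constructor): let the struct keep owning the Vecs, as in the first version, and hand out plain slices only for the duration of a call:

struct TimeSerieWS {
    dtw_slice1: Vec<f64>,
    dtw_slice2: Vec<f64>,
}

impl TimeSerieWS {
    fn new(len: usize) -> TimeSerieWS {
        TimeSerieWS {
            // the freshly created Vecs are moved into the struct here
            dtw_slice1: vec![0f64; len],
            dtw_slice2: vec![0f64; len],
        }
    }

    // Borrow both buffers as &mut [f64]; the borrows end when the caller is done with them.
    fn buffers(&mut self) -> (&mut [f64], &mut [f64]) {
        (&mut self.dtw_slice1[..], &mut self.dtw_slice2[..])
    }
}

A call site would then look like let (curr, prev) = ws.buffers();.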