Performance issue with C-array like computation (2 times worst than naive java)

I've looked a little bit more into this curious min3 behaviour and I think I've realized what's going on!

The first hypothesis was that one of the branches was really being taken significantly more often than the others, but it seems that d11, d10 and d01 “win” about the same number of times. The pattern might be predictable though, although I haven't tested it.

So what's happening is basically what @vitalyd is saying about independent data paths – notice that when d11 or d10 wins, the whole code is basically parallel! The serial dependency occurs only when d01 wins (about one third of cases). So I pushed the idea further and tried to separate parallel path and serial path (full code):

                let d = {
                    let d11 = prev[idx_col - 1];
                    let d01 = curr[idx_col - 1];
                    let d10 = prev[idx_col];
                    let mmm = min2(d11, d10);
                    if d01 < mmm {
                        d01 + self.square_dist(idx_line, idx_col, other)
                    } else {
                        mmm + self.square_dist(idx_line, idx_col, other)
                    }
                };

So this is the basically the fastest version on my CPU! (the min2 is implemented as if x < y { x } else { y } and compiles to single minsd instruction)

Also, quite interesting observation, when instead of min3(d11, d01, d10) you do min2(min2(d11, d10), d01), you also get some nice speedup (note the slight change in argument order). I presume that's because you don't have to wait on d01 when computing first min2.

The times I get on my machine – with native (and without):

  • min3(d11, d01, d10) (branchless) – 21s (35s),
  • min3(d11, d10, d01) (branchless) – 18s (27s)
  • min2(min2(d11, d10), d01) (also branchless) – 15s (25s),
  • min3(d11, d01, d10) (match version) – 14s (24s),
  • this posts's version – 9s (30s‽)

Anyway, bound checks in this and @matklad's version are not elided, but the CPU doesn't seem to care. I've tried to get rid of the bound checks by rewriting the loop to use iterators, the bound-checks were gone, but with no speedup (or with speedup < 10%). (That doesn't mean it's not worth to eliminate bound-checks in general – it's really important if you expect the compiler to vectorize the code)

1 Like