I've looked a little bit more into this curious min3
behaviour and I think I've realized what's going on!
The first hypothesis was that one of the branches was really being taken significantly more often than the others, but it seems that d11
, d10
and d01
“win” about the same number of times. The pattern might be predictable though, although I haven't tested it.
So what's happening is basically what @vitalyd is saying about independent data paths – notice that when d11
or d10
wins, the whole code is basically parallel! The serial dependency occurs only when d01
wins (about one third of cases). So I pushed the idea further and tried to separate parallel path and serial path (full code):
let d = {
let d11 = prev[idx_col - 1];
let d01 = curr[idx_col - 1];
let d10 = prev[idx_col];
let mmm = min2(d11, d10);
if d01 < mmm {
d01 + self.square_dist(idx_line, idx_col, other)
} else {
mmm + self.square_dist(idx_line, idx_col, other)
}
};
So this is the basically the fastest version on my CPU! (the min2
is implemented as if x < y { x } else { y }
and compiles to single minsd
instruction)
Also, quite interesting observation, when instead of min3(d11, d01, d10)
you do min2(min2(d11, d10), d01)
, you also get some nice speedup (note the slight change in argument order). I presume that's because you don't have to wait on d01
when computing first min2
.
The times I get on my machine – with native
(and without):
-
min3(d11, d01, d10)
(branchless) – 21s (35s), -
min3(d11, d10, d01)
(branchless) – 18s (27s) -
min2(min2(d11, d10), d01)
(also branchless) – 15s (25s), -
min3(d11, d01, d10)
(match
version) – 14s (24s), - this posts's version – 9s (30s‽)
Anyway, bound checks in this and @matklad's version are not elided, but the CPU doesn't seem to care. I've tried to get rid of the bound checks by rewriting the loop to use iterators, the bound-checks were gone, but with no speedup (or with speedup < 10%). (That doesn't mean it's not worth to eliminate bound-checks in general – it's really important if you expect the compiler to vectorize the code)