Performance issue with C-array-like computation (2 times worse than naive Java)

@matklad That’s amazing! Thank you so much!
On my computer (Intel® Core™ i7-6820HQ CPU @ 2.70GHz):

Java code: 21s (to answer your question, it pre-allocates a double[][])

Rust vectorized (with -C target-cpu=native): 32s

Rust jump (d01, d10, d11): 16s
Rust jump (d01, d11, d10): 14s
Rust jump (d10, d01, d11): 17s
Rust jump (d10, d11, d01): 14s
Rust jump (d11, d01, d10): 15s
Rust jump (d11, d10, d01): 14s

Of course, none of the Rust jump versions is inherently better than another, as this is data dependent.
But overall, they win over Java by around 20%, so pretty good!

It’s a bit weird to see “better” code run slower… this is actually scary! I mean, the jump version only takes 50% of the time of the vectorized one! With perf stat we can clearly see that the number of instructions per cycle drops, even though there is basically no branch misprediction.

Is that an issue that should be reported to the Rust project?

I wouldn’t call it vectorized, it just happens to use some vector instructions. I think one should call it just branchless. Anyway, there’s another way you can rewrite the min3 – basically by using two min2 (godbolt):

pub fn min3_two_steps(x: f64, y: f64, z: f64) -> f64 {
    // two chained two-way mins; compiles down to just two minsd instructions
    let xy = if x < y { x } else { y };
    if z < xy { z } else { xy }
}

This version generates the cleanest assembly – it’s basically just two minsd – but I think I’ve tested it and the program slowed down. It’s quite crazy! (Actually, maybe not so crazy: this version has two min instructions that must execute serially, while your versions can do the comparisons in parallel and then just use a few simple instructions to get the final result.)


I can definitely recommend rayon. Basically, the first step is to change the loop to use iterators – something.iter().map(...).sum(). (Maybe even parallelize the nested loop too by adding a flat_map inside the chain.) After you’ve rewritten the loop in iterator style, plugging in rayon should be just a matter of changing iter to par_iter. It’s also worth setting an appropriate weight, because by default rayon assumes that a single operation is cheap.
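For illustration, here is a minimal sketch of that iterator-style rewrite (the dtw function, total_cost and the surrounding types are hypothetical stand-ins, not the actual code from this thread):

use rayon::prelude::*;

// Hypothetical stand-in for the real DTW distance between two series.
fn dtw(a: &[f64], b: &[f64]) -> f64 {
    // placeholder body; the actual dynamic-programming computation goes here
    (a[0] - b[0]).abs()
}

// Sequential version: others.iter().map(|other| dtw(query, other)).sum()
// Parallel version: the same chain with par_iter instead of iter.
fn total_cost(query: &[f64], others: &[Vec<f64>]) -> f64 {
    others
        .par_iter()
        .map(|other| dtw(query, other))
        .sum()
}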

3 Likes

@krdln Thank you!
I’ll try rayon, being able to parallelize easily is crucial.
Trying with min3_two_steps: going back to 30s…

I’m happy I gave rust a shot, and I’ll probably keep using it, at least for my own testing!

That’s a much better term indeed, thanks!

I think we can say it’s branchless but it also carries a lot more data dependencies – the vector instructions all seem to have inputs that were computed right before, i.e. quite a bit of serial execution (and IPC suffers). The branchy code has more total instructions and branches, but it ends up being well predicted and has some independent data paths.

So one can say the vector version will have better worst-case and average-case behaviour when fed unpredictable inputs, but it obviously loses the best case when inputs are predictable.

This is maybe where PGO could help, although I don’t know if LLVM will generate anything different for the highly predicted case. And of course PGO will require feeding representative input, which isn’t always feasible.

I’ve looked a little bit more into this curious min3 behaviour and I think I’ve realized what’s going on!

The first hypothesis was that one of the branches was really being taken significantly more often than the others, but it seems that d11, d10 and d01 “win” about the same number of times. The pattern might still be predictable, although I haven’t tested it.

So what’s happening is basically what @vitalyd is saying about independent data paths – notice that when d11 or d10 wins, the whole code is essentially parallel! The serial dependency occurs only when d01 wins (about one third of the cases). So I pushed the idea further and tried to separate the parallel path from the serial path (full code):

let d = {
    let d11 = prev[idx_col - 1];
    let d01 = curr[idx_col - 1];
    let d10 = prev[idx_col];
    // the two values from the previous row can be compared without
    // waiting for d01, the value just written in the current row
    let mmm = min2(d11, d10);
    if d01 < mmm {
        d01 + self.square_dist(idx_line, idx_col, other)
    } else {
        mmm + self.square_dist(idx_line, idx_col, other)
    }
};

So this is basically the fastest version on my CPU! (The min2 is implemented as if x < y { x } else { y } and compiles to a single minsd instruction.)

Also, an interesting observation: when instead of min3(d11, d01, d10) you do min2(min2(d11, d10), d01), you also get a nice speedup (note the slight change in argument order). I presume that’s because you don’t have to wait on d01 when computing the first min2.
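Spelled out, with min2 exactly as described above (the min3_prev_first name is just mine, for illustration):

#[inline]
fn min2(x: f64, y: f64) -> f64 {
    // compiles to a single minsd
    if x < y { x } else { y }
}

// Takes the min of the two previous-row values first, so the CPU doesn't have
// to wait for d01, which the current row has only just produced.
fn min3_prev_first(d11: f64, d10: f64, d01: f64) -> f64 {
    min2(min2(d11, d10), d01)
}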

The times I get on my machine – with native (and without):

  • min3(d11, d01, d10) (branchless) – 21s (35s),
  • min3(d11, d10, d01) (branchless) – 18s (27s),
  • min2(min2(d11, d10), d01) (also branchless) – 15s (25s),
  • min3(d11, d01, d10) (match version) – 14s (24s),
  • this post’s version – 9s (30s‽).

Anyway, bounds checks in this and @matklad’s version are not elided, but the CPU doesn’t seem to care. I’ve tried to get rid of them by rewriting the loop to use iterators; the bounds checks were gone, but with no speedup (or with a speedup < 10%). (That doesn’t mean it’s not worth eliminating bounds checks in general – it’s really important if you expect the compiler to vectorize the code.)
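For reference, a rough sketch of the kind of iterator rewrite meant here (my own illustration with assumed names, and a precomputed costs slice standing in for square_dist – not the exact code from the thread):

// Fills curr[1..] from prev without any indexing in the hot loop,
// so there are no bounds checks left to elide.
fn fill_row(prev: &[f64], curr: &mut [f64], costs: &[f64]) {
    let mut left = curr[0]; // curr[0] is assumed to be initialized by the caller
    for ((cell, win), cost) in curr[1..].iter_mut().zip(prev.windows(2)).zip(&costs[1..]) {
        // win = [prev[j - 1], prev[j]], left = curr[j - 1]
        *cell = win[0].min(win[1]).min(left) + cost;
        left = *cell;
    }
}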

1 Like

In my current code, using the ‘jmp’ version min3_jmp(d01, d11, d10), I achieve 12.5s (note: removing bounds checking by using get_unchecked actually allowed me to come down from 16s). When testing your last version, it goes up to 17 seconds. So I guess we are reaching CPU variability…

How can you tell that, in the branchy version, the code is parallel if d11 or d10 wins?

1 Like

Yeah, or probably llvm doing weird things. I’ve updated my times to include the non-native compilation and my last version went from 9s to 30s! (what’s more fun, the generated assembly is almost identical!)

I’m also curious what time you get on min2(min2(d11, d10), d01) (note the order)?

If “d01 wins” has its own branch, and the CPU is predicting the “d11 or d10” branch to be taken, the code the CPU sees is basically:

next[i] = min(prev[i-1], prev[i]) + self.square_dist(…);
next[i+1] = min(prev[i], prev[i+1]) + self.square_dist(…);
...

These lines can be computed in parallel, if the CPU has enough lookahead and spare FPU units.

If you’re talking about the branchy vs. branchless performance, I guess not. Maybe it’s worth reporting to LLVM, but it’s a really specific case. But the nightly regression is definitely worth reporting! You should minimize the test case first, though, I guess.

1 Like

I don’t know, if anyone mentioned it, but the password for the dataset zip is “attempttoclassify”.

Yes, you are absolutely correct! In my analysis I assumed that d11 wins significantly more often, but I had not actually measured it (because global variables are hard in Rust), and of course that assumption was wrong! d11 wins only about 15% more often than d01 or d10.

And indeed, taking min2 of two prev numbers and then looking at curr (min2(min2(d11, d10), d01)) brings the time for the benchmark from my previous post from 5400 ms to 4100 ms! The bad order of min2(min2(d11, d01), d10) takes 6100 ms.

I think the new (>0.4.3) versions of rayon do better without the call to weight.

The trick with min impressed me greatly so I’ve even blogged about it: https://matklad.github.io/2017/03/12/min-of-three.html :slight_smile:

10 Likes

It was a good read!

Imagine for a second that instead of vminsd %xmm0,%xmm1,%xmm0 instruction in the preceding assembly there is just vmovsd %xmm1,%xmm0. That is, we don’t use xmm0 from the previous iteration at all!

I’m really curious whether the CPU really performs such desugaring and speculative execution or… is it just that the CPU can run the vminsd (%rax,%rsi,8),%xmm1,%xmm1 from the next iteration simultaneously with vminsd %xmm0,%xmm1,%xmm0 from the current iteration?

A few posts above I tried to force the compiler to compile the second min2 to branchy code, thus forcing the CPU to do speculation, and the code got even faster (unfortunately only on my machine).

It’s interesting that we can make the computation truly parallel if we update the cells diagonally:

Whoa! It’s so simple and elegant that I feel stupid for not thinking about it before! It should also vectorize very well. I wouldn’t be surprised if it provided a 4× or even 8× perf improvement.

I’ve implemented diagonal update (Code on Rust playground) and the resulting code is not faster, even with explicit rayon parallelism.

Maybe I’m missing something obvious and it could be much faster, but my mind is completely exhausted by a billion off-by-one errors in this diagonal scheme :smile:

Though there is a possible explanation for why the diagonal version is slower:

If you fill the matrix row by row, the inner loop processes only three arrays: the current and previous rows of the matrix and one of the sequences.

With diagonals, you need three diagonals and both sequences.
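For reference, the anti-diagonal traversal itself looks roughly like this (a full-matrix sketch of my own for clarity – the playground code uses rolling diagonal buffers instead – with cost(i, j) standing in for square_dist):

// Every cell on the anti-diagonal i + j = diag depends only on diagonals
// diag - 1 and diag - 2, so the iterations of the inner loop are independent.
fn fill_by_diagonals(n: usize, m: usize, cost: impl Fn(usize, usize) -> f64) -> f64 {
    let inf = f64::INFINITY;
    let mut dp = vec![vec![inf; m]; n];
    for diag in 0..(n + m - 1) {
        let i_lo = diag.saturating_sub(m - 1);
        let i_hi = diag.min(n - 1);
        for i in i_lo..=i_hi {
            let j = diag - i;
            let up_left = if i > 0 && j > 0 { dp[i - 1][j - 1] } else { inf };
            let up = if i > 0 { dp[i - 1][j] } else { inf };
            let left = if j > 0 { dp[i][j - 1] } else { inf };
            let best = if i == 0 && j == 0 { 0.0 } else { up_left.min(up).min(left) };
            dp[i][j] = best + cost(i, j);
        }
    }
    dp[n - 1][m - 1]
}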

What’s the matrix size here? I think that if it’s about 270, as in the original example, then it’s a bit too small for rayon. Maybe not, though?

I’ve tried to diagonalize too; here’s a gist. It’s quite fast for me (9s vs 15s for the non-diagonalized version), but I don’t know why it doesn’t vectorize.

@matklad I’ve managed to vectorize it! Check out the gist, it should be updated now. It seems that llvm gets confused by the rotation of diagonal buffers – yet another bug to file. Adding another buffer seemed to help it. Now I’ve got it down to 3 seconds, so more than a 4× improvement over the non-vectorized one!

Edit: Moving the iteration to a separate function instead of creating another buffer also helps, improving from 3.6s to 2.5s.

1 Like

@matklad, @krdln Hi guys, I’m glad that you are really interested in this! The blog is a good read, and if you want to go further on the parallel part, you can take a look at the article “Parallelizing Dynamic Time Warping Algorithm Using Prefix Computations on GPU” (I don’t want to do article advertisement again…). The diagonal approach is somewhat known in the dynamic programming field, but it does not allow the GPU to be occupied efficiently. As the article is behind a paywall, here is the idea: you can actually do most of the work in parallel while using only the previous row. Then, to take care of the “line dependency”, you can use prefix computations, also on the GPU: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
Of course this is only worth it on big time series. Even a length of 270 may be too small to show a real benefit… (not tested yet).

It’s impressive to see how you guys went down from around 50s to around 3s!

Thanks for linking the article!

Hah! I had a feeling that you could use prefix sum for that although for some reason I wasn’t 100% convinced. Now I know :slight_smile:

I’m not sure I understand this. I assume the algorithm described in the article has to do a kernel call for each row. So basically the main problem with diagonals is that you’re running the kernel about twice as many times compared to once per row (for short diagonals most of the overhead comes from launching the kernel; the overall amount of computation is the same). But then they try to overcome that issue by doing a prefix sum for each row, which costs, let’s assume, 2-5 additional kernel calls? I’d like to see a comparison between the paper’s approach and using diagonals on the GPU.

Overall, this problem seems to be memory bound – the GPUs are doing just a few instructions per memory read/write (in the computation itself; the prefix sum is slightly more compute-heavy). I feel that to really squeeze out the GPU performance one should perform the algorithm in a diagonal style but process several rows at a time – that would mean one GPU block would have to perform additional calculations to avoid synchronization, but it would greatly reduce the memory traffic.

I’m not a GPU expert, but if I understand correctly, the problem with the diagonal is that at the first step you only have one computation to do, then two, etc. So you spend a lot of time before being able to “saturate” the GPU. The paper does not compare against a diagonal calculation though… I’ll have to check that, but unfortunately I must focus on other matters right now…

I still have 2 quick Rust things: I switched to an implementation with only two rows instead of the whole matrix. I’m using a toolbox that provides the buffers:

struct TimeSerieWS {
    /// DTW working space: we need two slices
    dtw_slice1: Vec<f64>,
    dtw_slice2: Vec<f64>,
}

impl TimeSerieWS {
    fn new(ts: &TimeSerie) -> TimeSerieWS {
        let dtw1 = vec![0f64; ts.len()];
        let dtw2 = vec![0f64; ts.len()];
        TimeSerieWS { dtw_slice1: dtw1, dtw_slice2: dtw2 }
    }
}

And in my DTW code I have:

fn compute_dtw<'a>(&self, other: &Self, ws: &'a mut TimeSerieWS) -> f64 {
    let mut curr: &mut [f64] = &mut ws.dtw_slice1;
    let mut prev: &mut [f64] = &mut ws.dtw_slice2;
    // ...

  • If I don’t put the explicit &mut [f64], the code takes 17s instead of 12s!! The code uses unsafe and get_unchecked methods. Is that normal?
  • I wanted to put slices directly into my working space struct, but I can’t achieve it:
struct TimeSerieWS<'a> {
    dtw_slice1: &'a mut [f64],
    dtw_slice2: &'a mut [f64],
}

fn new(ts: &TimeSerie) -> TimeSerieWS<'a> {
    let dtw1 = vec![0f64; ts.len()];
    let dtw2 = vec![0f64; ts.len()];
    TimeSerieWS { dtw_slice1: &mut dtw1, dtw_slice2: &mut dtw2 }
}

That surely won’t work; I understand that I should rather ‘move’ things into the structure than pass a borrow, but I’m struggling to find out how to do that…
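One way to express that ‘move’ (a sketch of my own, not from the thread, reusing the field names above but with a hypothetical new(len) constructor): let the struct keep owning the Vecs, as in the first version, and hand out plain slices only for the duration of a call:

struct TimeSerieWS {
    dtw_slice1: Vec<f64>,
    dtw_slice2: Vec<f64>,
}

impl TimeSerieWS {
    fn new(len: usize) -> TimeSerieWS {
        TimeSerieWS {
            // the freshly created Vecs are moved into the struct here
            dtw_slice1: vec![0f64; len],
            dtw_slice2: vec![0f64; len],
        }
    }

    // Borrow both buffers as &mut [f64]; the borrows end when the caller is done with them.
    fn buffers(&mut self) -> (&mut [f64], &mut [f64]) {
        (&mut self.dtw_slice1[..], &mut self.dtw_slice2[..])
    }
}

A call site would then look like let (curr, prev) = ws.buffers();.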