Performance issue with C-array-like computation (2 times worse than naive Java)

To make JVM/Rust comparison easier, I've written the same code in Rust and Kotlin: https://github.com/matklad/kotlin-rust-dp/tree/master/src

On my machine Kotlin is indeed faster than Rust (with --target-cpu=native). Rust takes 32200 ms, and Kotlin 28000 - 29000 ms.


Hm, I think I've found out what is different between Rust and Kotlin...

Turns out Rust is slower because rustc generates better code (sic!).

The actual bottleneck is the calculation of the minimum of three f64 numbers.

In Rust, I've implemented two variants of the function:

pub fn min3_jmp(x: f64, y: f64, z: f64) -> f64 {
    match (x < y, x < z, y < z) {
        (true, true, _) => x,
        (true, false, _) => z,
        (false, _, true) => y,
        (false, _, false) => z,
    }
}

pub fn min3_vec(x: f64, y: f64, z: f64) -> f64 {
    if x < y {
        if x < z { x } else { z }
    } else {
        if y < z { y } else { z }
    }
}

It is instructive to see the assembly for both variants: Compiler Explorer

The version with nested ifs generates vectorized, branchless code. The version with match generates comparisons and jumps (that is, the match is actually compiled to something resembling nested ifs).

And it turns out that the non-vectorized version is faster on my CPU! On a smaller input file it takes 5400 ms, while the vectorized version needs 6200 ms and Kotlin finishes in 5800 ms.

It's useful to take a look at the call site:

let d11 = prev[idx_col - 1]; // diagonal neighbour (previous row, previous column)
let d01 = curr[idx_col - 1]; // left neighbour (current row, previous column)
let d10 = prev[idx_col];     // upper neighbour (previous row, same column)
curr[idx_col] = min3_jmp(d11, d01, d10) + square_dist(xs, idx_line, ys, idx_col);

d11 is first, and almost always it will be the smallest number (it looks like the original series align pretty well). That means that the non-vectorized version always takes the main branch of both ifs. If we change the order to min3_jmp(d01, d10, d11), then the non-vectorized version slows down to 6200 ms and the vectorized version still needs 6200 ms (can someone with knowledge of CPU microarchitecture explain why these numbers align perfectly? My CPU is Intel(R) Core(TM) i7-3612QM CPU @ 2.10GHz).

Perf Stat

Non-vectorized, d11 first.

5442 ms
Total error: 1208231.6119755413

 Performance counter stats for 'target/release/ts small.csv':

       5455.437177      task-clock (msec)         #    1.000 CPUs utilized          
                 3      context-switches          #    0.001 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               298      page-faults               #    0.055 K/sec                  
    11,429,874,088      cycles                    #    2.095 GHz                    
     1,628,326,391      stalled-cycles-frontend   #   14.25% frontend cycles idle   
   <not supported>      stalled-cycles-backend   
    26,743,341,581      instructions              #    2.34  insns per cycle        
                                                  #    0.06  stalled cycles per insn
     7,776,354,815      branches                  # 1425.432 M/sec                  
       111,696,902      branch-misses             #    1.44% of all branches        

       5.455566139 seconds time elapsed

Non-vectorized, d11 last

6194 ms
Total error: 1208231.6119755413

 Performance counter stats for 'target/release/ts small.csv':

       6208.593222      task-clock (msec)         #    1.000 CPUs utilized          
                 3      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               298      page-faults               #    0.048 K/sec                  
    13,007,845,095      cycles                    #    2.095 GHz                    
     2,557,504,407      stalled-cycles-frontend   #   19.66% frontend cycles idle   
   <not supported>      stalled-cycles-backend   
    26,506,017,774      instructions              #    2.04  insns per cycle        
                                                  #    0.10  stalled cycles per insn
     7,706,159,595      branches                  # 1241.209 M/sec                  
       130,527,368      branch-misses             #    1.69% of all branches        

       6.208684533 seconds time elapsed

Vectorized

6210 ms
Total error: 1208231.6119755413

 Performance counter stats for 'target/release/ts small.csv':

       6223.569723      task-clock (msec)         #    1.000 CPUs utilized          
                 8      context-switches          #    0.001 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               299      page-faults               #    0.048 K/sec                  
    13,039,201,138      cycles                    #    2.095 GHz                    
     5,709,706,972      stalled-cycles-frontend   #   43.79% frontend cycles idle   
   <not supported>      stalled-cycles-backend   
    24,788,937,762      instructions              #    1.90  insns per cycle        
                                                  #    0.23  stalled cycles per insn
     4,367,009,413      branches                  #  701.689 M/sec                  
         5,479,578      branch-misses             #    0.13% of all branches        

       6.223708120 seconds time elapsed

So, by replacing the naive nested ifs, which generate the best code, with a somewhat awkward match, which generates worse code, we are able to match HotSpot's performance, yay!

But what code does the HotSpot JIT generate? I don't know a handy way of inspecting the generated assembly, but if I change the order of d11 in Kotlin the same way as in Rust, performance drops from 5600 ms to 6800 ms, so it is most likely that HotSpot does not vectorize min3.


I don't know if this is still usable:


I mentioned upthread that HotSpot does not vectorize (apart from memcpys) :slight_smile:. Inspecting HotSpot assembly is a pain: you need to put a disassembler SO/dylib on the ld path and then add some command-line flags when running the JVM. The PrintAssembly page on the OpenJDK Wiki has more info.


@matklad That's amazing! Thank you so much!
On my computer (Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz)

Java code: 21s (to answer your question, it pre-allocates a double)

Rust vectorized (with -C target-cpu=native): 32s

Rust jump (d01, d10, d11): 16s
Rust jump (d01, d11, d10): 14s
Rust jump (d10, d01, d11): 17s
Rust jump (d10, d11, d01): 14s
Rust jump (d11, d01, d10): 15s
Rust jump (d11, d10, d01): 14s

Of course none of the Rust jump orderings is inherently better than another, as this is data dependent.
But overall they win over Java by around 20%, so pretty good!

It's a bit weird to see "better" code run slower... this is actually scary! I mean, the jump version only takes 50% of the time of the vectorized one! With perf stat we can clearly see that the number of instructions per cycle is dropping even though there is basically no branch misprediction.

Is that an issue that should be reported to the Rust project?

I wouldn't call it vectorized; it just happens to use some vector instructions. I think one should call it just branchless. Anyway, there's another way you can rewrite min3 – basically by using two min2s (godbolt):

pub fn min3_two_steps(x: f64, y: f64, z: f64) -> f64 {
    let xy = if x < y { x } else { y };
    if z < xy { z } else { xy }
}

This version generates the cleanest assembly – it's basically just two minsds – but I think I've tested it and the program slowed down. It's quite crazy! (Actually, maybe not so crazy, because this version has two min instructions that must execute serially, whereas your versions can do the comparisons in parallel and then just use some simple instructions to get the final result.)


I can definitely recommend rayon. Basically, the first step is to change the loop to use iterators – something.iter().map(...).sum(). (Maybe even parallelize the nested loop too by adding a flat_map inside the chain.) After you've rewritten the loop in iterator style, plugging in rayon should be just a matter of changing iter to par_iter. It's also worth setting an appropriate weight, because rayon by default assumes that a single operation is cheap.
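Something along these lines (a rough sketch of the iterator shape I mean; total_error, series, dtw and the pairwise pairing are placeholders, not your actual code):

use rayon::prelude::*;

fn total_error<F>(series: &[Vec<f64>], dtw: F) -> f64
where
    F: Fn(&[f64], &[f64]) -> f64 + Sync,
{
    let dtw = &dtw; // capture by shared reference so the closures stay Fn
    series
        .par_iter() // was .iter() in the sequential version
        .enumerate()
        .flat_map(|(i, a)| {
            // the nested loop expressed as a flat_map inside the chain
            series[i + 1..].par_iter().map(move |b| dtw(&a[..], &b[..]))
        })
        .sum()
}

The flat_map keeps the pairing lazy, so rayon can split work across both loops instead of only the outer one.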


@krdln Thank you!
I'll try rayon; being able to parallelize easily is crucial.
Trying with min3_two_steps: back to 30s...

I'm happy I gave Rust a shot, and I'll probably keep using it, at least for my own testing!

That's a much better term indeed, thanks!

I think we can say it's branchless but it also carries a lot more data dependencies – the vector instructions all seem to have inputs that are computed right before, i.e. quite a bit of serial execution (and IPC suffers). The branchy code has more total instructions and branches, but it ends up being well predicted and has some independent data paths.

So one can say the vector version will have better worst-case and average-case performance when fed unpredictable inputs, but it obviously loses the best case when inputs are predictable.

This may be where PGO could help, although I don't know if LLVM will generate anything different for the highly predicted case. And of course PGO requires feeding it representative input, which isn't always feasible.

I've looked a little bit more into this curious min3 behaviour and I think I've realized what's going on!

The first hypothesis was that one of the branches was really being taken significantly more often than the others, but it seems that d11, d10 and d01 "win" about the same number of times. The pattern might still be predictable, although I haven't tested it.

So what's happening is basically what @vitalyd is saying about independent data paths – notice that when d11 or d10 wins, the whole code is basically parallel! The serial dependency occurs only when d01 wins (about one third of the cases). So I pushed the idea further and tried to separate the parallel path from the serial path (full code):

                let d = {
                    let d11 = prev[idx_col - 1];
                    let d01 = curr[idx_col - 1];
                    let d10 = prev[idx_col];
                    // min of the two prev-row values: independent of the cell
                    // just written into curr (the parallel path)
                    let mmm = min2(d11, d10);
                    if d01 < mmm {
                        // only this branch depends serially on the previous cell
                        d01 + self.square_dist(idx_line, idx_col, other)
                    } else {
                        mmm + self.square_dist(idx_line, idx_col, other)
                    }
                };

So this is basically the fastest version on my CPU! (The min2 is implemented as if x < y { x } else { y } and compiles to a single minsd instruction.)
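Spelled out, that min2 is just:

pub fn min2(x: f64, y: f64) -> f64 {
    // compiles to a single minsd (it has exactly minsd's pick-the-second-on-NaN semantics)
    if x < y { x } else { y }
}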

Also, a quite interesting observation: when instead of min3(d11, d01, d10) you do min2(min2(d11, d10), d01), you also get a nice speedup (note the slight change in argument order). I presume that's because you don't have to wait on d01 when computing the first min2.

The times I get on my machine – with target-cpu=native (and without):

  • min3(d11, d01, d10) (branchless) – 21s (35s)
  • min3(d11, d10, d01) (branchless) – 18s (27s)
  • min2(min2(d11, d10), d01) (also branchless) – 15s (25s)
  • min3(d11, d01, d10) (match version) – 14s (24s)
  • this post's version – 9s (30s?!)

Anyway, the bounds checks in this and @matklad's version are not elided, but the CPU doesn't seem to care. I've tried to get rid of the bounds checks by rewriting the loop to use iterators; the bounds checks were gone, but with no speedup (or with a speedup below 10%). (That doesn't mean it's not worth eliminating bounds checks in general – it's really important if you expect the compiler to vectorize the code.)
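For the record, the kind of iterator rewrite I mean looks roughly like this (a simplified sketch, not my actual code; cost_row stands in for the square_dist values of the current row):

fn row_update(prev: &[f64], curr: &mut [f64], cost_row: &[f64]) {
    let min2 = |x: f64, y: f64| if x < y { x } else { y };
    // Carry the just-written left neighbour in a local instead of re-reading
    // curr[idx_col - 1]; zipping the slices lets the compiler drop the bounds checks.
    let mut d01 = curr[0];
    let zipped = curr[1..]
        .iter_mut()
        .zip(prev.iter().zip(&prev[1..]))
        .zip(&cost_row[1..]);
    for ((out, (&d11, &d10)), &cost) in zipped {
        let v = min2(min2(d11, d10), d01) + cost;
        *out = v;
        d01 = v;
    }
}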


In my current code, using the 'jmp' version min3_jmp(d01, d11, d10), I achieve 12.5s (note: removing bounds checking by using get_unchecked actually allowed me to come down from 16s). When testing your last version, it goes up to 17s. So I guess we are reaching CPU variability...
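The get_unchecked change is essentially this (an illustrative sketch with made-up names, not my exact code; the loop bounds already guarantee the indices are in range):

fn load_neighbours(prev: &[f64], curr: &[f64], idx_col: usize) -> (f64, f64, f64) {
    // Skipping the bounds checks is sound only because the caller iterates
    // with 1 <= idx_col < prev.len() and curr.len() == prev.len().
    unsafe {
        (
            *prev.get_unchecked(idx_col - 1), // d11
            *curr.get_unchecked(idx_col - 1), // d01
            *prev.get_unchecked(idx_col),     // d10
        )
    }
}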

How do you see that, in the branchy version, the code is parallel if d11 or d10 wins?


Yeah, or probably LLVM doing weird things. I've updated my times to include the non-native compilation and my last version went from 9s to 30s! (What's more fun, the generated assembly is almost identical!)

I'm also curious what time you get on min2(min2(d11, d10), d01) (note the order)?

If "the d01 wins" has its own branch, and the CPU is predicting the "d11 or d10" branch to be taken, the code the CPU sees is basically:

next[i] = min(prev[i-1], prev[i]) + self.square_dist(…);
next[i+1] = min(prev[i], prev[i+1]) + self.square_dist(…);
...

These lines can be computed in parallel, if the CPU has enough lookahead and spare FPU units.

If you're talking about the branch vs. branchless performance, I guess not. Maybe to LLVM, but it's a really specific case. But the nightly regression is definitely worth reporting! You should minimize the testcase though, I guess.


I don't know if anyone mentioned it, but the password for the dataset zip is "attempttoclassify".

Yes, you are absolutely correct! In my analysis I assumed that d11 wins significantly more often, but I had not actually measured it (because global variables are hard in Rust), and of course it is wrong: d11 wins only about 15% more often than d01 or d10.
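(If anyone wants to reproduce the measurement, a static atomic counter is one way to count wins without fighting mutable globals – an illustrative sketch, not necessarily how I got the 15% figure:)

use std::sync::atomic::{AtomicUsize, Ordering};

static D11_WINS: AtomicUsize = AtomicUsize::new(0);

pub fn min3_counted(d11: f64, d01: f64, d10: f64) -> f64 {
    let m = if d01 < d10 { d01 } else { d10 };
    if d11 <= m {
        // count how often the diagonal neighbour provides the minimum
        D11_WINS.fetch_add(1, Ordering::Relaxed);
        d11
    } else {
        m
    }
}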

And indeed, taking the min2 of the two prev numbers and then looking at curr (min2(min2(d11, d10), d01)) brings the time for the benchmark from my previous post down from 5400 ms to 4100 ms! The bad order, min2(min2(d11, d01), d10), takes 6100 ms.

I think the new (>0.4.3) versions of rayon do better without the call to weight.

The trick with min impressed me greatly so I've even blogged about it: Min of Three :slight_smile:


It was a good read!

Imagine for a second that instead of vminsd %xmm0,%xmm1,%xmm0 instruction in the preceding assembly there is just vmovsd %xmm1,%xmm0. That is, we don't use xmm0 from the previous iteration at all!

I'm really curious if the CPU really performs such desugaring and speculative execution or... is it just that the CPU can run the vminsd (%rax,%rsi,8),%xmm1,%xmm1 from the next iteration simultaneously with vminsd %xmm0,%xmm1,%xmm0 from the current iteration.

A few posts above I tried to force the compiler to compile the second min2 to branchy code, thus forcing the CPU to do speculation, and the code got even faster (unfortunately only on my machine).

It's interesting that we can make the computation truly parallel if we update the cells diagonally:

Whoa! It's so simple and elegant that I feel stupid for not thinking about it before! It should also vectorize very well. I wouldn't be surprised if it could provide a 4× or even 8× perf improvement.

I've implemented the diagonal update (Code on Rust playground) and the resulting code is not faster, even with explicit rayon parallelism.

Maybe I'm missing something obvious and it could be much faster, but my mind is completely exhausted by the billion off-by-one errors in this diagonal scheme :smile:

Though there is a possible explanation for why the diagonal version is slower:

If you fill the matrix row by row, the inner loop touches only three arrays: the current and previous rows of the matrix and one of the sequences.

With diagonals, you need three diagonals and both sequences.
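Roughly, the scheme looks like this (a simplified sketch with a plain squared-distance cost, not the exact playground code):

fn min3(x: f64, y: f64, z: f64) -> f64 {
    let xy = if x < y { x } else { y };
    if z < xy { z } else { xy }
}

fn dtw_diagonal(xs: &[f64], ys: &[f64]) -> f64 {
    let (m, n) = (xs.len(), ys.len());
    assert!(m > 0 && n > 0);
    let cost = |i: usize, j: usize| {
        let d = xs[i] - ys[j];
        d * d
    };

    // Three rolling diagonals, each indexed by the row i; entries outside the
    // current band are simply unused.
    let mut prev2 = vec![0.0f64; m];
    let mut prev1 = vec![0.0f64; m];
    let mut curr = vec![0.0f64; m];

    for k in 0..(m + n - 1) {
        let i_lo = k.saturating_sub(n - 1);
        let i_hi = k.min(m - 1);
        for i in i_lo..=i_hi {
            // Each iteration of this inner loop is independent of the others:
            // it reads only the two previous diagonals.
            let j = k - i;
            curr[i] = if i == 0 && j == 0 {
                cost(0, 0)
            } else if i == 0 {
                prev1[0] + cost(0, j)     // d[0][j-1] lives on diagonal k-1
            } else if j == 0 {
                prev1[i - 1] + cost(i, 0) // d[i-1][0] lives on diagonal k-1
            } else {
                min3(prev2[i - 1], prev1[i], prev1[i - 1]) + cost(i, j)
            };
        }
        // Rotate the buffers: prev1 becomes prev2, curr becomes prev1.
        std::mem::swap(&mut prev2, &mut prev1);
        std::mem::swap(&mut prev1, &mut curr);
    }
    // Cell (m - 1, n - 1) lies on the last diagonal, which ends up in prev1.
    prev1[m - 1]
}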

What's the matrix size here? I think that if it's about 270, as in the original example, then it's a little too small for rayon. Maybe not though?

I've tried to diagonalize too, here's a gist. It's quite fast for me (9s vs 15s non-diagonalized), but I don't know why it doesn't vectorize.