Hm, I think I've found out what is different between Rust and Kotlin...
Turns out Rust is slower because rustc generates better code (sic!).
The actual bottleneck is the calculation of the minimum of three f64 numbers.
In Rust, I've implemented two variants of the function:
```rust
pub fn min3_jmp(x: f64, y: f64, z: f64) -> f64 {
    match (x < y, x < z, y < z) {
        (true, true, _) => x,
        (true, false, _) => z,
        (false, _, true) => y,
        (false, _, false) => z,
    }
}

pub fn min3_vec(x: f64, y: f64, z: f64) -> f64 {
    if x < y {
        if x < z { x } else { z }
    } else {
        if y < z { y } else { z }
    }
}
```
It is instructive to see the assembly for both variants: Compiler Explorer.
The version with nested ifs generates vectorized code. The version with match generates compares and jumps (that is, the match is actually compiled to something resembling nested ifs).
And it turns out that the non-vectorized version is faster on my CPU! On a smaller input file, it takes 5400 ms, while the vectorized version needs 6200 ms and Kotlin finishes in 5800 ms.
It's useful to take a look at the call site:
```rust
let d11 = prev[idx_col - 1];
let d01 = curr[idx_col - 1];
let d10 = prev[idx_col];
curr[idx_col] = min3_jmp(d11, d01, d10) + square_dist(xs, idx_line, ys, idx_col);
```
`d11` comes first, and almost always it will be the smallest number (it looks like the original series align pretty well). That means the non-vectorized version almost always takes the main branch of both ifs. If we change the order to `min3_jmp(d01, d10, d11)`, then the non-vectorized version slows down to 6200 ms, while the vectorized version still needs 6200 ms (can someone with knowledge of CPU microarchitecture explain why these numbers align so perfectly? My CPU is an Intel(R) Core(TM) i7-3612QM @ 2.10GHz).
Perf Stat

Non-vectorized, d11 first: 5442 ms

```
Total error: 1208231.6119755413

 Performance counter stats for 'target/release/ts small.csv':

       5455.437177 task-clock (msec)         #    1.000 CPUs utilized
                 3 context-switches          #    0.001 K/sec
                 0 cpu-migrations            #    0.000 K/sec
               298 page-faults               #    0.055 K/sec
    11,429,874,088 cycles                    #    2.095 GHz
     1,628,326,391 stalled-cycles-frontend   #   14.25% frontend cycles idle
   <not supported> stalled-cycles-backend
    26,743,341,581 instructions              #    2.34  insns per cycle
                                             #    0.06  stalled cycles per insn
     7,776,354,815 branches                  # 1425.432 M/sec
       111,696,902 branch-misses             #    1.44% of all branches

       5.455566139 seconds time elapsed
```
Non-vectorized, d11 last: 6194 ms

```
Total error: 1208231.6119755413

 Performance counter stats for 'target/release/ts small.csv':

       6208.593222 task-clock (msec)         #    1.000 CPUs utilized
                 3 context-switches          #    0.000 K/sec
                 0 cpu-migrations            #    0.000 K/sec
               298 page-faults               #    0.048 K/sec
    13,007,845,095 cycles                    #    2.095 GHz
     2,557,504,407 stalled-cycles-frontend   #   19.66% frontend cycles idle
   <not supported> stalled-cycles-backend
    26,506,017,774 instructions              #    2.04  insns per cycle
                                             #    0.10  stalled cycles per insn
     7,706,159,595 branches                  # 1241.209 M/sec
       130,527,368 branch-misses             #    1.69% of all branches

       6.208684533 seconds time elapsed
```
Vectorized: 6210 ms

```
Total error: 1208231.6119755413

 Performance counter stats for 'target/release/ts small.csv':

       6223.569723 task-clock (msec)         #    1.000 CPUs utilized
                 8 context-switches          #    0.001 K/sec
                 0 cpu-migrations            #    0.000 K/sec
               299 page-faults               #    0.048 K/sec
    13,039,201,138 cycles                    #    2.095 GHz
     5,709,706,972 stalled-cycles-frontend   #   43.79% frontend cycles idle
   <not supported> stalled-cycles-backend
    24,788,937,762 instructions              #    1.90  insns per cycle
                                             #    0.23  stalled cycles per insn
     4,367,009,413 branches                  #  701.689 M/sec
         5,479,578 branch-misses             #    0.13% of all branches

       6.223708120 seconds time elapsed
```
So, by replacing the naive nested ifs, which generate the best code, with a somewhat awkward match, which generates worse code, we are able to match HotSpot's performance, yay!
But what code does the HotSpot JIT generate? I don't know a handy way of inspecting the generated assembly, but if I change the order of `d11` in Kotlin the same way as in Rust, performance drops from 5600 ms to 6800 ms, so it is most likely that HotSpot does not vectorize `min3`.