My .cargo/config did have target-cpu=native:
$ cat .cargo/config
[target.'cfg(any(windows, unix))']
rustflags = ["-C", "target-cpu=native", "-C", "llvm-args=-ffast-math", "-C", "opt-level=3", "-C", "llvm-args=-force-vector-width=16"]
I've now taken out "-C", "llvm-args=-force-vector-width=16" (forcing the vector width mostly just seems necessary with GCC compilers).
Anyway, very nice! That was a simple fix.
However, even with -C llvm-args=-ffast-math it won't emit f[n]m[add/sub] instructions.
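One way around that is to request fusion explicitly: the stable f32::mul_add method lowers to a vfmadd* instruction whenever the target supports FMA (which target-cpu=native enables here). A minimal sketch (dot3_fused is just an illustration, not part of the code above):

// f32::mul_add is guaranteed to round once, and compiles to a fused
// multiply-add instruction when the target has FMA.
fn dot3_fused(a: [f32; 3], b: [f32; 3]) -> f32 {
    a[2].mul_add(b[2], a[1].mul_add(b[1], a[0] * b[0]))
}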
Additionally, it does domain checks on the square roots:
vcmpltps %zmm0, %zmm12, %k1
vsqrtps %zmm12, %zmm12
vdivps %zmm12, %zmm1, %zmm12
vbroadcastss %xmm2, %zmm12 {%k1}
or if you prefer Intel syntax:
vcmpltps k1, zmm12, zmm0
vsqrtps zmm12, zmm12
vdivps zmm12, zmm1, zmm12
vbroadcastss zmm12 {k1}, xmm2
gcc also does the checks, but it used masked square root operations. I'm not sure what xmm2 contains; I'd have to try the code with non-positive-definite matrices to find out.
But in practice, the inputs should be guaranteed to contain valid data (so that the arguments to the square roots are positive).
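If that guarantee holds, it can at least be made explicit in debug builds. A sketch (checked_inv_sqrt is a hypothetical helper, not part of the benchmarked code):

// Hypothetical helper: documents the positivity precondition and catches bad
// inputs in debug builds. It does not remove the compiler's domain check.
fn checked_inv_sqrt(x: f32) -> f32 {
    debug_assert!(x > 0.0, "expected a positive argument");
    1.0 / x.sqrt()
}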
EDIT:
However, something was incorrect: the results were wrong (mostly NaN).
fn processbpp_safe(x: &mut [f32], bpp: &[f32], n: usize) {
    let x = &mut x[0..3 * n];
    let bpp = &bpp[0..10 * n];
    // This treats the data as "array of structs": one chunk of 3 x-values and
    // one window of 10 bpp-values per iteration. As explained below, the data
    // is actually laid out "struct of arrays" style, so this indexing is wrong.
    for (x, bpp) in x.chunks_exact_mut(3).zip(bpp.windows(10)) {
        let i = 0;
        let n = 1;
        let (x1, x2, x3) = pdbacksolve(
            (bpp[i], bpp[i + n], bpp[i + 2 * n]),
            (bpp[i + 4 * n], bpp[i + 5 * n], bpp[i + 6 * n], bpp[i + 7 * n], bpp[i + 8 * n], bpp[i + 9 * n]),
        );
        x[i] = x1;
        x[i + n] = x2;
        x[i + 2 * n] = x3;
    }
}
I suspect the issue is the stride. The data layout is "struct of arrays" style: x is organized as n instances of x1, followed by n instances of x2, etc.
For what it's worth, this awkward strided version is also vectorized, and makes the underlying data layout clearer (the sketch below reconstructs it; the function name is a placeholder):
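// Reconstructed sketch of the strided, struct-of-arrays version: the same
// indexing as processbpp_safe, but with the stride n applied to full slices.
fn processbpp_strided(x: &mut [f32], bpp: &[f32], n: usize) {
    let x = &mut x[0..3 * n];
    let bpp = &bpp[0..10 * n];
    for i in 0..n {
        let (x1, x2, x3) = pdbacksolve(
            (bpp[i], bpp[i + n], bpp[i + 2 * n]),
            (bpp[i + 4 * n], bpp[i + 5 * n], bpp[i + 6 * n], bpp[i + 7 * n], bpp[i + 8 * n], bpp[i + 9 * n]),
        );
        x[i] = x1;
        x[i + n] = x2;
        x[i + 2 * n] = x3;
    }
}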
Ideally, there would be an actual struct wrapping the arrays to make the API (and working with elements) clearer.
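Something along these lines, for example (SoaVec3 is a hypothetical wrapper, not code that was benchmarked):

// A hypothetical wrapper that encodes the struct-of-arrays layout once,
// instead of scattering the stride arithmetic across call sites.
struct SoaVec3<'a> {
    data: &'a mut [f32], // n x1 values, then n x2 values, then n x3 values
    n: usize,
}

impl<'a> SoaVec3<'a> {
    fn set(&mut self, i: usize, (x1, x2, x3): (f32, f32, f32)) {
        self.data[i] = x1;
        self.data[i + self.n] = x2;
        self.data[i + 2 * self.n] = x3;
    }
}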
EDIT:
For what it's worth, here is a series of benchmarks run from Julia, with N = 2827, with results sorted from fastest to slowest:
julia> @benchmark clangtest($X32, $BPP32, $N) # LLVM, C++
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.032 μs (0.00% GC)
median time: 2.042 μs (0.00% GC)
mean time: 2.132 μs (0.00% GC)
maximum time: 5.793 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
julia> @benchmark icpctest($X32, $BPP32, $N) # Intel, C++
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.284 μs (0.00% GC)
median time: 2.364 μs (0.00% GC)
mean time: 2.429 μs (0.00% GC)
maximum time: 6.087 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
julia> @benchmark flangtest2($X32, $BPP32, $N) # LLVM Fortran, with workaround for GNU performance bug
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.330 μs (0.00% GC)
median time: 2.346 μs (0.00% GC)
mean time: 2.454 μs (0.00% GC)
maximum time: 6.004 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
julia> @benchmark flangtest($X32, $BPP32, $N) # LLVM Fortran
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.330 μs (0.00% GC)
median time: 2.404 μs (0.00% GC)
mean time: 2.477 μs (0.00% GC)
maximum time: 6.159 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
julia> @benchmark ispctest($X32, $BPP32, $N) # ispc, an LLVM-based single-program-multiple-data compiler
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.348 μs (0.00% GC)
median time: 2.429 μs (0.00% GC)
mean time: 2.477 μs (0.00% GC)
maximum time: 6.165 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
julia> @benchmark juliatest_fast!($X32, $BPP32) # Julia, but I cheated with a loop vectorization macro and rsqrt intrinsic
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.528 μs (0.00% GC)
median time: 2.605 μs (0.00% GC)
mean time: 2.665 μs (0.00% GC)
maximum time: 6.547 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
julia> @benchmark juliatest!($X32, $BPP32) # like the above, but I added a Newton step after the rsqrt for increased accuracy
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.554 μs (0.00% GC)
median time: 3.760 μs (0.00% GC)
mean time: 3.885 μs (0.00% GC)
maximum time: 7.952 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 8
julia> @benchmark gforttest2($X32, $BPP32, $N) # GNU Fortran, with workaround
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.244 μs (0.00% GC)
median time: 4.259 μs (0.00% GC)
mean time: 4.428 μs (0.00% GC)
maximum time: 9.575 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 7
julia> @benchmark gpptest($X32, $BPP32, $N) # GNU C++
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.255 μs (0.00% GC)
median time: 4.375 μs (0.00% GC)
mean time: 4.449 μs (0.00% GC)
maximum time: 9.463 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 7
julia> @benchmark rusttest($X32, $BPP32, $NU) # Rust
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.547 μs (0.00% GC)
median time: 4.600 μs (0.00% GC)
mean time: 4.747 μs (0.00% GC)
maximum time: 10.810 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 7
julia> @benchmark ifort2($X32, $BPP32, $N) # Intel Fortran; with workaround for GNU Fortran performance bug
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 16.798 μs (0.00% GC)
median time: 17.011 μs (0.00% GC)
mean time: 17.473 μs (0.00% GC)
maximum time: 50.643 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark ifort($X32, $BPP32, $N) # Intel Fortran; no workaround for GNU Fortran performance bug
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 16.796 μs (0.00% GC)
median time: 17.049 μs (0.00% GC)
mean time: 17.683 μs (0.00% GC)
maximum time: 50.748 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark gforttest($X32, $BPP32, $N) # GNU Fortran; no workaround for GNU Fortran performance bug
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 150.899 μs (0.00% GC)
median time: 151.311 μs (0.00% GC)
mean time: 156.347 μs (0.00% GC)
maximum time: 226.430 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
Basic summary: clang++ produced the fastest code, at about 2 microseconds.
Rust clocked in at around 4.5 microseconds. This is largely because it computes the reciprocal square roots by actually dividing 1 by a square root; that is more accurate than the rsqrt approximation, so it is a reasonable choice.
The lack of fused arithmetic instructions also plays a role.
g++, and gfortran run on the code rewritten to avoid its performance regression, were a little faster than Rust. They use the reciprocal square root intrinsic, but add a Newton step for better accuracy; they also fuse multiplications and additions (see the sketch after this summary).
The Julia version using a Newton iteration after the reciprocal square root took about 3.6 microseconds; without that step, it took 2.6. (For the record, all of the compilers that used reciprocal square roots on their own did add the step.)
ifort fell far behind at around 16 microseconds, while gfortran completely failed to vectorize the first version of the Fortran code, taking over 150 microseconds (there are open bug reports on bugzilla).
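For reference, here is roughly what that rsqrt-plus-Newton-step sequence looks like, written with x86 intrinsics in Rust (a sketch assuming an AVX target; the compilers above emit the equivalent automatically, and the real kernels use wider registers):

// vrsqrtps gives a ~12-bit reciprocal square root estimate; one
// Newton-Raphson step, r * (1.5 - 0.5 * x * r * r), refines it to
// roughly full single precision.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn rsqrt_nr(x: std::arch::x86_64::__m256) -> std::arch::x86_64::__m256 {
    use std::arch::x86_64::*;
    let r = _mm256_rsqrt_ps(x);
    let half_x = _mm256_mul_ps(_mm256_set1_ps(0.5), x);
    let r2 = _mm256_mul_ps(r, r);
    let t = _mm256_sub_ps(_mm256_set1_ps(1.5), _mm256_mul_ps(half_x, r2));
    _mm256_mul_ps(r, t)
}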
C++ is the C++ code, assembly, and compiler options.
Rust is the Rust code.
Fortran is the Fortran code; GNUprocessBPP has the workaround for GNU Fortran, while processBPP does not. Unfortunately, Flang is not included among the selectable compilers.
ispc is the ispc code.
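For completeness: the Rust version was called from Julia through the C ABI. A minimal sketch of such a shim (the exported name, library name, and cdylib setup here are assumptions, not the actual benchmark harness):

// Exported with the C ABI so Julia can ccall it, e.g.
// ccall((:rusttest, "librusttest"), Cvoid,
//       (Ptr{Float32}, Ptr{Float32}, Csize_t), X32, BPP32, N)
#[no_mangle]
pub unsafe extern "C" fn rusttest(x: *mut f32, bpp: *const f32, n: usize) {
    // Rebuild slices from the raw pointers and lengths Julia passes in.
    let x = std::slice::from_raw_parts_mut(x, 3 * n);
    let bpp = std::slice::from_raw_parts(bpp, 10 * n);
    processbpp_safe(x, bpp, n);
}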