Writing an optimized SGEMM in pure Rust?

  1. I know the correct answer is to use CUDA / Intel MKL.

  2. I am curious whether there are any good tutorials on writing an optimized SGEMM. (Preferably one that walks through a series of optimizations.)

  3. It is fine if the tutorial is not written in Rust, as long as it can be converted to pure Rust (if it's C/C++, it's probably fine; if it involves hand-written SSE4 assembly instructions, it's probably too far).

  4. Single-threaded performance is fine; I don't need multithreading.

Edit: the above said "stemms" when I meant "SGEMM", the f32 matrix * matrix multiplication function.

There is @bluss's matrixmultiply crate.

I also tried it myself, but for a very specialized case:
simd_linalg/block.rs at master · s3bk/simd_linalg · GitHub


C = A * B

Are the key ideas:

  1. Find the largest k such that three k * k f32 blocks (3 * k * k * 4 bytes) fit in L1 cache and k is a power of 2.

  2. Divide C, A, B into squares of size k * k

  3. Do SGEMM on the k*k blocks?
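The three steps above might look something like this as a sketch (the block size 32 is an assumption: 3 * 32 * 32 * 4 bytes = 12 KiB, which fits in a typical 32 KiB L1; real BLIS-style code uses rectangular blocks tuned per cache level, not cubes):

```rust
const BK: usize = 32; // 3 blocks of 32 * 32 f32s = 12 KiB, fits a 32 KiB L1 (assumed)

/// C += A * B for n x n row-major f32 matrices; n must be a multiple of BK.
fn sgemm_blocked(n: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    assert!(n % BK == 0);
    for ib in (0..n).step_by(BK) {
        for jb in (0..n).step_by(BK) {
            for kb in (0..n).step_by(BK) {
                // Multiply one BK x BK block of A by one BK x BK block of B,
                // accumulating into the corresponding block of C.
                for i in ib..ib + BK {
                    for k in kb..kb + BK {
                        let aik = a[i * n + k];
                        for j in jb..jb + BK {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

Note the i-k-j loop order inside the block: it keeps the innermost loop streaming contiguously over a row of B and a row of C, which is friendlier to both the cache and the auto-vectorizer than the textbook i-j-k order.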

Thank you!

FWIW, I copied all the ideas from BLIS and the papers linked in this old blog post: gemm: a rabbit hole

Maybe you can find newer papers or other approaches to try.

@bluss: Looking at your detailed blog posts, is the following correct:

  1. decompose A * B into
  2. (4 x k) * (k x 4) panel products, then decompose those into
  3. 4 x 4 blocks, and
  4. use either auto-vectorization (according to the blog) or packed_simd (according to the source code)?

Is the last trick to copy both 4x4 matrices to the stack before multiplying?
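As I understand that last step, it might be sketched like this: accumulate the 4x4 result in a fixed-size stack array so the compiler can keep it in registers and auto-vectorize. This is an illustration under those assumptions, not matrixmultiply's actual kernel (which packs panels and handles strides):

```rust
/// C += A * B for one 4x4 tile, where `a_panel` holds 4 rows of length k
/// (row-major) and `b_panel` holds k rows of length 4 (row-major).
fn kernel_4x4(k: usize, a_panel: &[f32], b_panel: &[f32], c: &mut [[f32; 4]; 4]) {
    // Fixed-size accumulator on the stack: the compiler can promote it to
    // registers and vectorize the j-loop.
    let mut acc = [[0.0f32; 4]; 4];
    for p in 0..k {
        // One row of B's packed panel.
        let b_row = [b_panel[4 * p], b_panel[4 * p + 1], b_panel[4 * p + 2], b_panel[4 * p + 3]];
        for i in 0..4 {
            let aip = a_panel[i * k + p];
            for j in 0..4 {
                acc[i][j] += aip * b_row[j];
            }
        }
    }
    // Write back once at the end.
    for i in 0..4 {
        for j in 0..4 {
            c[i][j] += acc[i][j];
        }
    }
}
```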

Is BLAS GEMM Benchmarks accurate?

The docs claim 62 GFLOP/s. My current code, using blocked matrices, is only hitting 1.25 GFLOP/s on similar hardware.
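For comparing numbers like these, a rough single-run measurement can be done by counting 2 * n^3 floating-point operations per n x n multiply (one mul and one add per inner-product term). A minimal sketch, with the caveat that serious benchmarking needs warm-up runs and multiple iterations:

```rust
use std::time::Instant;

/// Rough GFLOP/s estimate for one run of an n x n matrix multiply `f`.
fn measure_gflops(n: usize, mut f: impl FnMut()) -> f64 {
    let start = Instant::now();
    f();
    let secs = start.elapsed().as_secs_f64();
    // 2 * n^3 flops per dense n x n multiply.
    2.0 * (n as f64).powi(3) / secs / 1e9
}
```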