
I know the correct answer is to use Cuda / Intel MKL.

I am curious if there is any good tutorials on writing an optimized sgemm. (Preferably one that walks through a number of optimizations.)

It is fine is if the tutorial is not written in Rust, as long as it can convert to pure Rust (if it's C/C++, it's probably fine. If it involves sse4 assembly instructions by hand, it's probably too far.)

Single threaded only performance is fine.
Edit: above said "stemms" when I meant "SGEMM", the f32 matrix * matrix function.
There is @bluss's matrixmultiply:
I also tried myself, but on a very specialized case.
simd_linalg/block.rs at master · s3bk/simd_linalg · GitHub
C = A * B
Are the key ideas:

Find the largest k such that 3 * k * k <= L1 cache and k is a power of 2.

Divide C, A, B into squares of size k * k

Do sgemm on k*k sized squares ?
Thank you!
Fwiw, I copied all ideas from BLIS, papers linked in this old blog post gemm: a rabbit hole
Maybe you can find newer papers or other approaches that you want to try if you want.
@bluss: Looking at your detailed blog posts, is the following correct:
 decompose A * B into
 (4xk) * (k*4), and decompose that into
 4x4 and
 use either auto vectorization (according to blog) or packed_simd (according to source code) ?
Is the last trick to copy both 4x4 matrices to the stack before multiply ?
Is BLAS GEMM Benchmarks accurate ?
Docs claims 62 gflops. My current code, using blocked matrices, is only hitting 1.25 gflop on similar hardware.