
I know the correct answer is to use Cuda / Intel MKL.

I am curious if there is any good tutorials on writing an optimized sgemm. (Preferably one that walks through a number of optimizations.)

It is fine is if the tutorial is not written in Rust, as long as it can convert to pure Rust (if it's C/C++, it's probably fine. If it involves sse4 assembly instructions by hand, it's probably too far.)

Single threaded only performance is fine.
Edit: above said "stemms" when I meant "SGEMM", the f32 matrix * matrix function.
There is @bluss's matrixmultiply:
I also tried myself, but on a very specialized case.
simd_linalg/block.rs at master · s3bk/simd_linalg · GitHub
C = A * B
Are the key ideas:

Find the largest k such that 3 * k * k <= L1 cache and k is a power of 2.

Divide C, A, B into squares of size k * k

Do sgemm on k*k sized squares ?
Thank you!
Fwiw, I copied all ideas from BLIS, papers linked in this old blog post gemm: a rabbit hole
Maybe you can find newer papers or other approaches that you want to try if you want.
@bluss: Looking at your detailed blog posts, is the following correct:
 decompose A * B into
 (4xk) * (k*4), and decompose that into
 4x4 and
 use either auto vectorization (according to blog) or packed_simd (according to source code) ?
Is the last trick to copy both 4x4 matrices to the stack before multiply ?
Is BLAS GEMM Benchmarks accurate ?
Docs claims 62 gflops. My current code, using blocked matrices, is only hitting 1.25 gflop on similar hardware.
This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.