I wanted a BLAS library that is easy and fast to compile, and I also wanted to learn how to implement math functions on a computer, so I started this library.
This library is very new; I just finished implementing sgemm. It currently has only the single-precision level-1 routines, sgemv, and sgemm. Right now sgemm is slightly faster than numpy (OpenBLAS) on my CPU (Ryzen 2200G). All functions use AVX and FMA, so only CPUs with those extensions are supported. I would like to support more CPUs and eventually have implementations of all BLAS functions.
I am a 3rd-year bachelor's degree student and new to BLAS, so I would welcome criticism and help from knowledgeable people in building this library. Adding more tests and benchmarks, or even just running the benchmarks on your CPU, would also help.
The sgemm implementation is heavily inspired by https://github.com/flame/how-to-optimize-gemm/wiki