A row-major implementation of sgemm.c that uses compile-time metaprogramming for a 60% reduction in lines of code!
The kernel is templated, and g++-14 or newer does not spill registers when it uses C-style arrays.
Gets ~97% of the speed of Intel MKL on certain shapes (single-threaded); see TODO.
- Tune block sizes for Alder Lake
- Parallelize with OpenMP
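
For illustration, a templated row-major kernel with a C-style accumulator array might look roughly like the sketch below. This is not the project's actual code: the names `micro_kernel` and `sgemm_rowmajor`, the tile parameters `MR` and `NR`, and the assumption that `M` and `N` are multiples of the tile size are all hypothetical, and packing, cache blocking, and SIMD intrinsics are left out.

```cpp
#include <cstddef>

// Hypothetical register-tile micro-kernel: C[MR x NR] += A[MR x K] * B[K x NR].
// All matrices are row-major; lda/ldb/ldc are row strides in elements.
template <int MR, int NR>
void micro_kernel(std::size_t K, const float* A, std::size_t lda,
                  const float* B, std::size_t ldb,
                  float* C, std::size_t ldc) {
    // C-style array whose bounds are compile-time constants, so the
    // compiler can fully unroll the i/j loops and keep the tile in registers.
    float acc[MR][NR] = {};
    for (std::size_t k = 0; k < K; ++k)
        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j)
                acc[i][j] += A[i * lda + k] * B[k * ldb + j];
    // Write the accumulated tile back into C.
    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j)
            C[i * ldc + j] += acc[i][j];
}

// Hypothetical driver: assumes M % MR == 0 and N % NR == 0 for brevity.
// A is M x K, B is K x N, C is M x N, all densely packed row-major.
template <int MR, int NR>
void sgemm_rowmajor(std::size_t M, std::size_t N, std::size_t K,
                    const float* A, const float* B, float* C) {
    for (std::size_t i = 0; i < M; i += MR)
        for (std::size_t j = 0; j < N; j += NR)
            micro_kernel<MR, NR>(K, A + i * K, K, B + j, N, C + i * N + j, N);
}
```

A call such as `sgemm_rowmajor<2, 4>(M, N, K, A, B, C)` instantiates one kernel per tile shape, which is where the line-count saving of a templated kernel would come from: different tile sizes share a single source definition.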