Fast FP32 matmul in C++

A row-major implementation of sgemm.c, using compile-time metaprogramming for a 60% reduction in lines of code!

The kernel is templated, and g++ 14 or newer does not spill registers when it uses C-style arrays.
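To illustrate the idea of a templated kernel with C-style arrays, here is a minimal sketch. It is not the code in this repo: the names `micro_kernel` and `sgemm_tiled`, the 4x8 tile size, and the assumption that the matrix dimensions divide evenly by the tile size are all hypothetical. It also uses plain scalar loops instead of SIMD intrinsics, leaving vectorization to the compiler.

```cpp
#include <cstddef>

// Hypothetical register-tiled micro-kernel (illustrative names and sizes).
// Accumulates an MR x NR block of C += A * B, all matrices row-major.
template <std::size_t MR, std::size_t NR>
void micro_kernel(const float* a, const float* b, float* c, std::size_t k,
                  std::size_t lda, std::size_t ldb, std::size_t ldc) {
    // C-style array with compile-time bounds: the loops below have fixed
    // trip counts, so the compiler can fully unroll them and may keep
    // the accumulators in registers.
    float acc[MR][NR] = {};

    for (std::size_t p = 0; p < k; ++p)
        for (std::size_t i = 0; i < MR; ++i)
            for (std::size_t j = 0; j < NR; ++j)
                acc[i][j] += a[i * lda + p] * b[p * ldb + j];

    // Write the accumulated tile back to C.
    for (std::size_t i = 0; i < MR; ++i)
        for (std::size_t j = 0; j < NR; ++j)
            c[i * ldc + j] += acc[i][j];
}

// Driver over tiles. Assumption: m % MR == 0 and n % NR == 0
// (no edge handling in this sketch).
template <std::size_t MR = 4, std::size_t NR = 8>
void sgemm_tiled(const float* a, const float* b, float* c,
                 std::size_t m, std::size_t n, std::size_t k) {
    for (std::size_t i = 0; i < m; i += MR)
        for (std::size_t j = 0; j < n; j += NR)
            micro_kernel<MR, NR>(a + i * k, b + j, c + i * n + j,
                                 k, /*lda=*/k, /*ldb=*/n, /*ldc=*/n);
}
```

Because the tile shape is a template parameter, trying a different register blocking is a one-line change at the call site, for example `sgemm_tiled<6, 16>(a, b, c, m, n, k)`, without duplicating the kernel body.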

Gets ~97% of the speed of Intel MKL on certain shapes (single-threaded); see TODO.

TODO
