Version: v1.2.3 | Status: Active | Last Updated: March 2026
Provides a tiled (cache-efficient) matrix multiplication kernel in pure Python/NumPy. Implements BLAS-style blocked matmul for improved cache locality, plus batched multiplication and FLOP counting.
- Tiled matrix multiplication with configurable block size for cache efficiency
- Batched matrix multiplication support for 3D tensor inputs
- FLOP counting via matmul_flops(M, K, N) = 2*M*K*N
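To make the blocking idea concrete, here is a minimal sketch of how a tiled multiply and the FLOP formula could be written in NumPy. The names `tiled_matmul_sketch` and `matmul_flops_sketch` are hypothetical and chosen for illustration; this is not the module's actual implementation, which may differ in tiling order, dtype handling, and validation.

```python
import numpy as np


def tiled_matmul_sketch(A, B, tile_size=32):
    """Illustrative blocked matmul (hypothetical, not the codomyrmex code)."""
    M, K = A.shape
    K2, N = B.shape
    if K != K2:
        raise ValueError(f"inner dimensions differ: {K} vs {K2}")

    C = np.zeros((M, N), dtype=np.result_type(A, B))
    # Walk over output tiles (i, j) and accumulate partial products over k.
    # NumPy clamps slices at the array edge, so ragged final tiles are handled.
    for i in range(0, M, tile_size):
        for j in range(0, N, tile_size):
            for k in range(0, K, tile_size):
                C[i:i + tile_size, j:j + tile_size] += (
                    A[i:i + tile_size, k:k + tile_size]
                    @ B[k:k + tile_size, j:j + tile_size]
                )
    return C


def matmul_flops_sketch(M, K, N):
    """One multiply and one add per (m, k, n) triple gives 2*M*K*N."""
    return 2 * M * K * N
```

Each output tile is built from small sub-blocks of A and B, which is what lets a blocked kernel reuse data while it is still in cache. Shapes that are not multiples of `tile_size` work here because out-of-range slice ends are truncated.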
import numpy as np
from codomyrmex.matmul_kernel import tiled_matmul, batched_matmul, matmul_flops

A = np.random.rand(128, 64)   # M x K
B = np.random.rand(64, 256)   # K x N
C = tiled_matmul(A, B, tile_size=32)
flops = matmul_flops(M=128, K=64, N=256)  # 2 * 128 * 64 * 256 = 4194304

Exports: tiled_matmul, batched_matmul, matmul_flops
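The usage above does not exercise batched multiplication. As a rough picture of what a batched multiply over 3D inputs involves, the sketch below multiplies matching slices along a leading batch axis. The function name `batched_matmul_sketch` and the assumed shapes `(batch, M, K)` and `(batch, K, N)` are illustrative assumptions; the real `batched_matmul` signature and accepted shapes are not documented here.

```python
import numpy as np


def batched_matmul_sketch(A, B):
    """Illustrative batched matmul (hypothetical, not the codomyrmex code).

    Assumes A has shape (batch, M, K) and B has shape (batch, K, N).
    """
    if A.ndim != 3 or B.ndim != 3:
        raise ValueError("expected 3D inputs")
    if A.shape[0] != B.shape[0]:
        raise ValueError("batch sizes differ")
    if A.shape[2] != B.shape[1]:
        raise ValueError("inner dimensions differ")

    # Multiply each pair of 2D slices, then stack along a new batch axis.
    return np.stack([A[b] @ B[b] for b in range(A.shape[0])])
```

For inputs of these shapes, NumPy's own `np.matmul(A, B)` gives the same result in a single call, so it is a convenient reference when checking a batched kernel.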