This repository contains the source code for our optimized Sparse Matrix-Sparse Matrix Multiplication (SpMSpM) kernels, developed as part of a high-performance computing project. We tackled one of the most computationally challenging problems in machine learning and graph analytics: efficiently multiplying two sparse matrices where the second matrix is "thin."
Our team employed a systematic, data-driven approach to performance optimization, developing and refining a suite of GPU kernels that achieved significant performance improvements compared to baseline implementations. Through iterative development across multiple kernel versions, we identified key optimization strategies that deliver substantial speedups.
Our best-performing implementation, kernel5.cu, represents the culmination of our optimization efforts. The performance results demonstrate clear improvements across all test datasets:
| Kernel | Dataset 1 | Dataset 2 | Dataset 3 | Dataset 4 | Dataset 5 | Dataset 6 |
|---|---|---|---|---|---|---|
| kernelCPU0.cu | 8.945 ms | 33.307 ms | 66.646 ms | 99.553 ms | 270.835 ms | 914.210 ms |
| kernel5.cu | 0.150 ms | 0.424 ms | 0.655 ms | 0.861 ms | 2.18 ms | 4.79 ms |
Our optimized GPU kernel achieved up to 190x speedup 🔥 on the largest dataset (914.210 ms / 4.79 ms ≈ 191x on Dataset 6) compared to our initial CPU implementation, demonstrating the effectiveness of our GPU optimization strategies.
Our performance gains stem from implementing three key optimization techniques in kernel5.cu:
- Shared-Memory Tiling: We implemented a per-block shared-memory buffer that accumulates partial results for an entire output row before writing to global memory. This approach loads each DRAM element exactly once per block and replaces hundreds of fine-grained global writes with just two global operations per non-zero output, significantly reducing global memory traffic and pipeline stalls.
- Privatization: Instead of having every thread contend on a global non-zero counter, we use a single shared-memory counter per block to track non-zero entries. All atomic operations occur in low-latency shared memory, with only one global atomic operation per block. This approach dramatically reduces contention and avoids costly pipeline stalls from repeated global atomics.
- Memory Coalescing & Occupancy Tuning: Our kernels ensure consecutive threads access consecutive memory locations using strided loops, achieving fully coalesced global memory transactions. We optimized the block size (BLOCK_DIM = 64) through systematic profiling to maximize GPU resource utilization and overall throughput. A simplified kernel sketch combining all three techniques is shown after this list.
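The sketch below is a minimal illustration of how these three techniques can be combined in a row-per-block SpMSpM kernel; it is not the actual kernel5.cu, and every identifier (aRowPtr, bColIdx, outNnz, MAX_B_COLS, and so on) is an assumed name rather than the project's real interface. It assumes CSR inputs and a COO output. Each block owns one output row: partial products are accumulated in a shared-memory tile, the non-zero count is privatized in a shared counter, and a single global atomic per block reserves output space before a coalesced, strided write-out.

```cuda
// Simplified sketch only -- assumed CSR inputs (A: M x K, B: K x N, N small) and COO output.
#define BLOCK_DIM  64     // block size reported from profiling
#define MAX_B_COLS 1024   // assumed upper bound on B's column count (B is "thin")

__global__ void spmspm_row_kernel(
        const int* __restrict__ aRowPtr, const int* __restrict__ aColIdx,
        const float* __restrict__ aVal,
        const int* __restrict__ bRowPtr, const int* __restrict__ bColIdx,
        const float* __restrict__ bVal,
        int numBColumns,
        int* outRow, int* outCol, float* outVal, int* outNnz) {

    __shared__ float acc[MAX_B_COLS];    // tile: dense accumulator for one output row
    __shared__ int   sCols[MAX_B_COLS];  // staging buffer for the row's non-zero columns
    __shared__ float sVals[MAX_B_COLS];  // staging buffer for the row's non-zero values
    __shared__ int   sCount;             // privatized (per-block) non-zero counter
    __shared__ int   sBase;              // base offset reserved in the global output

    int row = blockIdx.x;                // one block per output row

    // Coalesced, strided initialization of the shared accumulator.
    for (int c = threadIdx.x; c < numBColumns; c += BLOCK_DIM) acc[c] = 0.0f;
    if (threadIdx.x == 0) sCount = 0;
    __syncthreads();

    // Each thread takes a strided slice of A's non-zeros in this row.
    for (int i = aRowPtr[row] + threadIdx.x; i < aRowPtr[row + 1]; i += BLOCK_DIM) {
        int   k = aColIdx[i];
        float a = aVal[i];
        // Scatter a * B(k, :) into the shared accumulator; atomics stay in shared memory.
        for (int j = bRowPtr[k]; j < bRowPtr[k + 1]; ++j)
            atomicAdd(&acc[bColIdx[j]], a * bVal[j]);
    }
    __syncthreads();

    // Compact the accumulator into the staging buffers, counting non-zeros
    // with cheap shared-memory atomics on the privatized counter.
    for (int c = threadIdx.x; c < numBColumns; c += BLOCK_DIM) {
        float v = acc[c];
        if (v != 0.0f) {
            int slot = atomicAdd(&sCount, 1);
            sCols[slot] = c;
            sVals[slot] = v;
        }
    }
    __syncthreads();

    // Exactly one global atomic per block reserves space in the output.
    if (threadIdx.x == 0) sBase = atomicAdd(outNnz, sCount);
    __syncthreads();

    // Coalesced, strided copy of the staged results to global memory.
    for (int s = threadIdx.x; s < sCount; s += BLOCK_DIM) {
        outRow[sBase + s] = row;
        outCol[sBase + s] = sCols[s];
        outVal[sBase + s] = sVals[s];
    }
}
```

A launch of the form spmspm_row_kernel<<<numARows, BLOCK_DIM>>>(...) would assign one block per output row; the dense shared accumulator is viable precisely because the second matrix is thin, so an entire output row fits in shared memory.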
To compile:

```
make
```

To run:

```
./spmspm [flags]
```

Optional flags:

```
-d <dataset>   the prefix of the files containing the two matrices to be multiplied
                 the first matrix will be read from <dataset>-matrix1.txt
                 the second matrix will be read from <dataset>-matrix2.txt
                 the reference output matrix will be read from <dataset>-matrix3.txt
-a             run CPU version 0
-b             run CPU version 1
-0             run GPU version 0
-1             run GPU version 1
-2             run GPU version 2
-3             run GPU version 3
-4             run GPU version 4
-5             run GPU version 5
-6             run GPU version 6
-7             run GPU version 7
-8             run GPU version 8
-9             run GPU version 9
                 NOTE: it is okay to specify multiple versions in the same run
-v             perform exact verification
-w             write the result to a file
                 the CPU result will be written to <dataset>-matrix3-cpu.txt
                 the GPU result will be written to <dataset>-matrix3-gpu.txt
```
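For example, assuming a hypothetical dataset prefix data/test1, the following invocation runs GPU version 5 on that dataset, verifies the result exactly, and writes the GPU output to data/test1-matrix3-gpu.txt:

```
./spmspm -d data/test1 -5 -v -w
```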
Our kernels have practical applications in pruned and quantized dense layers for machine learning models. When neural network layers are compressed using techniques that reduce density below approximately 10%, our shared-memory, row-privatized kernel delivers order-of-magnitude speedups over dense GEMM operations while using significantly less memory.
This makes our implementation particularly valuable for deploying efficient models in resource-constrained environments, such as mobile devices and edge computing platforms where both performance and memory efficiency are critical.
Our roadmap includes several promising directions for further optimization:
- Industry Benchmarking: Compare our hand-written kernels against NVIDIA's production-grade cusparseSpGEMM to quantify performance differences and identify remaining optimization opportunities.
- Auto-Tuning Framework: Develop a lightweight auto-tuner that dynamically selects optimal launch parameters (block sizes, shared-memory tile widths) based on input matrix characteristics.
- Mixed-Precision Support: Prototype variants that leverage Tensor Cores with FP16/BF16 inputs while maintaining numerical accuracy, potentially unlocking additional performance gains.
👥 Team Members: Artur Baboudjian, Jad Shaker, Mohammad Zbeeb