
SpMSpM: High-Performance Sparse Matrix Multiplication on GPUs


💻 Performance-Driven GPU Computing

This repository contains the source code for our optimized Sparse Matrix-Sparse Matrix Multiplication (SpMSpM) kernels, developed as part of a high-performance computing project. We tackled one of the most computationally challenging problems in machine learning and graph analytics: efficiently multiplying two sparse matrices where the second matrix is "thin."

Our team took a systematic, data-driven approach to performance optimization, iterating across multiple kernel versions to identify the strategies that deliver the largest speedups over our baseline implementations.
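To make the problem concrete, here is a minimal Gustavson-style CPU sketch of what SpMSpM computes, written in plain C++ and assuming CSR storage. The struct layout and function name are illustrative only; the repository's actual data format and baselines may differ:

```cpp
#include <vector>

// CSR representation (assumed layout; the repo's own format may differ).
struct CSR {
    int rows, cols;
    std::vector<int> rowPtr;   // size rows + 1
    std::vector<int> colIdx;   // size nnz
    std::vector<float> val;    // size nnz
};

// Gustavson-style row-by-row SpMSpM: C = A * B.
// Each row of A scatters scaled rows of B into a dense accumulator;
// the touched entries are then compacted into C's CSR arrays.
CSR spmspm_cpu(const CSR& A, const CSR& B) {
    CSR C{A.rows, B.cols, {0}, {}, {}};
    std::vector<float> acc(B.cols, 0.0f);
    std::vector<bool> touched(B.cols, false);
    for (int i = 0; i < A.rows; ++i) {
        std::vector<int> marks;
        for (int ja = A.rowPtr[i]; ja < A.rowPtr[i + 1]; ++ja) {
            int k = A.colIdx[ja];       // A(i,k) pairs with row k of B
            float a = A.val[ja];
            for (int jb = B.rowPtr[k]; jb < B.rowPtr[k + 1]; ++jb) {
                int c = B.colIdx[jb];
                if (!touched[c]) { touched[c] = true; marks.push_back(c); }
                acc[c] += a * B.val[jb];
            }
        }
        for (int c : marks) {           // compact this row's nonzeros
            C.colIdx.push_back(c);
            C.val.push_back(acc[c]);
            acc[c] = 0.0f;
            touched[c] = false;
        }
        C.rowPtr.push_back((int)C.colIdx.size());
    }
    return C;
}
```

Because the second matrix is thin, the dense accumulator stays small, which is the same property the GPU kernels exploit with shared memory.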


📊 Key Results & Performance Metrics

Our best-performing implementation, kernel5.cu, represents the culmination of our optimization efforts. The performance results demonstrate clear improvements across all test datasets:

| Kernel | Dataset 1 | Dataset 2 | Dataset 3 | Dataset 4 | Dataset 5 | Dataset 6 |
|---|---|---|---|---|---|---|
| kernelCPU0.cu | 8.945 ms | 33.307 ms | 66.646 ms | 99.553 ms | 270.835 ms | 914.210 ms |
| kernel5.cu | 0.150 ms | 0.424 ms | 0.655 ms | 0.861 ms | 2.18 ms | 4.79 ms |

Our optimized GPU kernel achieved up to 190x speedup 🔥 on the largest dataset compared to our initial CPU implementation, demonstrating the effectiveness of our GPU optimization strategies.


⚡ Technical Innovations

Our performance gains stem from implementing three key optimization techniques in kernel5.cu:

  • Shared-Memory Tiling: We implemented a per-block shared memory buffer that accumulates partial results for an entire output row before writing to global memory. This approach loads each DRAM element exactly once per block and replaces hundreds of fine-grained global writes with just two global operations per non-zero output, significantly reducing global memory traffic and pipeline stalls.

  • Privatization: Instead of having every thread contend on a global non-zero counter, we use a single shared-memory counter per block to track non-zero entries. All atomic operations occur in low-latency shared memory, with only one global atomic operation per block. This approach dramatically reduces contention and avoids costly pipeline stalls from repeated global atomics.

  • Memory Coalescing & Occupancy Tuning: Our kernels ensure consecutive threads access consecutive memory locations using strided loops, achieving fully coalesced global memory transactions. We optimized block sizes (BLOCK_DIM = 64) through systematic profiling to maximize GPU resource utilization and overall throughput.
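A condensed sketch of how these three techniques can fit together in a single kernel (illustrative names and a simplified COO-style output, assuming the thin matrix B fits one output row in shared memory; this is not the actual kernel5.cu source):

```cuda
#define BLOCK_DIM 64
#define MAX_B_COLS 128  // assumption: B is thin enough for one row of C in shared memory

// One block per output row; all identifiers here are hypothetical.
__global__ void spmspm_row_kernel(
    const int* aRowPtr, const int* aColIdx, const float* aVal,
    const int* bRowPtr, const int* bColIdx, const float* bVal,
    int* cRowIdx, int* cColIdx, float* cVal,  // COO-style output arrays
    int* cNnz, int numBCols)
{
    __shared__ float acc[MAX_B_COLS];  // shared-memory tile: one output row
    __shared__ int nnzLocal;           // privatized nonzero counter
    __shared__ int nnzBase;            // this block's slot in the global output

    int row = blockIdx.x;

    // Coalesced, strided initialization of the accumulator.
    for (int c = threadIdx.x; c < numBCols; c += BLOCK_DIM) acc[c] = 0.0f;
    if (threadIdx.x == 0) nnzLocal = 0;
    __syncthreads();

    // Strided loop over A's row: consecutive threads read consecutive
    // aColIdx/aVal entries, giving fully coalesced global loads.
    for (int j = aRowPtr[row] + threadIdx.x; j < aRowPtr[row + 1]; j += BLOCK_DIM) {
        int k = aColIdx[j];
        float a = aVal[j];
        for (int t = bRowPtr[k]; t < bRowPtr[k + 1]; ++t)
            atomicAdd(&acc[bColIdx[t]], a * bVal[t]);  // low-latency shared atomics
    }
    __syncthreads();

    // Count nonzeros with shared-memory atomics, then reserve space in the
    // global output with a single global atomic per block.
    int myCount = 0;
    for (int c = threadIdx.x; c < numBCols; c += BLOCK_DIM)
        if (acc[c] != 0.0f) ++myCount;
    int myOffset = atomicAdd(&nnzLocal, myCount);
    __syncthreads();
    if (threadIdx.x == 0) nnzBase = atomicAdd(cNnz, nnzLocal);
    __syncthreads();

    // Each nonzero leaves shared memory exactly once.
    int out = nnzBase + myOffset;
    for (int c = threadIdx.x; c < numBCols; c += BLOCK_DIM) {
        if (acc[c] != 0.0f) {
            cRowIdx[out] = row;
            cColIdx[out] = c;
            cVal[out] = acc[c];
            ++out;
        }
    }
}
```

The per-thread `atomicAdd` on `nnzLocal` hands each thread a contiguous slice of the block's reserved output range, so compaction needs no extra synchronization beyond the barriers shown.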


🛠️ Instructions

To compile:

```shell
make
```

To run:

```shell
./spmspm [flags]
```

Optional flags:

```
  -d <dataset>      the prefix of the files containing the two matrices to be multiplied
                        the first matrix will be read from <dataset>-matrix1.txt
                        the second matrix will be read from <dataset>-matrix2.txt
                        the reference output matrix will be read from <dataset>-matrix3.txt
  -a                run CPU version 0
  -b                run CPU version 1
  -0                run GPU version 0
  -1                run GPU version 1
  -2                run GPU version 2
  -3                run GPU version 3
  -4                run GPU version 4
  -5                run GPU version 5
  -6                run GPU version 6
  -7                run GPU version 7
  -8                run GPU version 8
  -9                run GPU version 9
                    NOTE: It is okay to specify multiple versions in the same run
  -v                perform exact verification
  -w                write result to a file
                        the CPU result will be written to <dataset>-matrix3-cpu.txt
                        the GPU result will be written to <dataset>-matrix3-gpu.txt
```
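For example, to build and then run the fastest GPU version with exact verification, writing the result to a file (the flags are from the list above; the dataset prefix `data/test1` is purely illustrative):

```shell
make
./spmspm -d data/test1 -5 -v -w
```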

🌟 Real-World Applications

Our kernels have practical applications in pruned and quantized dense layers for machine learning models. When neural network layers are compressed using techniques that reduce density below approximately 10%, our shared-memory, row-privatized kernel delivers order-of-magnitude speedups over dense GEMM operations while using significantly less memory.

This makes our implementation particularly valuable for deploying efficient models in resource-constrained environments, such as mobile devices and edge computing platforms where both performance and memory efficiency are critical.


🎯 Future Development Plans

Our roadmap includes several promising directions for further optimization:

  • Industry Benchmarking: Compare our hand-written kernels against NVIDIA's production-grade cusparseSpGEMM to quantify performance differences and identify remaining optimization opportunities.

  • Auto-Tuning Framework: Develop a lightweight auto-tuner that dynamically selects optimal launch parameters (block sizes, shared-memory tile widths) based on input matrix characteristics.

  • Mixed Precision Support: Prototype variants that leverage Tensor Cores with FP16/BF16 inputs while maintaining numerical accuracy, potentially unlocking additional performance gains.


👥 Team Members: Artur Baboudjian, Jad Shaker, Mohammad Zbeeb
