[DEPRECATED] Moved to ROCm/rocm-systems repo
Online CUDA Occupancy Calculator
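The occupancy figure such calculators report can be reproduced in a few lines: theoretical occupancy is the ratio of active warps per SM to the hardware maximum, where the block count per SM is capped by warp slots, registers, and shared memory. The sketch below is a minimal model; all SM limit constants are assumptions loosely modeled on an Ampere-class GPU, not values taken from the calculator above:

```python
# Assumed SM resource limits (hypothetical Ampere-like values, not from the tool).
MAX_WARPS_PER_SM = 48
MAX_BLOCKS_PER_SM = 16
REGS_PER_SM = 65536
SMEM_PER_SM = 100 * 1024  # bytes
WARP_SIZE = 32

def occupancy(threads_per_block, regs_per_thread, smem_per_block):
    """Theoretical occupancy = active warps per SM / max warps per SM."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceil division
    # Each resource independently caps how many blocks fit on one SM.
    limit_warps = MAX_WARPS_PER_SM // warps_per_block
    limit_regs = REGS_PER_SM // (regs_per_thread * warps_per_block * WARP_SIZE)
    limit_smem = SMEM_PER_SM // smem_per_block if smem_per_block else MAX_BLOCKS_PER_SM
    blocks = min(MAX_BLOCKS_PER_SM, limit_warps, limit_regs, limit_smem)
    return blocks * warps_per_block / MAX_WARPS_PER_SM
```

For example, at 256 threads per block the warp-slot limit allows 6 blocks (48 // 8), so a kernel using 32 registers per thread reaches full occupancy, while 64 registers per thread caps it at 4 blocks. Real GPUs additionally round register and shared-memory allocations to hardware granularities, which this sketch omits.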
(Spring 2017) Assignment 2: GPU Executor
Runs a single CUDA/OpenCL kernel, taking its source from a file and arguments from the command-line
GPU Drano: static analysis for GPU programs.
The official implementation for paper "AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation"
Prototype for a SPIR-V assembler and disassembler. It provides a composable Java interface for generating SPIR-V code at runtime.
Open source skill library for AI coding agents to write, optimize, and debug high performance compute kernels across CUDA, Triton, and quantized workloads.
Fused Triton kernels for Transformer inference: RMSNorm+RoPE, Gated MLP, and FP8 GEMM.
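As a reference for what kernels like these fuse, here is an unfused pure-Python sketch of RMSNorm and rotary position embedding (RoPE). The `eps` and `base` defaults are common conventions, not details from the repo; a fused Triton kernel would compute both in one pass over the data:

```python
import math

def rmsnorm(x, weight, eps=1e-6):
    # y_i = x_i / sqrt(mean(x^2) + eps) * w_i
    ms = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(ms + eps)
    return [v * inv * w for v, w in zip(x, weight)]

def rope(x, pos, base=10000.0):
    # Rotate each pair (x[2i], x[2i+1]) by angle pos * base^(-2i/d).
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out
```

Fusing the two saves a round trip to global memory: the normalized values are rotated while still in registers instead of being written out and re-read by a second kernel.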
Real-time NVIDIA GPU command capture, decoding, and visualization
A self-hosted low-level functional-style programming language 🌀
High-performance GPU-accelerated C# scripting for Rhino Grasshopper, powered by ILGPU
Noeris: autonomous kernel-fusion discovery and Triton autotuning for LLM kernels, with deeper fusion of Gemma layers (A100/H100 wins).
Histogram-aware CUDA scheduling prototype for TensorRT-LLM-oriented MoE token routing and aligned subpaths.
CUDA fast paths for MoE dispatch and weighted combine with TensorRT-LLM-oriented trace replay benchmarks
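The dispatch-and-combine pattern these MoE kernels accelerate can be sketched on the host side. The pure-Python reference below shows top-k gating, dispatch to experts, and weighted combine; the softmax gate and `k=2` default are common MoE conventions, not details taken from either repo:

```python
import math

def top_k_route(logits, k):
    # Softmax over expert logits, keep the k largest, renormalize their weights.
    m = max(logits)
    probs = [math.exp(x - m) for x in logits]
    z = sum(probs)
    probs = [p / z for p in probs]
    idx = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    w = sum(probs[i] for i in idx)
    return [(i, probs[i] / w) for i in idx]

def moe_forward(token, logits, experts, k=2):
    # Dispatch the token to its top-k experts, then weighted-combine the outputs.
    out = [0.0] * len(token)
    for i, weight in top_k_route(logits, k):
        y = experts[i](token)
        out = [o + weight * yj for o, yj in zip(out, y)]
    return out
```

On a GPU the expensive parts are the scatter of tokens to per-expert buffers (dispatch) and the gather-with-weights on the way back (combine), which is exactly where a fused CUDA fast path pays off.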
22 progressive Triton GPU kernels, from elementwise ops to Flash Attention v2, featuring correctness tests and PyTorch throughput/TFLOPS benchmarks.
16-step CUDA optimization of FlashAttention-2 achieving 99.2% of official performance on A100 — Ampere architecture
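The core trick behind FlashAttention-style kernels, the online softmax, can be demonstrated without a GPU. The pure-Python sketch below streams over keys/values one at a time while maintaining a running max, normalizer, and rescaled accumulator, and matches a naive two-pass softmax attention. It is a single-query simplification without the 1/sqrt(d) scaling or tiling, not the implementation from either repo above:

```python
import math

def naive_attention(q, ks, vs):
    # Two-pass reference: materialize all scores, softmax, then combine values.
    scores = [sum(a * b for a, b in zip(q, k)) for k in ks]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    d = len(vs[0])
    return [sum(w[i] * vs[i][j] for i in range(len(vs))) / z for j in range(d)]

def flash_attention(q, ks, vs):
    # One pass: running max m, running normalizer l, rescaled accumulator o.
    m = -math.inf
    l = 0.0
    o = [0.0] * len(vs[0])
    for k, v in zip(ks, vs):
        s = sum(a * b for a, b in zip(q, k))
        m_new = max(m, s)
        scale = math.exp(m - m_new)  # exp(-inf) == 0.0 on the first step
        p = math.exp(s - m_new)
        l = l * scale + p
        o = [oj * scale + p * vj for oj, vj in zip(o, v)]
        m = m_new
    return [oj / l for oj in o]
```

Because the accumulator is rescaled whenever the running max changes, the full score matrix never needs to exist, which is what lets the real kernels keep everything in shared memory and registers.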