[DEPRECATED] Moved to ROCm/rocm-systems repo
Online CUDA Occupancy Calculator
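The occupancy figure such calculators report can be reproduced in a few lines: theoretical occupancy is the ratio of active warps per SM to the hardware maximum, where the block count per SM is capped by warp slots, registers, and shared memory. The sketch below is a minimal model; all SM limit constants are assumptions loosely modeled on an Ampere-class GPU, not values taken from the calculator above:

```python
# Assumed SM resource limits (hypothetical Ampere-like values, not from the tool).
MAX_WARPS_PER_SM = 48
MAX_BLOCKS_PER_SM = 16
REGS_PER_SM = 65536
SMEM_PER_SM = 100 * 1024  # bytes
WARP_SIZE = 32

def occupancy(threads_per_block, regs_per_thread, smem_per_block):
    """Theoretical occupancy = active warps per SM / max warps per SM."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceil division
    # Each resource independently caps how many blocks fit on one SM.
    limit_warps = MAX_WARPS_PER_SM // warps_per_block
    limit_regs = REGS_PER_SM // (regs_per_thread * warps_per_block * WARP_SIZE)
    limit_smem = SMEM_PER_SM // smem_per_block if smem_per_block else MAX_BLOCKS_PER_SM
    blocks = min(MAX_BLOCKS_PER_SM, limit_warps, limit_regs, limit_smem)
    return blocks * warps_per_block / MAX_WARPS_PER_SM
```

For example, at 256 threads per block the warp-slot limit allows 6 blocks (48 // 8), so a kernel using 32 registers per thread reaches full occupancy, while 64 registers per thread caps it at 4 blocks. Real GPUs additionally round register and shared-memory allocations to hardware granularities, which this sketch omits.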
(Spring 2017) Assignment 2: GPU Executor
Runs a single CUDA/OpenCL kernel, taking its source from a file and arguments from the command-line
GPU Drano: static analysis for GPU programs.
The official implementation for paper "AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation"
Prototype for a SPIR-V assembler and disassembler. It provides a composable Java interface for generating SPIR-V code at runtime.
Open source skill library for AI coding agents to write, optimize, and debug high performance compute kernels across CUDA, Triton, and quantized workloads.
Fused Triton kernels for Transformer inference: RMSNorm+RoPE, Gated MLP, and FP8 GEMM.
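As a reference for what kernels like these fuse, here is an unfused pure-Python sketch of RMSNorm and rotary position embedding (RoPE). The `eps` and `base` defaults are common conventions, not details from the repo; a fused Triton kernel would compute both in one pass over the data:

```python
import math

def rmsnorm(x, weight, eps=1e-6):
    # y_i = x_i / sqrt(mean(x^2) + eps) * w_i
    ms = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(ms + eps)
    return [v * inv * w for v, w in zip(x, weight)]

def rope(x, pos, base=10000.0):
    # Rotate each pair (x[2i], x[2i+1]) by angle pos * base^(-2i/d).
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out
```

Fusing the two saves a round trip to global memory: the normalized values are rotated while still in registers instead of being written out and re-read by a second kernel.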
Real-time NVIDIA GPU command capture, decoding, and visualization
A self-hosted low-level functional-style programming language 🌀
High-performance GPU-accelerated C# scripting for Rhino Grasshopper, powered by ILGPU
Noeris: autonomous kernel-fusion discovery and Triton autotuning for LLM kernels, with deeper fusion of Gemma layers (A100/H100 wins).
Histogram-aware CUDA scheduling prototype for TensorRT-LLM-oriented MoE token routing and aligned subpaths.
CUDA fast paths for MoE dispatch and weighted combine with TensorRT-LLM-oriented trace replay benchmarks
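The dispatch-and-combine pattern these MoE kernels accelerate can be sketched on the host side. The pure-Python reference below shows top-k gating, dispatch to experts, and weighted combine; the softmax gate and `k=2` default are common MoE conventions, not details taken from either repo:

```python
import math

def top_k_route(logits, k):
    # Softmax over expert logits, keep the k largest, renormalize their weights.
    m = max(logits)
    probs = [math.exp(x - m) for x in logits]
    z = sum(probs)
    probs = [p / z for p in probs]
    idx = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    w = sum(probs[i] for i in idx)
    return [(i, probs[i] / w) for i in idx]

def moe_forward(token, logits, experts, k=2):
    # Dispatch the token to its top-k experts, then weighted-combine the outputs.
    out = [0.0] * len(token)
    for i, weight in top_k_route(logits, k):
        y = experts[i](token)
        out = [o + weight * yj for o, yj in zip(out, y)]
    return out
```

On a GPU the expensive parts are the scatter of tokens to per-expert buffers (dispatch) and the gather-with-weights on the way back (combine), which is exactly where a fused CUDA fast path pays off.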
22 progressive Triton GPU kernels, from elementwise ops to Flash Attention v2, featuring correctness tests and PyTorch throughput/TFLOPS benchmarks.
16-step CUDA optimization of FlashAttention-2 achieving 99.2% of official performance on A100 — Ampere architecture
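The core trick behind FlashAttention-style kernels, the online softmax, can be demonstrated without a GPU. The pure-Python sketch below streams over keys/values one at a time while maintaining a running max, normalizer, and rescaled accumulator, and matches a naive two-pass softmax attention. It is a single-query simplification without the 1/sqrt(d) scaling or tiling, not the implementation from either repo above:

```python
import math

def naive_attention(q, ks, vs):
    # Two-pass reference: materialize all scores, softmax, then combine values.
    scores = [sum(a * b for a, b in zip(q, k)) for k in ks]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    d = len(vs[0])
    return [sum(w[i] * vs[i][j] for i in range(len(vs))) / z for j in range(d)]

def flash_attention(q, ks, vs):
    # One pass: running max m, running normalizer l, rescaled accumulator o.
    m = -math.inf
    l = 0.0
    o = [0.0] * len(vs[0])
    for k, v in zip(ks, vs):
        s = sum(a * b for a, b in zip(q, k))
        m_new = max(m, s)
        scale = math.exp(m - m_new)  # exp(-inf) == 0.0 on the first step
        p = math.exp(s - m_new)
        l = l * scale + p
        o = [oj * scale + p * vj for oj, vj in zip(o, v)]
        m = m_new
    return [oj / l for oj in o]
```

Because the accumulator is rescaled whenever the running max changes, the full score matrix never needs to exist, which is what lets the real kernels keep everything in shared memory and registers.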