kernel-skills is an open source library of high-quality skill files for AI coding agents working on compute kernels. It is a curated collection of SKILL.md files, each a structured engineering playbook that an agent can follow when writing, optimizing, debugging, or porting compute kernels.
The skills are the product. The repository also ships an npm package (@krxgu/kernel-skills) that wraps them in a versioned registry with a small CLI and TypeScript API. Use the npm package if you want to script skill discovery and bundling; otherwise just read or paste the Markdown directly.
- not a kernel compiler — it never invokes nvcc, ptxas, or hip-clang
- not a benchmark harness — it does not run kernels or measure performance
- not a model-serving runtime
- not an autonomous coding agent
- does not execute generated code
AI coding agents produce substantially worse kernel code when given vague prompts. They skip constraint gathering, choose incorrect tile strategies, ignore boundary conditions, make unsupported performance claims, and produce code that looks plausible but fails on real hardware.
Structured skill files change this. A well-authored skill forces the agent to:
- gather the right constraints before writing a single line of code
- choose the correct algorithm and memory strategy for the workload
- reason explicitly about correctness risks and edge cases
- avoid cargo-cult optimization and fake performance claims
- explain tradeoffs with technical precision
- know when a custom kernel is not the right answer
This repository exists to provide those skill files at expert quality, openly, for any agent and any workflow.
Same model. Same prompt. One difference: a kernel skill file. The naive softmax kernel fails on overflow and large shapes. The skill-guided version stays correct and bandwidth-competitive.
| Shape N | Naive · normal | Stable · normal | Naive · adversarial | Stable · adversarial |
|---|---|---|---|---|
| 64 | ✅ | ✅ | ❌ | ✅ |
| 128 | ✅ | ✅ | ❌ | ✅ |
| 256 | ✅ | ✅ | ❌ | ✅ |
| 257 | ❌ | ✅ | ❌ | ✅ |
| 512 | ❌ | ✅ | ❌ | ✅ |
| 1024 | ❌ | ✅ | ❌ | ✅ |
| 2048 | ❌ | ✅ | ❌ | ✅ |
| 4096 | ❌ | ✅ | ❌ | ✅ |
Naive adversarial: 8/8 shapes fail — NaN/Inf output, no max subtraction.
Naive normal for N > 256: 5/5 shapes fail — silent wrong output, no strided loop.
Stable after skill: 0/16 failures. Bandwidth within 1.2% of torch.softmax.
→ Full proof page with root-cause analysis and all charts
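The adversarial failure mode above is the classic exp overflow: without max subtraction, exp(x) saturates to infinity and the normalization divides Inf by Inf. A scalar TypeScript sketch of the two variants (illustrative only; the benchmarked kernels are CUDA, not this code):

```typescript
// Naive softmax: exp() overflows for large logits, so the sum is Infinity
// and every element becomes Inf / Inf = NaN.
function naiveSoftmax(x: number[]): number[] {
  const e = x.map(Math.exp);
  const s = e.reduce((a, b) => a + b, 0);
  return e.map((v) => v / s);
}

// Stable softmax: subtracting the row max keeps every exponent <= 0,
// so exp() stays in (0, 1] and the sum is finite.
function stableSoftmax(x: number[]): number[] {
  const m = Math.max(...x);
  const e = x.map((v) => Math.exp(v - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map((v) => v / s);
}

const adversarial = [1000, 1001, 1002];
console.log(naiveSoftmax(adversarial));  // [NaN, NaN, NaN]
console.log(stableSoftmax(adversarial)); // finite values summing to 1
```

The same fix carries over unchanged to the GPU kernel; the skill additionally forces the strided loop that the naive version misses for N > 256.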
This repository is for engineers who use AI coding agents to work on:
- CUDA kernel development
- Triton kernel development
- Quantized kernels (int8, fp8)
- High performance numerics and AI workloads
- Kernel optimization, debugging, and porting
It is also useful for engineers who want a technical reference for how to approach these problems systematically, independent of any agent.
Install the package:
```
npm install @krxgu/kernel-skills
```

Or run any CLI command without installing:

```
npx @krxgu/kernel-skills list
```

You can also clone the repo and use the SKILL.md files directly — the npm package is a convenience layer on top of that source of truth.
```
kernel-skills list
kernel-skills list --category triton
kernel-skills search rmsnorm
kernel-skills show triton.write-triton-layernorm-kernel
kernel-skills path triton.write-triton-layernorm-kernel
kernel-skills bundle triton.write-triton-layernorm-kernel patterns.write-kernel-test-plan
kernel-skills categories
kernel-skills tags
```

Full reference: examples/cli-usage.md. Bundling guide: examples/agent-bundle-usage.md.
```ts
import { searchSkills, getSkill, bundleSkills } from "@krxgu/kernel-skills";

const matches = searchSkills("rmsnorm");
const skill = await getSkill("triton.write-triton-layernorm-kernel");
const bundle = await bundleSkills([
  "triton.write-triton-layernorm-kernel",
  "patterns.write-kernel-test-plan",
]);
console.log(bundle);
```

Full API: examples/programmatic-usage.md.
```
kernel-skills/
├── README.md
├── LICENSE
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── ROADMAP.md
├── CLAUDE.md
├── package.json
├── tsconfig.json
├── .gitignore
├── src/                      # TypeScript source (CLI + programmatic API)
│   ├── index.ts
│   ├── registry.ts
│   ├── search.ts
│   ├── bundle.ts
│   ├── cli.ts
│   ├── paths.ts
│   └── types.ts
├── scripts/                  # build-time scripts
│   ├── generate-index.ts
│   └── validate-skills.ts
├── schema/
│   └── skill.schema.json     # JSON Schema for skill.json metadata
├── generated/
│   └── skills.index.json     # regenerated at build, ships in npm tarball
├── skills/                   # source of truth for all skills
│   ├── cuda/
│   ├── triton/
│   ├── patterns/
│   ├── quantization/
│   ├── portability/
│   └── inference/
├── proof/                    # measured before/after evidence per skill
│   ├── README.md
│   ├── cuda/
│   │   └── softmax/
│   │       ├── softmax-correctness.md
│   │       ├── hero-proof.png
│   │       ├── error-cliff.png
│   │       └── code-diff.png
│   ├── triton/
│   ├── patterns/
│   ├── quantization/
│   └── portability/
└── examples/
    ├── how-to-use-with-claude-code.md
    ├── how-to-use-with-chatgpt.md
    ├── how-to-use-with-cursor.md
    ├── how-to-use-with-gemini-cli.md
    ├── cli-usage.md
    ├── programmatic-usage.md
    └── agent-bundle-usage.md
```
Each skills/<category>/<skill>/ directory contains both SKILL.md (the playbook) and skill.json (machine-readable metadata).
More skills are being added. See ROADMAP.md for what is coming next.
| Skill | Description |
|---|---|
| write-cuda-gemm-kernel | Design and implement a tiled CUDA GEMM kernel — shared memory strategy, tensor core eligibility, accumulation precision, and when to use cuBLAS/CUTLASS instead |
| write-cuda-reduction-kernel | Write a correct parallel reduction with warp shuffle tree, multi-block strategy, and correct handling of partial tiles |
| write-cuda-softmax-kernel | Implement online or two-pass softmax with numerically stable max subtraction and correct warp-level reduction |
| write-cuda-layernorm-kernel | Implement layer normalization with Welford online variance, fused mean/variance computation, and fp32 accumulation in fp16 kernels |
| optimize-global-memory-access | Analyze and fix coalescing, alignment, and vectorized load/store patterns using Nsight Compute metrics |
| optimize-shared-memory-tiling | Apply shared memory tiling with bank conflict analysis, padding strategies, and double buffering |
| avoid-warp-divergence | Classify avoidable vs unavoidable divergence, apply ballot/shuffle fast paths and stream compaction, estimate the real cost before restructuring |
| choose-launch-configuration | Select block size, grid size, and shared memory from occupancy analysis, register budget, and workload shape |
| debug-cuda-kernel-correctness | Systematic workflow for isolating indexing bugs, race conditions, reduction errors, dtype issues, and out-of-bounds accesses in CUDA kernels |
| Skill | Description |
|---|---|
| write-triton-gemm-kernel | Write a Triton GEMM kernel with correct block tiling, tl.dot accumulation, row/col-major loading, and when CUTLASS is preferable |
| write-triton-softmax-kernel | Implement numerically stable softmax in Triton with block size selection for the reduction axis and masking for variable sequence lengths |
| write-triton-layernorm-kernel | Implement LayerNorm in Triton with Welford online variance, persistent kernel pattern, and backward pass accumulation strategy |
| write-triton-attention-kernel | Implement Flash Attention in Triton — causal mask handling, kv-block loop structure, online softmax scaling, and fp16/bf16 accumulation decisions |
| optimize-triton-block-parameters | Select BLOCK_M/N/K, num_warps, and num_stages; reason about register pressure, occupancy, and autotuning config design |
Skills for the LLM serving hot path — Triton kernels for the building blocks of LLaMA-family transformers, plus pattern and integration-planning skills for vLLM and TensorRT.
| Skill | Description |
|---|---|
| write-triton-rmsnorm-kernel | Implement RMSNorm in Triton — one-pass sum-of-squares with fp32 accumulation, persistent kernel for typical LLM hidden sizes, forward + backward correctness |
| write-triton-fused-add-rmsnorm-kernel | Fuse residual-add + RMSNorm into one Triton kernel — single memory pass with the summed residual written back for the next transformer block |
| write-triton-silu-mul-kernel | Implement silu(a) * b (SwiGLU) in Triton — fp32 sigmoid for stability, GeGLU/ReGLU variants, bandwidth-bound tuning |
| write-triton-rope-kernel | Apply Rotary Position Embeddings to Q/K — GPT-NeoX vs GPT-J layout disambiguation, continuous-batching position handling, fp32 cos/sin tables |
| write-triton-sampling-kernel | Decode-time token sampling in Triton — temperature, top-k, top-p (nucleus), multinomial draw, with heterogeneous per-request sampling parameters |
| write-triton-kv-cache-append-kernel | Append new K/V into the KV cache during decode — contiguous and paged (vLLM-style) layouts, GQA, fp8 KV cache scaling |
| write-triton-dequant-kernel | Dequantize int4/int8 weights to fp16/bf16 — AWQ/GPTQ/NF4/int8 schemes, bit-unpacking, per-group scale/zero arithmetic, when to fuse into a matmul instead |
| optimize-prefill-vs-decode-kernels | Reason about prefill (compute-bound, large M) vs decode (memory-bound, M=1) regimes — kernel family, tile shape, split strategy, continuous batching, speculative decoding |
| write-vllm-custom-op-integration-plan | Plan a custom CUDA/Triton kernel integration into vLLM — paged KV cache, CUDA graph capture, tensor parallelism, model-runner hooks, benchmark strategy |
| write-tensorrt-plugin-integration-plan | Plan a TensorRT plugin around a custom CUDA kernel — IPluginV3 vs IPluginV2 choice, plugin lifecycle, dynamic shapes, serialization, FP16/INT8/FP8 |
| Skill | Description |
|---|---|
| fuse-elementwise-ops | Decide when and how to fuse elementwise operations — memory bandwidth arithmetic, producer-consumer fusion, and epilogue fusion patterns |
| write-numerically-stable-kernel | Apply Kahan summation, log-sum-exp trick, compensated accumulation, and dtype selection for stable intermediate values |
| handle-boundary-conditions | Handle partial tiles, misaligned sizes, and out-of-bounds accesses correctly — masked loads, predicated stores, and tail handling strategies |
| choose-tile-size-and-work-partitioning | Reason about arithmetic intensity, shared memory budget, occupancy tradeoffs, and work partitioning for irregular shapes |
| write-kernel-test-plan | Design a correctness and numerical test plan — reference comparison strategy, input shape sweep, dtype coverage, tolerance reasoning, and CI integration |
| Skill | Description |
|---|---|
| write-int8-quantized-kernel | Implement INT8 quantized matrix operations — dp4a instruction, symmetric vs asymmetric quantization, INT32 accumulation, per-channel scale epilogue, cuBLAS vs CUTLASS vs custom decision |
| write-fp8-kernel | Design FP8 compute kernels for Hopper/Ada — E4M3/E5M2 format selection, satfinite conversion, delayed scaling, WGMMA on H100, and hipBLASLt on MI300X |
| debug-quantized-kernel-accuracy | Diagnose accuracy regressions in quantized kernels — scale validation, overflow detection, per-element error attribution, and calibration diagnostics |
| Skill | Description |
|---|---|
| port-cuda-kernel-to-triton | Systematically translate a CUDA kernel to Triton — execution model mapping, warp primitives to tl.reduce, shared memory to block-scoped accumulators |
| port-cuda-kernel-to-hip | Port CUDA to HIP/ROCm — wavefront width differences, 64-bit ballot masks, WMMA to rocWMMA, hipify audit checklist for MI250/MI300X targets |
| write-backend-agnostic-kernel-plan | Plan a kernel that must run on NVIDIA and AMD — abstraction strategy, portability risk register, per-backend tile sizing, and CI matrix |
1. Find the skill that matches your task in `skills/`.
2. Open the `SKILL.md` file and paste its full contents into your agent's context.
3. Ask the agent to perform the task.
The skill does not replace your prompt — it forces the agent to reason correctly before writing a single line of code.
```
<paste contents of skills/cuda/write-cuda-reduction-kernel/SKILL.md>

Write a warp-shuffle reduction kernel for float32 inputs on an H100.
Input shape: [B=32, N=65536]. Output: [B] row-wise sums.
```
The skill works the same way with ChatGPT, Cursor, Gemini CLI, and any other agent that accepts context.
| Agent | Guide |
|---|---|
| Claude Code | examples/how-to-use-with-claude-code.md |
| ChatGPT | examples/how-to-use-with-chatgpt.md |
| Cursor | examples/how-to-use-with-cursor.md |
| Gemini CLI | examples/how-to-use-with-gemini-cli.md |
Every skill ships with a skill.json next to its SKILL.md. Example:
```json
{
  "id": "triton.write-triton-layernorm-kernel",
  "name": "Write Triton LayerNorm Kernel",
  "category": "triton",
  "summary": "Implement LayerNorm in Triton with Welford online variance, persistent kernel pattern, and backward pass accumulation strategy.",
  "tags": ["triton", "layernorm", "normalization", "welford"],
  "difficulty": "intermediate",
  "hardware": ["nvidia", "amd"],
  "languages": ["python", "triton"],
  "version": "0.1.0",
  "entry": "skills/triton/write-triton-layernorm-kernel/SKILL.md"
}
```

The full schema is in schema/skill.schema.json. The build aggregates every skill.json into generated/skills.index.json, which is what the CLI and programmatic API read from.
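To make the index concrete, here is a sketch of how a consumer could filter it. The entry shape is assumed from the skill.json example above, and the in-memory array stands in for the real generated/skills.index.json file:

```typescript
// Minimal shape of one index entry (assumed from the skill.json example; the
// real index may carry more fields).
interface SkillEntry {
  id: string;
  category: string;
  tags: string[];
  difficulty: "beginner" | "intermediate" | "advanced";
}

// Illustrative in-memory stand-in for generated/skills.index.json.
const index: SkillEntry[] = [
  {
    id: "triton.write-triton-layernorm-kernel",
    category: "triton",
    tags: ["triton", "layernorm", "normalization", "welford"],
    difficulty: "intermediate",
  },
  {
    id: "cuda.write-cuda-softmax-kernel",
    category: "cuda",
    tags: ["cuda", "softmax"],
    difficulty: "intermediate",
  },
];

// The kind of query `kernel-skills list --category triton` performs.
const triton = index.filter((s) => s.category === "triton");
console.log(triton.map((s) => s.id)); // ["triton.write-triton-layernorm-kernel"]
```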
Allowed categories: cuda, triton, patterns, quantization, portability, inference.
Allowed difficulty values: beginner, intermediate, advanced.
1. Create `skills/<category>/<skill-name>/SKILL.md` following the 11-section template documented in CONTRIBUTING.md.
2. Create `skills/<category>/<skill-name>/skill.json` with the metadata fields above.
3. Run `npm run validate:skills` to confirm the metadata is well-formed and the `SKILL.md` exists.
4. Run `npm run generate:index` to regenerate `generated/skills.index.json`.
5. Open a pull request.
The validator rejects: missing skill.json, missing required fields, duplicate ids, unknown categories, mismatched parent folder vs category, empty tags arrays, invalid difficulty, unparseable JSON, missing entry files, and SKILL.md files smaller than 400 bytes.
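A few of those checks can be sketched in TypeScript. This is an illustrative subset over an assumed metadata shape, not the code of scripts/validate-skills.ts:

```typescript
// Categories allowed by the schema, per the list above.
const ALLOWED_CATEGORIES = new Set([
  "cuda", "triton", "patterns", "quantization", "portability", "inference",
]);

// Assumed minimal metadata shape for this sketch.
interface SkillMeta {
  id: string;
  category: string;
  tags: string[];
}

// Collect human-readable errors instead of failing on the first one,
// so a contributor sees every problem in a single run.
function validate(skills: SkillMeta[]): string[] {
  const errors: string[] = [];
  const seen = new Set<string>();
  for (const s of skills) {
    if (seen.has(s.id)) errors.push(`duplicate id: ${s.id}`);
    seen.add(s.id);
    if (!ALLOWED_CATEGORIES.has(s.category)) errors.push(`unknown category: ${s.category}`);
    if (s.tags.length === 0) errors.push(`empty tags: ${s.id}`);
  }
  return errors;
}

console.log(validate([
  { id: "cuda.a", category: "cuda", tags: ["x"] },
  { id: "cuda.a", category: "opencl", tags: [] },
]));
// ["duplicate id: cuda.a", "unknown category: opencl", "empty tags: cuda.a"]
```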
Before publishing:
```
npm install
npm run generate:index
npm run validate:skills
npm run build
npm run test
npm run publish:dry-run
```

Inspect the dry-run output and confirm only dist/, skills/, generated/, schema/, examples/, README.md, LICENSE, and package.json are included.
First publish (scoped public package):
```
npm login
npm publish --access public
```

Subsequent versions:

```
npm version patch   # or minor / major
npm publish
```

Semantic versioning, applied to package behavior:
- patch — typo fixes, metadata fixes, non-breaking skill improvements
- minor — new skills, new CLI commands, new API helpers
- major — breaking changes to the metadata schema, CLI behavior, or API return types
Each skill.json also carries its own version field for fine-grained tracking of individual skill revisions.
Contributions are welcome. Before opening a pull request, read CONTRIBUTING.md.
The short version: open an issue first to propose the skill scope, follow the required 11-section SKILL.md template, meet the quality bar, and keep naming conventions consistent.
Low-quality, vague, or out-of-scope skill files will not be merged regardless of technical domain.
More skills are being added across CUDA, Triton, quantization, and portability, following the quality-first principle: each skill ships only when it is genuinely better than a generic prompt.
See ROADMAP.md for the full plan.
MIT. See LICENSE.

