High-performance Triton kernels for Transformer inference workloads.
📖 Docs | 🇨🇳 中文 | 💡 Examples | 🤝 Contributing
triton-fused-ops focuses on three kernel families:
- `fused_rmsnorm_rope`: RMSNorm + RoPE in one kernel
- `fused_gated_mlp`: gated MLP fusion (SiLU/GELU)
- `fp8_gemm` + quantization helpers: FP8 matrix multiplication pipeline
The goal is to reduce redundant memory traffic and keep a practical, testable integration surface.
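For context, the unfused reference path these kernels replace looks roughly like the sketch below in plain PyTorch. The `eps` value, shapes, and the "rotate half" RoPE convention are illustrative assumptions, not necessarily this repository's exact implementation.

```python
import torch

def rmsnorm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RMSNorm: normalize by the root-mean-square over the hidden dimension, then scale.
    rms = torch.rsqrt(x.float().pow(2).mean(dim=-1, keepdim=True) + eps)
    return (x.float() * rms).to(x.dtype) * weight

def rope_ref(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Rotary position embedding, "rotate half" convention; cos/sin must broadcast to x.
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    rotated = torch.cat((-x2, x1), dim=-1)
    return x * cos + rotated * sin

# Unfused path: two separate ops, each reading and writing the activations in memory.
# y_ref = rope_ref(rmsnorm_ref(x, weight), cos, sin)
```

Fusing the two steps into one kernel removes the intermediate write and re-read of the normalized activations, which is where most of the memory-traffic savings come from.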
- GPU is required for Triton kernel execution and performance benchmarking.
- CPU-only environments can still run import/type/lint/unit checks (CI uses this path); a CUDA-availability guard is sketched after this list.
- Performance numbers depend on GPU architecture, model shape, and batch size.
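One simple way to respect this split in scripts or tests is to gate kernel execution on CUDA availability. A minimal sketch; the fallback behavior is an illustrative assumption:

```python
import torch

if torch.cuda.is_available():
    # Triton kernels need a GPU.
    from triton_ops import fused_rmsnorm_rope
    device = "cuda"
else:
    # CPU-only environment (e.g. CI): stick to import/type/lint/unit checks,
    # or fall back to an unfused PyTorch reference path.
    device = "cpu"
```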
```bash
git clone https://github.com/LessUp/triton-fused-ops.git
cd triton-fused-ops
pip install -e ".[dev]"
```

Import check (works in CPU-only environments):
python -c "import triton_ops; print(triton_ops.__version__)"CPU-safe validation baseline:
```bash
ruff format --check .
ruff check .
mypy triton_ops/
pytest tests/ -v -k "not cuda and not gpu" --ignore=tests/benchmarks/
python3 -m build
```

Quick usage example for `fused_rmsnorm_rope` (GPU required):

```python
import torch
from triton_ops import fused_rmsnorm_rope
x = torch.randn(2, 128, 4096, device="cuda", dtype=torch.float16)  # (batch, seq_len, hidden)
weight = torch.ones(4096, device="cuda", dtype=torch.float16)      # RMSNorm scale over the hidden dim
cos = torch.randn(128, 128, device="cuda", dtype=torch.float16)    # RoPE cos table (illustrative values)
sin = torch.randn(128, 128, device="cuda", dtype=torch.float16)    # RoPE sin table (illustrative values)
y = fused_rmsnorm_rope(x, weight, cos, sin)
print(y.shape)
```

Representative benchmark snapshots in this repository are measured on NVIDIA A100 (CUDA 12.1) and are intended as directional references, not universal guarantees.
| Kernel | Typical speedup vs unfused/reference path |
|---|---|
| `fused_rmsnorm_rope` | up to ~3x |
| `fused_gated_mlp` | ~1.3x–1.8x |
| `fp8_gemm` | ~1.2x–1.5x |
See tests/benchmarks/ and docs for setup details.
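To get a directional number on your own hardware, a micro-benchmark along the following lines can time the fused kernel (and, analogously, an unfused reference path) with Triton's built-in timing helper. The shapes are the same illustrative ones as in the usage example above; this is a sketch, not the repository's benchmark harness.

```python
import torch
from triton.testing import do_bench
from triton_ops import fused_rmsnorm_rope

x = torch.randn(2, 128, 4096, device="cuda", dtype=torch.float16)
weight = torch.ones(4096, device="cuda", dtype=torch.float16)
cos = torch.randn(128, 128, device="cuda", dtype=torch.float16)
sin = torch.randn(128, 128, device="cuda", dtype=torch.float16)

# do_bench handles warmup and repetitions and reports a time in milliseconds.
fused_ms = do_bench(lambda: fused_rmsnorm_rope(x, weight, cos, sin))
print(f"fused_rmsnorm_rope: {fused_ms:.3f} ms")
```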
This repository is OpenSpec-driven for non-trivial work:
- Create/select an OpenSpec change
- Complete proposal/design/specs/tasks
- Implement task-by-task
- Run review and validation
See the docs for details.
License: MIT.