Triton Fused Ops

Requirements: Python 3.9+ · PyTorch 2.0+ · Triton 2.1+ · License: MIT

High-performance Triton kernels for Transformer inference workloads.

📖 Docs | 🇨🇳 中文 | 💡 Examples | 🤝 Contributing


What this repository provides

triton-fused-ops focuses on three kernel families:

  • fused_rmsnorm_rope: RMSNorm + RoPE in one kernel
  • fused_gated_mlp: gated MLP fusion (SiLU/GELU)
  • fp8_gemm + quantization helpers: FP8 matrix multiplication pipeline

The goal is to reduce redundant memory traffic between operations while keeping a practical, testable integration surface.
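As an illustration of the quantization side of the FP8 pipeline, per-tensor scaling for the E4M3 format (largest finite value 448) can be sketched in plain PyTorch. The helper name below is illustrative, not this library's API:

```python
import torch

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def per_tensor_fp8_scale(x: torch.Tensor) -> torch.Tensor:
    """Compute a per-tensor scale that maps x into the E4M3 range."""
    amax = x.abs().max().clamp(min=1e-12)  # guard against all-zero input
    return amax / E4M3_MAX

x = torch.randn(256, 256) * 10.0
scale = per_tensor_fp8_scale(x)
x_scaled = (x / scale).clamp(-E4M3_MAX, E4M3_MAX)  # now safe to cast to FP8
```

A real pipeline would then cast `x_scaled` to an FP8 dtype and carry `scale` alongside the tensor for dequantization after the GEMM.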

Runtime boundaries

  • GPU is required for Triton kernel execution and performance benchmarking.
  • CPU-only environments can still run import/type/lint/unit checks (CI uses this path).
  • Performance numbers depend on GPU architecture, model shape, and batch size.
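Because the kernels require a GPU while the test/lint path does not, downstream code typically guards dispatch on device availability. A minimal sketch:

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA when available; fall back to CPU for import/unit checks."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

device = pick_device()
print(f"Running on: {device}")
```

On the CPU path you can still exercise imports and reference implementations; only kernel launches and benchmarks need the CUDA branch.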

Installation

git clone https://github.com/LessUp/triton-fused-ops.git
cd triton-fused-ops
pip install -e ".[dev]"

Quick checks

Import check (works in CPU-only environments):

python -c "import triton_ops; print(triton_ops.__version__)"

CPU-safe validation baseline:

ruff format --check .
ruff check .
mypy triton_ops/
pytest tests/ -v -k "not cuda and not gpu" --ignore=tests/benchmarks/
python3 -m build

Minimal usage (GPU)

import torch
from triton_ops import fused_rmsnorm_rope

x = torch.randn(2, 128, 4096, device="cuda", dtype=torch.float16)
weight = torch.ones(4096, device="cuda", dtype=torch.float16)
cos = torch.randn(128, 128, device="cuda", dtype=torch.float16)
sin = torch.randn(128, 128, device="cuda", dtype=torch.float16)

y = fused_rmsnorm_rope(x, weight, cos, sin)
print(y.shape)
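For sanity checks in CPU-only environments, an unfused PyTorch baseline (standard RMSNorm followed by rotate-half RoPE) is useful. The conventions below (eps value, rotate-half layout, cos/sin covering the full rotated dimension) are common defaults and may differ from this library's exact implementation:

```python
import torch

def rmsnorm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RMSNorm: scale by reciprocal RMS of the last dim, then apply the learned weight
    rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x * rms * weight

def rope_ref(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Rotate-half RoPE: split the last dim in two halves and rotate
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    rotated = torch.cat((-x2, x1), dim=-1)
    return x * cos + rotated * sin

x = torch.randn(2, 128, 256)
weight = torch.ones(256)
cos = torch.randn(128, 256)  # broadcasts over the batch dim
sin = torch.randn(128, 256)
y_ref = rope_ref(rmsnorm_ref(x, weight), cos, sin)
print(y_ref.shape)  # torch.Size([2, 128, 256])
```

Comparing this baseline against the fused kernel output (with a tolerance appropriate for float16) is the usual correctness check before benchmarking.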

Benchmark note

Representative benchmark snapshots in this repository are measured on NVIDIA A100 (CUDA 12.1) and are intended as directional references, not universal guarantees.

Kernel               Typical speedup vs unfused/reference path
fused_rmsnorm_rope   up to ~3x
fused_gated_mlp      ~1.3x–1.8x
fp8_gemm             ~1.2x–1.5x

See tests/benchmarks/ and docs for setup details.

Development workflow

This repository is OpenSpec-driven for non-trivial work:

  1. Create/select an OpenSpec change
  2. Complete proposal/design/specs/tasks
  3. Implement task-by-task
  4. Run review and validation

See the project docs for details on the OpenSpec workflow.

License

MIT.