High-performance Triton kernels for Transformer inference workloads.
📖 Docs | 🇨🇳 中文 | 💡 Examples | 🤝 Contributing
triton-fused-ops focuses on three kernel families:
- `fused_rmsnorm_rope`: RMSNorm + RoPE in one kernel
- `fused_gated_mlp`: gated MLP fusion (SiLU/GELU)
- `fp8_gemm` + quantization helpers: FP8 matrix multiplication pipeline
The goal is to reduce redundant memory traffic and keep a practical, testable integration surface.
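For context, the unfused reference path these kernels replace looks roughly like the sketch below in plain PyTorch. The `eps` value, shapes, and the "rotate half" RoPE convention are illustrative assumptions, not necessarily this repository's exact implementation.

```python
import torch

def rmsnorm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RMSNorm: normalize by the root-mean-square over the hidden dimension, then scale.
    rms = torch.rsqrt(x.float().pow(2).mean(dim=-1, keepdim=True) + eps)
    return (x.float() * rms).to(x.dtype) * weight

def rope_ref(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Rotary position embedding, "rotate half" convention; cos/sin must broadcast to x.
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    rotated = torch.cat((-x2, x1), dim=-1)
    return x * cos + rotated * sin

# Unfused path: two separate ops, each reading and writing the activations in memory.
# y_ref = rope_ref(rmsnorm_ref(x, weight), cos, sin)
```

Fusing the two steps into one kernel removes the intermediate write and re-read of the normalized activations, which is where most of the memory-traffic savings come from.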
- GPU is required for Triton kernel execution and performance benchmarking.
- CPU-only environments can still run import/type/lint/unit checks (CI uses this path); a CUDA-availability guard is sketched after this list.
- Performance numbers depend on GPU architecture, model shape, and batch size.
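One simple way to respect this split in scripts or tests is to gate kernel execution on CUDA availability. A minimal sketch; the fallback behavior is an illustrative assumption:

```python
import torch

if torch.cuda.is_available():
    # Triton kernels need a GPU.
    from triton_ops import fused_rmsnorm_rope
    device = "cuda"
else:
    # CPU-only environment (e.g. CI): stick to import/type/lint/unit checks,
    # or fall back to an unfused PyTorch reference path.
    device = "cpu"
```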
```bash
git clone https://github.com/LessUp/triton-fused-ops.git
cd triton-fused-ops
pip install -e ".[dev]"
```

Import check (works in CPU-only environments):
python -c "import triton_ops; print(triton_ops.__version__)"CPU-safe validation baseline:
```bash
ruff format --check .
ruff check .
mypy triton_ops/
pytest tests/ -v -k "not cuda and not gpu" --ignore=tests/benchmarks/
python3 -m build
```

Quick usage example for `fused_rmsnorm_rope` (GPU required):

```python
import torch
from triton_ops import fused_rmsnorm_rope
x = torch.randn(2, 128, 4096, device="cuda", dtype=torch.float16)  # (batch, seq_len, hidden)
weight = torch.ones(4096, device="cuda", dtype=torch.float16)      # RMSNorm scale over the hidden dim
cos = torch.randn(128, 128, device="cuda", dtype=torch.float16)    # RoPE cos table (illustrative values)
sin = torch.randn(128, 128, device="cuda", dtype=torch.float16)    # RoPE sin table (illustrative values)
y = fused_rmsnorm_rope(x, weight, cos, sin)
print(y.shape)
```

Representative benchmark snapshots in this repository are measured on NVIDIA A100 (CUDA 12.1) and are intended as directional references, not universal guarantees.
| Kernel | Typical speedup vs unfused/reference path |
|---|---|
| `fused_rmsnorm_rope` | up to ~3x |
| `fused_gated_mlp` | ~1.3x–1.8x |
| `fp8_gemm` | ~1.2x–1.5x |
See tests/benchmarks/ and docs for setup details.
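To get a directional number on your own hardware, a micro-benchmark along the following lines can time the fused kernel (and, analogously, an unfused reference path) with Triton's built-in timing helper. The shapes are the same illustrative ones as in the usage example above; this is a sketch, not the repository's benchmark harness.

```python
import torch
from triton.testing import do_bench
from triton_ops import fused_rmsnorm_rope

x = torch.randn(2, 128, 4096, device="cuda", dtype=torch.float16)
weight = torch.ones(4096, device="cuda", dtype=torch.float16)
cos = torch.randn(128, 128, device="cuda", dtype=torch.float16)
sin = torch.randn(128, 128, device="cuda", dtype=torch.float16)

# do_bench handles warmup and repetitions and reports a time in milliseconds.
fused_ms = do_bench(lambda: fused_rmsnorm_rope(x, weight, cos, sin))
print(f"fused_rmsnorm_rope: {fused_ms:.3f} ms")
```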
This repository is OpenSpec-driven for non-trivial work:
- Create/select an OpenSpec change
- Complete proposal/design/specs/tasks
- Implement task-by-task
- Run review and validation
See the docs for details.
License: MIT.