
Chunkwise gated linear attention reaching 60~80 TFLOP/s, with step-by-step optimization records #88

Open
learning-chip wants to merge 31 commits into main from linear_attn
Conversation


@learning-chip learning-chip commented Apr 5, 2026

TL;DR: this is a much (3~5x) faster version of the Triton chunk_o kernel in vllm-ascend.

Note on the chosen algorithm: it mirrors the chunk_fwd_o part of chunk_gated_delta_rule_fwd used in vllm-ascend's Qwen3.5 prefill. The "chunk_o" part reduces to plain Gated Linear Attention (the same chunkwise form as Mamba-2). This PR does not cover the "delta rule" part yet, which only appears in the "chunk_h" phase.
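
For readers unfamiliar with the chunkwise form, here is a minimal PyTorch sketch of what the "chunk_o" phase computes (ungated case; tensor shapes, names, and the scale factor are illustrative assumptions, not this PR's actual API). Each chunk's output is an intra-chunk causal attention plus an inter-chunk term read from the running K^T V state; the state update at the end is really the "chunk_h" part, included here only to make the reference self-contained.

    import torch

    def chunk_o_reference(q, k, v, chunk_size=64):
        """Naive chunkwise linear attention output (ungated).

        q, k: [B, H, T, K], v: [B, H, T, V]; assumes T is a multiple of chunk_size.
        """
        B, H, T, K = q.shape
        o = torch.empty_like(v)
        mask = torch.tril(torch.ones(chunk_size, chunk_size, dtype=torch.bool, device=q.device))
        S = torch.zeros(B, H, K, v.shape[-1], dtype=torch.float32, device=q.device)  # running K^T V state
        for start in range(0, T, chunk_size):
            sl = slice(start, start + chunk_size)
            q_c = q[..., sl, :].float() * K ** -0.5            # scale is an assumption, may differ from the kernel
            k_c, v_c = k[..., sl, :].float(), v[..., sl, :].float()
            A = (q_c @ k_c.transpose(-1, -2)).masked_fill(~mask, 0.0)  # intra-chunk causal scores
            o_c = A @ v_c + q_c @ S                            # intra-chunk + inter-chunk (state) contribution
            o[..., sl, :] = o_c.to(o.dtype)
            S = S + k_c.transpose(-1, -2) @ v_c                # "chunk_h": state update for the next chunk
        return o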

Step-by-step perf optimizations

  • Initial static-shape starting point -- c226f0a (generated from tilelang linear_attention_causal; directly compiling dynamic shapes in tilelang fails, so I sent a few PRs to tilelang)
  • First dynamic-shape version, ~1 TFLOP/s -- 41aeecc
  • Inline all wrappers and reduce branching, ~5 TFLOP/s -- 5f1ac35
  • Precompute and cache the causal mask, ~30 TFLOP/s -- a9b54ed
  • Chunk size 64 -> 128, ~50 TFLOP/s -- bd954f9
  • L0 ping-pong buffer, ~58 TFLOP/s -- 7b811b0
  • Two-slot C-V pipelining, ~75 TFLOP/s -- 3350511
  • L1 prefetching, ~78 TFLOP/s and ~570 GiB/s bandwidth -- 26aac37

Added an optimize_step_by_step directory to reproduce the above step-by-step performance gains at the latest commit (a rough FLOP-counting convention for these TFLOP/s figures is sketched below).
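
For context on how TFLOP/s figures like the above can be derived, one plausible counting convention for the chunk_o phase is to count the two intra-chunk matmuls and the one inter-chunk matmul per chunk, sum over chunks, and divide by measured time. The benchmark's actual convention may differ; this is only a sketch.

    def chunk_o_flops(batch, heads, seqlen, k_dim, v_dim, chunk_size=128):
        """Rough FLOP count for the chunk_o phase only (excludes the chunk_h state update)."""
        c = chunk_size
        per_chunk = (
            2 * c * c * k_dim        # Q @ K^T
            + 2 * c * c * v_dim      # A @ V
            + 2 * c * k_dim * v_dim  # Q @ S (inter-chunk)
        )
        return batch * heads * (seqlen // c) * per_chunk

    # TFLOP/s ~ chunk_o_flops(...) / elapsed_seconds / 1e12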

Feature list:

  • Dynamic batch/seq dims, templated head/hidden dims
  • Precision and performance comparison against the Triton baseline (see triton_baseline/performance_summary.md; the PTO version is 3~4x faster)
  • Support for BSND (seq-first) and BNSD (head-first) layouts (seq-first works but drops throughput from ~75 to ~60 TFLOP/s)
  • Variable-seqlen input for the BSND case
  • Support for a scalar gating factor, matching FLA's simple GLA chunk (a gated reference is sketched after this list)
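
The following is a rough sketch of how a per-token scalar gate changes the ungated reference shown earlier, following the chunkwise form of FLA's simple GLA. Names such as g for the per-token log decay are assumptions for illustration, not the kernel's API.

    import torch

    def gated_chunk_o_reference(q, k, v, g, chunk_size=64):
        """Scalar-gated chunkwise linear attention. g: [B, H, T] per-token log decay (<= 0)."""
        B, H, T, K = q.shape
        o = torch.empty_like(v)
        mask = torch.tril(torch.ones(chunk_size, chunk_size, dtype=torch.bool, device=q.device))
        S = torch.zeros(B, H, K, v.shape[-1], dtype=torch.float32, device=q.device)
        for start in range(0, T, chunk_size):
            sl = slice(start, start + chunk_size)
            q_c, k_c, v_c = q[..., sl, :].float(), k[..., sl, :].float(), v[..., sl, :].float()
            b = g[..., sl].float().cumsum(-1)                   # [B, H, C] local cumulative log decay
            # intra-chunk: decay-weighted causal scores; mask applied on the exponent to avoid overflow
            decay = torch.exp((b[..., :, None] - b[..., None, :]).masked_fill(~mask, float("-inf")))
            A = (q_c @ k_c.transpose(-1, -2)) * decay
            # inter-chunk: previous chunks seen through the state, decayed up to each position
            o_c = A @ v_c + (q_c * torch.exp(b)[..., None]) @ S
            o[..., sl, :] = o_c.to(o.dtype)
            # state update ("chunk_h"): decay the old state across the chunk, add this chunk's decayed K^T V
            b_last = b[..., -1:]
            S = torch.exp(b_last)[..., None] * S + (k_c * torch.exp(b_last - b)[..., None]).transpose(-1, -2) @ v_c
        return o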

Compiles and runs correctly with the pto-isa headers in /usr/local/Ascend/cann-8.5.1/include (the CANN version in the quay.io/ascend/vllm-ascend:v0.18.0rc1 image). The GitCode tag 8.5.0 headers also compile and run fine.

Remaining issues for this PR

  • Although all unit tests pass, the largely AI-generated code still needs careful human review, cleaning, and annotation.
    • Better to keep this PR as-is, then manually distill the minimal, cleanest code in a separate PR and merge that.
  • Compile error when using a newer pto-isa header (04/03 commit); to be fixed later.
    • Now fixed by adding #include <pto/common/pto_tile.hpp> to the kernel sources and changing the use of TTRI. See commit -- 382153e

Minor issues:
To avoid slow mask construction via scalar-core loops, the causal mask is currently precomputed and passed in as an extra argument (a host-side sketch is shown after this subsection).

In vllm-ascend the on-SRAM mask is built by:

        o_i = tl.arange(0, BT).to(tl.float32)
        m_A = o_i[:, None] >= o_i[None, :]
  • Need to pick an efficient PTO instruction for this masking
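
Below is a host-side sketch of the precompute-and-pass-in approach. The repo's jit_shared.get_causal_mask presumably does something similar; the function name, dtype, and device handling here are assumptions.

    from functools import lru_cache

    import torch
    import torch_npu  # noqa: F401

    @lru_cache(maxsize=None)
    def make_causal_mask(chunk_size: int) -> torch.Tensor:
        # Build the lower-triangular [chunk_size, chunk_size] mask once on the host,
        # cache it, and hand it to the kernel as an extra tensor argument.
        idx = torch.arange(chunk_size)
        return (idx[:, None] >= idx[None, :]).to(torch.float32).npu()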

Remaining issues for future PRs

On algorithm side:

  • Add the "chunk_h" part (including chunk_scaled_dot_kkt_fwd / recompute_w_u_fwd / chunk_gated_delta_rule_fwd_h) for GatedDeltaNet and Kimi Delta Attention -- see Complete chunkwise GatedDeltaNet #91 (a naive delta-rule recurrence is sketched after this list)
  • Merge into one "GDN layer megakernel" and integrate into vllm/sglang
  • Generalize to the rest of the FLA repo's kernel collection (give the agent some samples of Triton GPU -> PTO-ISA porting; this PR might be enough)
  • Support backward kernels for training
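
For reference, here is a naive token-by-token sketch of the gated delta rule recurrence that the chunk_h work above would need to realize in chunkwise form. Parameter names alpha/beta follow the GatedDeltaNet convention and, like the assumption that k is normalized, are illustrative rather than this repo's API.

    import torch

    def gated_delta_rule_reference(q, k, v, alpha, beta):
        """q, k: [B, H, T, K] (k assumed normalized), v: [B, H, T, V],
        alpha: [B, H, T] gating/decay in (0, 1], beta: [B, H, T] update step size."""
        B, H, T, K = q.shape
        S = torch.zeros(B, H, K, v.shape[-1], dtype=torch.float32, device=q.device)
        o = torch.empty(B, H, T, v.shape[-1], dtype=torch.float32, device=q.device)
        for t in range(T):
            k_t = k[..., t, :].float()                  # [B, H, K]
            v_t = v[..., t, :].float()                  # [B, H, V]
            a_t = alpha[..., t, None, None].float()     # [B, H, 1, 1]
            b_t = beta[..., t, None].float()            # [B, H, 1]
            S = a_t * S                                 # gated decay of the state
            # delta rule: move the value stored under k_t toward v_t, scaled by beta
            v_old = torch.einsum("bhk,bhkv->bhv", k_t, S)
            S = S + torch.einsum("bhk,bhv->bhkv", k_t, b_t * (v_t - v_old))
            o[..., t, :] = torch.einsum("bhk,bhkv->bhv", q[..., t, :].float(), S)
        return o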

On C++ framework side:

  • Test the new TPUSH/TPOP abstraction for C-V communication
  • Test generalization to the A5 backend
  • Test the new AUTO mode for synchronization

On Python DSL framework side:

  • Implement this simple chunk_o part of linear attention as the first useful mixed-kernel example in pto-dsl
  • Enable the ptoas auto-sync pass and compare it with the manual sync plan

@learning-chip learning-chip changed the title from "WIP Linear Attention" to "Chunkwise linear attention with step-by-step optimization to reach ~80 TFLOP/s" on Apr 5, 2026
@learning-chip learning-chip changed the title from "Chunkwise linear attention with step-by-step optimization to reach ~80 TFLOP/s" to "Chunkwise linear attention reaching ~80 TFLOP/s with step-by-step optimization history" on Apr 5, 2026
@learning-chip learning-chip marked this pull request as ready for review April 5, 2026 22:26
@learning-chip learning-chip changed the title from "Chunkwise linear attention reaching ~80 TFLOP/s with step-by-step optimization history" to "Chunkwise gated linear attention reaching 60~80 TFLOP/s, with step-by-step optimization records" on Apr 7, 2026