
Complete chunkwise GatedDeltaNet #91

Open
learning-chip wants to merge 73 commits into linear_attn from chunk_gdn

Conversation

Collaborator

@learning-chip learning-chip commented Apr 7, 2026

Finish the remaining parts of #88 to support the full Qwen3.5 GDN layer.

Reproduce: compiles and runs with the pto-isa commit from April 03. I used this modified vllm-ascend docker image, with triton-ascend pre-installed, which makes it easier to compare against the triton baseline in vllm.

Performance

Shape: (N_seq=16, L_seg=16384, H=16, DK=DV=128, C=128), packed varlen BSND with T=262144 (16 sequences × 16384 tokens = 262144 total tokens).

Kernel                 PTO (ms)   Triton (ms)   Speedup   TFLOPS
chunk_cumsum           0.34       1.02          3.00x     0.012
chunk_scaled_dot_kkt   2.78       4.84          1.74x     24.8
wy_fast                6.85       15.63         2.28x     20.1
chunk_h                9.43       30.83         3.27x     29.1
chunk_o                11.35      16.15         1.42x     30.3
total                  30.75      68.47         2.23x     26.8

Reproduced by chunk_gdn/dynamic_bsnd vs chunk_gdn/triton_baseline

Accuracy evaluation

varlen_1,63,64,65,127,128,129,447,512,640,1920_long_ladder

Reproduced by chunk_gdn/pto_e2e_measure
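
The ladder name encodes the segment lengths packed into one token stream; the kernels locate segment boundaries through cumulative offsets. A minimal sketch of that layout (my own illustration; cu_seqlens matches the argument name in the kernel interfaces quoted below):

import torch

# Pack the varlen ladder into one stream; cu_seqlens holds the cumulative
# segment boundaries that the varlen kernels index with.
seq_lens = [1, 63, 64, 65, 127, 128, 129, 447, 512, 640, 1920]
cu_seqlens = torch.tensor([0] + seq_lens).cumsum(0)
# tensor([   0,    1,   64,  128,  193,  320,  448,  577, 1024, 1536, 2176, 4096])
# segment i occupies packed tokens cu_seqlens[i] : cu_seqlens[i + 1]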

Feature list

  • Basic BNSD static shape that passes the e2e GDN unit test (see chunk_gdn/static_baseline/gdn_chain_e2e_static.py; a naive reference recurrence is sketched after this list)
  • Support BSND varlen that matches the triton kernel API used in vllm/sglang (see chunk_gdn/dynamic_bsnd)
  • Performance tuning (e.g. C-V pipelining, L1/L0 double buffering)
  • Performance comparison to the Triton baseline
  • Merge into one single "megakernel" launch
  • Deploy into vllm-ascend and verify e2e -- tested a small Qwen model in patch_vllm_pto
  • Support Grouped Value Attention where num_key_head < num_value_head (required by larger Qwen models)
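
For context, the recurrence that the chunked kernels implement, written as a naive per-token loop. This is my own hedged sketch of the standard gated delta rule (the state decays by exp(g), then receives a beta-scaled delta-rule update); it is meant only as a slow float32 oracle, not the project's actual test reference:

import torch

def gated_delta_rule_ref(q, k, v, g, beta, scale=1.0):
    """Naive oracle. q, k: (T, H, DK); v: (T, H, DV); g, beta: (T, H)."""
    T, H, DK = k.shape
    DV = v.shape[-1]
    S = torch.zeros(H, DK, DV)                       # recurrent state per head
    o = torch.empty(T, H, DV)
    for t in range(T):
        for h in range(H):
            S[h] = S[h] * torch.exp(g[t, h])         # gated decay
            kt = k[t, h].float()
            err = v[t, h].float() - kt @ S[h]        # delta-rule prediction error
            S[h] = S[h] + beta[t, h] * torch.outer(kt, err)
            o[t, h] = scale * (q[t, h].float() @ S[h])
    return o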

@learning-chip learning-chip changed the title from "Chunk gdn" to "Complete chunkwise GatedDeltaNet" Apr 7, 2026
@learning-chip learning-chip marked this pull request as ready for review April 16, 2026 07:31
Collaborator

@asobczyk asobczyk left a comment


I was able to look at about 20 of the 113 files and left some comments/suggestions.
In general it looks good, but I have some general remarks:

  1. Must-change: the changes under csrc/kernel/kernel_tri_inv_rec_unroll.cpp should be thoroughly examined and tested in a separate, isolated PR, with dedicated unit tests.
  2. Nice-to-have: I would avoid special characters in the source code files, such as arrows, "\mathbb{R}", or Greek letters. It is better to be consistent with the variable names used by the functions.
  3. Nice-to-have: Doxygen-style docstrings are missing -- the current descriptions/docstrings could be translated to doxy-style.
  4. Nice-to-have: ideally, the main kernels that are used should be ported to csrc/kernels, one PR per kernel, with source code, torch integration, and unit tests. I know this is tedious work, so for now I do not mind if we do it in a separate PR.

AICORE inline void CopyOddOrEvenBlocksL1ToL0(SrcL1TileT src, DstL0TileT dst,
-                                            uint32_t block_size) {
+                                            uint32_t block_size,
+                                            bool swap_parity = false) {
Collaborator

the changes in this file (in any csrc/ file) should go to a separate MR with unit tests

Collaborator

I also believe that if swap_parity is only used to decide between upper/lower triangular, we can implement it in a much more seamless way, simply by reading in a row-major vs. column-major manner.
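
To illustrate the suggestion (my own sketch, not the kernel code): transposing a lower-triangular tile gives exactly the upper-triangular part of the transposed tile, so reading the same bytes column-major instead of row-major covers both cases without a parity flag.

import torch

# A lower-triangular tile walked column-major visits the same elements
# as the upper-triangular transpose walked row-major.
A = torch.arange(16.0).reshape(4, 4)
assert torch.equal(torch.tril(A).t(), torch.triu(A.t()))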

// For left: copy even blocks 0, 2, 4, ... (starting_block=0)
// For right: copy odd blocks 1, 3, 5, ... (starting_block=1)
const uint32_t starting_block_index = is_left ? 0 : 1;
// Default: left→even(0), right→odd(1). swap_parity flips this.
Collaborator

It might be better to avoid special characters such as the "→" arrows in this comment.

@@ -0,0 +1,263 @@
#!/usr/bin/env python3
"""
Benchmark dynamic BSND PTO kernels (bisheng-compiled, ctypes) for chunk GDN.
Collaborator

It would be helpful to expand the description here on what is being benchmarked, and how (a bird's-eye view).

// stream = NPU stream for async execution (like CUDA stream)
// rtGetC2cCtrlAddr: gets the FFTS control address for cross-core sync
// <<<block_dim, nullptr, stream>>>: NPU kernel launch syntax (like CUDA <<<>>>)
extern "C" void call_kernel(
Collaborator

I am not a big fan of the call_kernel name, especially when it becomes an extern "C" name. In all our kernels we use a descriptive name; in this case, something like chunk_cumsum_fp32.

batch_size, seq_len, total_tokens, ffts_addr);
}

extern "C" void call_kernel(
Collaborator

suggestion for name change: chunk_h_fp16

Collaborator Author

@learning-chip learning-chip Apr 21, 2026

call_kernel_chunk_h_fp16 ?

batch_size, seq_len, total_tokens, ffts_addr);
}

// ── Host-side launcher ────────────────────────────────────────────────
Collaborator

might be better to use a doxy-style docstring

if _HERE not in sys.path:
    sys.path.insert(0, _HERE)

import numpy as np
Collaborator

ruff complains here; just make sure to run pre-commit to silence those warnings



if __name__ == "__main__":
    main()
Collaborator

these are very useful files; we should eventually adapt them as unit tests under tests/

@@ -0,0 +1,111 @@
#include <pto/pto-inst.hpp>
Collaborator

the name of this folder is _old. If it is old and deprecated maybe we can remove it completely?

Collaborator Author

> the name of this folder is _old. If it is old and deprecated maybe we can remove it completely?

Yes. I don't intend to merge this PR into main; instead, I should extract the useful pieces out as cleaner PRs.

@@ -0,0 +1,145 @@
#!/usr/bin/env python3
"""
Benchmark mega-kernel vs aggregated per-stage PTO kernels.
Collaborator

It would be helpful to write 1-2 sentences on what is being benchmarked (a brief overview).
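
As a hedged aside (my own sketch of the usual pattern, not this script's contents): such a comparison typically does warmup launches, then times a synchronized loop, once for the mega-kernel and once for the sum of the five per-stage kernels. launch and sync below are caller-supplied placeholders, not names from this PR.

import time

def bench(launch, sync, iters=50, warmup=5):
    """Average wall time per launch; launch() enqueues work, sync() drains the device."""
    for _ in range(warmup):
        launch()
    sync()                      # finish warmup work before starting the clock
    t0 = time.perf_counter()
    for _ in range(iters):
        launch()
    sync()                      # include all enqueued work in the measurement
    return (time.perf_counter() - t0) / iters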

Collaborator

@gioelegott gioelegott left a comment

Tested the mega-kernel and all tests pass

Comment on lines +150 to +161
def run_mega_kernel(
    q: torch.Tensor,
    k: torch.Tensor,
    v: torch.Tensor,
    g_in: torch.Tensor,
    beta: torch.Tensor,
    cu_seqlens: torch.Tensor,
    *,
    chunk_size: int = 128,
    scale: float = 1.0,
    block_dim: int | None = None,
) -> torch.Tensor:
Collaborator

The interface is somewhat different from sgl-kernel-npu, but still compatible:

def chunk_gated_delta_rule_npu(
    q: torch.Tensor,
    k: torch.Tensor,
    v: torch.Tensor,
    g: torch.Tensor,
    beta: torch.Tensor,
    scale: float = None,
    initial_state: torch.Tensor = None,
    output_final_state: bool = True,
    cu_seqlens: Optional[torch.LongTensor] = None,
    head_first: bool = False,
    use_qk_l2norm_in_kernel: bool = False,
):
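
A thin adapter would reconcile the two. A hedged sketch (my own, hypothetical; it forwards the overlapping arguments and rejects the options run_mega_kernel above does not expose):

from typing import Optional

import torch

def chunk_gated_delta_rule_adapter(
    q, k, v, g, beta,
    scale: float = None,
    initial_state: torch.Tensor = None,
    output_final_state: bool = True,   # accepted for API compatibility only
    cu_seqlens: Optional[torch.LongTensor] = None,
    head_first: bool = False,
    use_qk_l2norm_in_kernel: bool = False,
):
    # run_mega_kernel (quoted above) exposes none of these options yet
    assert initial_state is None and not head_first and not use_qk_l2norm_in_kernel
    if scale is None:
        scale = q.shape[-1] ** -0.5    # common default: 1/sqrt(DK)
    o = run_mega_kernel(q, k, v, g, beta, cu_seqlens, scale=scale)
    return o, None                     # final state is not returned by run_mega_kernel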

zouzias and others added 9 commits April 28, 2026 18:07
* wip

* push cpp code

* use backend='pto'

* unit test varlen

* dump varlen source code with head 32 and 48 variants

* fix comment

* standalone PTO demo ported from tilelang

---------

Co-authored-by: Anastasios Zouzias <anastasios.zouzias@huawei.com>
Co-authored-by: learning-chip <jiawei.zhuang@outlook.com>