
Complete chunkwise GatedDeltaNet #91

Open
learning-chip wants to merge 73 commits into linear_attn from chunk_gdn

Conversation

Collaborator

@learning-chip learning-chip commented Apr 7, 2026

Finish the remaining parts of #88 to support the full Qwen3.5 GDN layer.

Reproduce: compiles and runs with the pto-isa commit from April 03. I used this modified vllm-ascend docker image, with triton-ascend pre-installed, which makes it easier to compare against the triton baseline in vllm.

Performance

Shape: (N_seq=16, L_seg=16384, H=16, DK=DV=128, C=128), packed varlen BSND with T=262144 (16 sequences × 16384 tokens = 262144 total tokens).

Kernel                 PTO (ms)   Triton (ms)   Speedup   TFLOPS
chunk_cumsum           0.34       1.02          3.00x     0.012
chunk_scaled_dot_kkt   2.78       4.84          1.74x     24.8
wy_fast                6.85       15.63         2.28x     20.1
chunk_h                9.43       30.83         3.27x     29.1
chunk_o                11.35      16.15         1.42x     30.3
total                  30.75      68.47         2.23x     26.8

Reproduced by chunk_gdn/dynamic_bsnd vs chunk_gdn/triton_baseline

Accuracy evaluation

varlen_1,63,64,65,127,128,129,447,512,640,1920_long_ladder

Reproduced by chunk_gdn/pto_e2e_measure
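
The ladder name encodes the segment lengths packed into one token stream; the kernels locate segment boundaries through cumulative offsets. A minimal sketch of that layout (my own illustration; cu_seqlens matches the argument name in the kernel interfaces quoted below):

import torch

# Pack the varlen ladder into one stream; cu_seqlens holds the cumulative
# segment boundaries that the varlen kernels index with.
seq_lens = [1, 63, 64, 65, 127, 128, 129, 447, 512, 640, 1920]
cu_seqlens = torch.tensor([0] + seq_lens).cumsum(0)
# tensor([   0,    1,   64,  128,  193,  320,  448,  577, 1024, 1536, 2176, 4096])
# segment i occupies packed tokens cu_seqlens[i] : cu_seqlens[i + 1]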

Feature list

  • Basic BNSD static shape that passes the e2e GDN unit test (see chunk_gdn/static_baseline/gdn_chain_e2e_static.py; a naive reference recurrence is sketched after this list)
  • Support BSND varlen that matches the triton kernel API used in vllm/sglang (see chunk_gdn/dynamic_bsnd)
  • Performance tuning (e.g. C-V pipelining, L1/L0 double buffering)
  • Performance comparison to the Triton baseline
  • Merge into one single "megakernel" launch
  • Deploy into vllm-ascend and verify e2e -- tested a small Qwen model in patch_vllm_pto
  • Support Grouped Value Attention where num_key_head < num_value_head (required by larger Qwen models)
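
For context, the recurrence that the chunked kernels implement, written as a naive per-token loop. This is my own hedged sketch of the standard gated delta rule (the state decays by exp(g), then receives a beta-scaled delta-rule update); it is meant only as a slow float32 oracle, not the project's actual test reference:

import torch

def gated_delta_rule_ref(q, k, v, g, beta, scale=1.0):
    """Naive oracle. q, k: (T, H, DK); v: (T, H, DV); g, beta: (T, H)."""
    T, H, DK = k.shape
    DV = v.shape[-1]
    S = torch.zeros(H, DK, DV)                       # recurrent state per head
    o = torch.empty(T, H, DV)
    for t in range(T):
        for h in range(H):
            S[h] = S[h] * torch.exp(g[t, h])         # gated decay
            kt = k[t, h].float()
            err = v[t, h].float() - kt @ S[h]        # delta-rule prediction error
            S[h] = S[h] + beta[t, h] * torch.outer(kt, err)
            o[t, h] = scale * (q[t, h].float() @ S[h])
    return o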

@learning-chip learning-chip changed the title from "Chunk gdn" to "Complete chunkwise GatedDeltaNet" Apr 7, 2026
@learning-chip learning-chip marked this pull request as ready for review April 16, 2026 07:31
Collaborator

@asobczyk asobczyk left a comment


I was able to look at about 20 of the 113 files and left some comments/suggestions.
In general it looks good, but I have some general remarks:

  1. Must-change: the changes under csrc/kernel/kernel_tri_inv_rec_unroll.cpp should be thoroughly examined and tested in a separate, isolated PR, with dedicated unit tests.
  2. Nice-to-have: I would avoid special characters in the source code files, such as arrows, "\mathbb{R}", or Greek letters. It is better to be consistent with the variable names used by the functions.
  3. Nice-to-have: Doxygen-style docstrings are missing -- the current descriptions/docstrings could be translated to doxy-style.
  4. Nice-to-have: ideally, the main kernels that are used should be ported to csrc/kernels, one PR per kernel, with source code, torch integration, and unit tests. I know this is tedious work, so for now I do not mind if we do it in a separate PR.

AICORE inline void CopyOddOrEvenBlocksL1ToL0(SrcL1TileT src, DstL0TileT dst,
-                                            uint32_t block_size) {
+                                            uint32_t block_size,
+                                            bool swap_parity = false) {
Collaborator

the changes in this file (in any csrc/ file) should go to a separate MR with unit tests

Collaborator

I also believe that if swap_parity is only used to decide between upper/lower triangular, we can implement it in a much more seamless way, simply by reading in a row-major vs. column-major manner.
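
To illustrate the suggestion (my own sketch, not the kernel code): transposing a lower-triangular tile gives exactly the upper-triangular part of the transposed tile, so reading the same bytes column-major instead of row-major covers both cases without a parity flag.

import torch

# A lower-triangular tile walked column-major visits the same elements
# as the upper-triangular transpose walked row-major.
A = torch.arange(16.0).reshape(4, 4)
assert torch.equal(torch.tril(A).t(), torch.triu(A.t()))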

// For left: copy even blocks 0, 2, 4, ... (starting_block=0)
// For right: copy odd blocks 1, 3, 5, ... (starting_block=1)
const uint32_t starting_block_index = is_left ? 0 : 1;
// Default: left→even(0), right→odd(1). swap_parity flips this.
Collaborator

It might be better to avoid special characters such as the "→" arrows in this comment.

@@ -0,0 +1,263 @@
#!/usr/bin/env python3
"""
Benchmark dynamic BSND PTO kernels (bisheng-compiled, ctypes) for chunk GDN.
Collaborator

It would be helpful to expand the description here on what is being benchmarked, and how (a bird's-eye view).

// stream = NPU stream for async execution (like CUDA stream)
// rtGetC2cCtrlAddr: gets the FFTS control address for cross-core sync
// <<<block_dim, nullptr, stream>>>: NPU kernel launch syntax (like CUDA <<<>>>)
extern "C" void call_kernel(
Collaborator

I am not a big fan of the call_kernel name, especially when it becomes an extern "C" name. In all our kernels we use a descriptive name; in this case, something like chunk_cumsum_fp32.

batch_size, seq_len, total_tokens, ffts_addr);
}

extern "C" void call_kernel(
Collaborator

suggestion for name change: chunk_h_fp16

Collaborator Author

@learning-chip learning-chip Apr 21, 2026

call_kernel_chunk_h_fp16 ?

batch_size, seq_len, total_tokens, ffts_addr);
}

// ── Host-side launcher ────────────────────────────────────────────────
Collaborator

might be better to use a doxy-style docstring

if _HERE not in sys.path:
    sys.path.insert(0, _HERE)

import numpy as np
Collaborator

ruff complains here; just make sure to run pre-commit to silence those warnings



if __name__ == "__main__":
    main()
Collaborator

these are very useful files; we should eventually adapt them as unit tests under tests/

@@ -0,0 +1,111 @@
#include <pto/pto-inst.hpp>
Collaborator

the name of this folder is _old. If it is old and deprecated maybe we can remove it completely?

Collaborator Author

> the name of this folder is _old. If it is old and deprecated maybe we can remove it completely?

Yes. I don't intend to merge this PR into main; instead, I should extract the useful pieces out as cleaner PRs.

@@ -0,0 +1,145 @@
#!/usr/bin/env python3
"""
Benchmark mega-kernel vs aggregated per-stage PTO kernels.
Collaborator

It would be helpful to write 1-2 sentences on what is being benchmarked (a brief overview).
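
As a hedged aside (my own sketch of the usual pattern, not this script's contents): such a comparison typically does warmup launches, then times a synchronized loop, once for the mega-kernel and once for the sum of the five per-stage kernels. launch and sync below are caller-supplied placeholders, not names from this PR.

import time

def bench(launch, sync, iters=50, warmup=5):
    """Average wall time per launch; launch() enqueues work, sync() drains the device."""
    for _ in range(warmup):
        launch()
    sync()                      # finish warmup work before starting the clock
    t0 = time.perf_counter()
    for _ in range(iters):
        launch()
    sync()                      # include all enqueued work in the measurement
    return (time.perf_counter() - t0) / iters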

Collaborator

@gioelegott gioelegott left a comment

Tested the mega-kernel and all tests pass

Comment on lines +150 to +161
def run_mega_kernel(
    q: torch.Tensor,
    k: torch.Tensor,
    v: torch.Tensor,
    g_in: torch.Tensor,
    beta: torch.Tensor,
    cu_seqlens: torch.Tensor,
    *,
    chunk_size: int = 128,
    scale: float = 1.0,
    block_dim: int | None = None,
) -> torch.Tensor:
Collaborator

The interface is somewhat different from sgl-kernel-npu, but still compatible:

def chunk_gated_delta_rule_npu(
    q: torch.Tensor,
    k: torch.Tensor,
    v: torch.Tensor,
    g: torch.Tensor,
    beta: torch.Tensor,
    scale: float = None,
    initial_state: torch.Tensor = None,
    output_final_state: bool = True,
    cu_seqlens: Optional[torch.LongTensor] = None,
    head_first: bool = False,
    use_qk_l2norm_in_kernel: bool = False,
):
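
A thin adapter would reconcile the two. A hedged sketch (my own, hypothetical; it forwards the overlapping arguments and rejects the options run_mega_kernel above does not expose):

from typing import Optional

import torch

def chunk_gated_delta_rule_adapter(
    q, k, v, g, beta,
    scale: float = None,
    initial_state: torch.Tensor = None,
    output_final_state: bool = True,   # accepted for API compatibility only
    cu_seqlens: Optional[torch.LongTensor] = None,
    head_first: bool = False,
    use_qk_l2norm_in_kernel: bool = False,
):
    # run_mega_kernel (quoted above) exposes none of these options yet
    assert initial_state is None and not head_first and not use_qk_l2norm_in_kernel
    if scale is None:
        scale = q.shape[-1] ** -0.5    # common default: 1/sqrt(DK)
    o = run_mega_kernel(q, k, v, g, beta, cu_seqlens, scale=scale)
    return o, None                     # final state is not returned by run_mega_kernel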

zouzias and others added 9 commits April 28, 2026 18:07
* wip

* push cpp code

* use backend='pto'

* unit test varlen

* dump varlen source code with head 32 and 48 variants

* fix comment

* standalone PTO demo ported from tilelang

---------

Co-authored-by: Anastasios Zouzias <anastasios.zouzias@huawei.com>
Co-authored-by: learning-chip <jiawei.zhuang@outlook.com>