Chunkwise gated linear attention reaching 60~80 TFLOP/s, with step-by-step optimization records #88
Open

learning-chip wants to merge 31 commits into main from
**TL;DR:** This is a much (3~5x) faster version of the triton `chunk_o` in vllm-ascend.

Note on the chosen algorithm: it mirrors the `chunk_fwd_o` part of `chunk_gated_delta_rule_fwd` used in vllm-ascend's Qwen3.5 prefill. The "chunk_o" part is as simple as Gated Linear Attention (same as Mamba-2). This PR does not cover the "delta rule" part yet, which only appears in the "chunk_h" phase.
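For orientation, here is a minimal PyTorch reference sketch of what the `chunk_o` phase computes when reduced to plain gated linear attention (no delta rule). The tensor layout, gating convention, and function name are illustrative assumptions and do not reflect the actual kernel's tiling or memory layout.

```python
# Reference sketch only: chunk_o as plain chunkwise gated linear attention.
# Shapes, names and the gating convention are assumptions for illustration.
import torch

def chunk_o_reference(q, k, v, g, h, scale: float):
    """
    q, k : [B, H, NC, C, K]  queries / keys, split into NC chunks of length C
    v    : [B, H, NC, C, V]  values
    g    : [B, H, NC, C]     cumulative log-decay within each chunk
    h    : [B, H, NC, K, V]  recurrent state at the start of each chunk
                             (produced by the separate chunk_h phase)
    """
    C = q.shape[-2]
    causal = torch.tril(torch.ones(C, C, dtype=torch.bool, device=q.device))

    # inter-chunk term: decayed queries read from the incoming chunk state
    o_inter = torch.einsum("bhnck,bhnkv->bhncv",
                           q * scale * g.exp().unsqueeze(-1), h)

    # intra-chunk term: gated causal attention within each chunk
    attn = torch.einsum("bhnck,bhndk->bhncd", q * scale, k)   # [B,H,NC,C,C]
    attn = attn * (g.unsqueeze(-1) - g.unsqueeze(-2)).exp()   # exp(g_i - g_j)
    attn = attn.masked_fill(~causal, 0.0)
    o_intra = torch.einsum("bhncd,bhndv->bhncv", attn, v)

    return o_inter + o_intra
```

The real kernel fuses these two terms per chunk tile; the sketch is only meant to show the math being accelerated.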
### Step-by-step perf optimizations

Added an `optimize_step_by_step` directory to reproduce the above step-by-step performance gains, but at the latest commit.
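The actual scripts in that directory are not reproduced here; as a rough sketch, per-step timings on an Ascend device are typically collected along these lines (assuming `torch_npu` exposes `torch.npu.synchronize()`; the `bench` helper name is made up):

```python
# Hypothetical timing harness in the spirit of the optimize_step_by_step
# scripts; the real scripts (and their CLI) may differ.
import time
from statistics import median

import torch
import torch_npu  # noqa: F401  (registers the Ascend "npu" device with torch)

def bench(fn, *args, warmup: int = 10, iters: int = 50) -> float:
    """Return the median wall-clock time of fn(*args) in milliseconds."""
    for _ in range(warmup):
        fn(*args)
    torch.npu.synchronize()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        torch.npu.synchronize()
        times.append((time.perf_counter() - t0) * 1e3)
    return median(times)
```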
Feature list:

- … `triton_baseline/performance_summary.md`, PTO version is 3~4x faster)

Compiles and runs correctly using the pto-isa headers in `/usr/local/Ascend/cann-8.5.1/include` (the CANN version in the `quay.io/ascend/vllm-ascend:v0.18.0rc1` package). GitCode tag 8.5.0 also compiles and runs fine.
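A small, purely illustrative sketch of checking for those headers before JIT compilation; the `ASCEND_INCLUDE_DIR` environment variable and the function name are hypothetical, while the default path and the `pto/common/pto_tile.hpp` header come from this PR:

```python
# Illustrative pre-flight check before JIT-compiling the PTO kernels: verify
# that the pto-isa headers are visible under the installed CANN toolkit.
# ASCEND_INCLUDE_DIR and find_pto_include_dir are hypothetical names.
import os

def find_pto_include_dir() -> str:
    include_dir = os.environ.get(
        "ASCEND_INCLUDE_DIR", "/usr/local/Ascend/cann-8.5.1/include"
    )
    header = os.path.join(include_dir, "pto", "common", "pto_tile.hpp")
    if not os.path.isfile(header):
        raise FileNotFoundError(
            f"pto-isa headers not found under {include_dir}; "
            "check the installed CANN version (8.5.x is known to work)"
        )
    return include_dir
```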
### Remaining issues for this PR

- Add `#include <pto/common/pto_tile.hpp>` to the kernel sources, and change the use of `TTRI`. See commit 382153e.

Minor issues:
- To avoid slow mask construction using scalar-core loops, the causal mask is currently precomputed and passed in as an extra argument (see the sketch after this list).
- In vllm-ascend the on-SRAM mask is built by:
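As a sketch of the first point above (and explicitly not the vllm-ascend on-SRAM construction), the host-side precompute could look like the following; `get_causal_mask` mirrors the helper name imported by the PR's scripts, but its real signature and caching strategy are not shown here:

```python
# Sketch of the workaround: build the causal mask once on the host and pass it
# to the kernel as an extra argument, instead of constructing it with
# scalar-core loops on the device. Assumes torch_npu is imported so the "npu"
# device string is registered; the chunk_o_kernel call below is hypothetical.
from functools import lru_cache

import torch

@lru_cache(maxsize=None)
def get_causal_mask(chunk_size: int, device: str = "npu") -> torch.Tensor:
    # lower-triangular mask: position i may attend to positions j <= i
    return torch.tril(
        torch.ones(chunk_size, chunk_size, dtype=torch.bool, device=device)
    )

# hypothetical call site: the precomputed mask travels with the other arguments
# out = chunk_o_kernel(q, k, v, g, h, get_causal_mask(64))
```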
### Remaining issues for future PRs

- On algorithm side:
- On C++ framework side:
- On Python DSL framework side: