[New feature] integrate causal_conv1d Triton kernel for Ascend NPU #228
Add self-contained causal_conv1d kernel module (no mindspeed_ops dependency) with full Triton forward/backward implementations adapted from MindSpeed-Ops. Patch monkey_patch_npu to bind npu_causal_conv1d_fn on NPU-patched modules, remove torch fallback in linear_attention_sp, and add NPU-aware causal_conv1d wrapper in gdn_padding_free (no transpose needed, [B,T,D] native format).
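For readers checking the kernel's output, the operation can be written as a slow reference in plain Python. This is only a sketch under assumptions not confirmed by this PR: a depthwise weight laid out as `[D, W]` (the convention of the upstream `causal_conv1d` package), a single sequence in `[T, D]` layout, and an optional SiLU activation. The name `causal_conv1d_ref` is hypothetical and is not part of the added module.

```python
import math


def causal_conv1d_ref(x, weight, bias=None, activation=None):
    """Slow reference for a depthwise causal 1D convolution.

    x:      list of T rows, each a list of D floats ([T, D] layout)
    weight: list of D rows, each a list of W taps ([D, W], assumed layout)
    bias:   list of D floats, or None
    """
    T = len(x)
    D = len(weight)
    W = len(weight[0])
    out = [[0.0] * D for _ in range(T)]
    for t in range(T):
        for d in range(D):
            acc = bias[d] if bias is not None else 0.0
            for k in range(W):
                # Tap k reads position t - (W - 1) + k, so the last tap
                # sees the current step and no tap sees the future.
                src = t - (W - 1) + k
                if src >= 0:  # positions before the sequence start count as zero
                    acc += weight[d][k] * x[src][d]
            if activation == "silu":
                acc = acc / (1.0 + math.exp(-acc))
            out[t][d] = acc
    return out
```

Comparing a kernel's output against a reference like this on small inputs, with and without bias, is one way to catch an ignored `bias` argument early.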
Code Review
This pull request introduces a self-contained, NPU-accelerated causal_conv1d Triton kernel module to support Ascend NPUs, integrating it into the monkey patching, sequence parallel, and padding-free GDN mechanisms. The code review identified several critical issues and bugs: a missing HAS_WEIGHT guard in the backward kernel when storing dw, a shape mismatch and argument-dropping bug in the NPU wrapper within gdn_padding_free.py, compatibility issues with smaller feature dimensions due to a hardcoded block size (BD = 256), and an ignored bias parameter in the forward update kernel. Additionally, an optimization was suggested for _prepare_chunk_indices to avoid host-device synchronization.
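One common way to handle the hardcoded block size is to derive it from the feature dimension instead. The sketch below is an assumption about how that could look, not the fix adopted in this PR: `pick_block_d` and `num_blocks` are hypothetical helper names, and the cap of 256 simply mirrors the `BD = 256` value mentioned in the review.

```python
def pick_block_d(d, max_block=256):
    """Return a power-of-two block size for feature dimension d, capped at max_block."""
    if d <= 0:
        raise ValueError("feature dimension must be positive")
    block = 1
    while block < d:
        block *= 2  # round up to the next power of two
    return min(block, max_block)


def num_blocks(d, block):
    """Number of blocks needed to cover d features (ceiling division)."""
    return (d + block - 1) // block
```

With this choice a 64-wide feature dimension launches a single 64-wide block rather than a mostly empty 256-wide one. When `d` is not a multiple of the block size, the kernel would still need masked loads and stores on the last block.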
PR type
PR information
integrate causal_conv1d Triton kernel for Ascend NPU
Experiment results
Model: Qwen3.5-4B
Hardware: Atlas 900 A3 (2 x NPU)
Dataset: GSM8K_ZH
Finetuning type: LoRA
Software: CANN 9.0.0 + torch/torch_npu 2.9.0 + MindSpeed 0.12.1 + triton-ascend 3.2.1 + transformers 5.9
Related: https://gitcode.com/Ascend/MindSpeed-Ops