73 commits
07eca0c
static chunk GDN source from tilelang
learning-chip Apr 7, 2026
19ca186
standalone static chunk GDN
learning-chip Apr 7, 2026
7c05517
chain all kernels together to test e2e GDN
learning-chip Apr 7, 2026
01975ec
use unified PTO_LIB_PATH
learning-chip Apr 8, 2026
0bffb48
BSND varlen version of chunk_cumsum
learning-chip Apr 8, 2026
1c086d1
partial porting of dynamic chunk_h, wy_fast, kkt
learning-chip Apr 8, 2026
3223aa3
partial working dynamic chunk_h
learning-chip Apr 8, 2026
e6a2734
finish chunk_o part
learning-chip Apr 8, 2026
d49e3cc
fix scaled_dot_kkt functionality without hybriding torch helpers
learning-chip Apr 8, 2026
4373800
merge kkt into one kernel launch
learning-chip Apr 8, 2026
647bbf1
checkpointing todo items and lessons
learning-chip Apr 8, 2026
1fdc466
attempt to debug wy_fast
learning-chip Apr 8, 2026
71aaa29
wy fast now works correctly
learning-chip Apr 8, 2026
5225a90
finish chunk_h and update notes
learning-chip Apr 9, 2026
b9dfefb
add skill template for general NPU kernel dev
Apr 13, 2026
0bea68b
fix typo in skill
Apr 13, 2026
7d3118b
rewrite Mandatory requirements
Apr 13, 2026
dac9940
mark highly recommended practices
Apr 13, 2026
4c9b11d
performance measurement of static tilelang reference
learning-chip Apr 15, 2026
6433a7b
update static_baseline shape and benchmark result
learning-chip Apr 15, 2026
8828096
update triton reference benchmark numbers
learning-chip Apr 15, 2026
26bfcf0
rename dynamic bsnd dir
learning-chip Apr 15, 2026
22bae35
Finish varlen BSND version of chunk GDN close to triton/tilelang perf
learning-chip Apr 16, 2026
437cac4
minor cleanup
learning-chip Apr 16, 2026
044fbd0
update skills about random or dead-lock errors
learning-chip Apr 16, 2026
d226c20
longer timeout suggestion
learning-chip Apr 16, 2026
14b4d92
minor fix
learning-chip Apr 16, 2026
46d6b63
fix indeterministic sync error
learning-chip Apr 16, 2026
1dacf28
fix indeterministic sync error for chunk_o at large shape
learning-chip Apr 16, 2026
7318f2e
note on block_dim choice in skills
learning-chip Apr 16, 2026
f88509b
update perf numbers after fixing sync
learning-chip Apr 16, 2026
a9020a4
Update skills.md with NPU id selection advice
learning-chip Apr 16, 2026
d3aa7e1
Optimize performance for kkt and chunk_o
learning-chip Apr 16, 2026
0f68f38
update torch ref to mirror chunkwise algorithm, to reduce error threshold
learning-chip Apr 16, 2026
e580b77
checkpoint the lessons learned and todo list
learning-chip Apr 16, 2026
342e05b
deeper performance optimization that beats triton by 2x
learning-chip Apr 16, 2026
cf7471d
update performance lesson
learning-chip Apr 16, 2026
eea2f04
Fix vec - mte3 sync notes
learning-chip Apr 16, 2026
e1df2cd
inline common.h, and put educational comments
learning-chip Apr 16, 2026
0ea341b
denser, educational code comments
learning-chip Apr 17, 2026
147ebf4
test more shape combination for dynamic bsnd
learning-chip Apr 17, 2026
3fec747
fix crashing non-aligned seq boundary test case
learning-chip Apr 17, 2026
6a2da82
add numerical check notes to skills
learning-chip Apr 19, 2026
fb60462
More carefully check numerical error distribution
learning-chip Apr 19, 2026
1cccae0
Fix typo in error threshold documentation
learning-chip Apr 20, 2026
7434b6c
fixed numerical error for wy_w and chunk_o, now all stages pass stric…
learning-chip Apr 20, 2026
2f973bf
more longer shapes in e2e accuracy eval
learning-chip Apr 20, 2026
1759c86
add torch emulation for triton bsnd varlen algorithm
learning-chip Apr 20, 2026
1f35306
less conversion back and forth with numpy
learning-chip Apr 20, 2026
63a08ce
test more shape combinations in torch emulation
learning-chip Apr 20, 2026
3fb3ad4
handle tail chunks in torch emulation
learning-chip Apr 20, 2026
92fc0f3
denser comments for torch emulation
learning-chip Apr 20, 2026
48c198c
tri_inv_rec_unroll now supports low-triangular layout directly, to in…
learning-chip Apr 20, 2026
f9f947e
remove unused _transpose_valid_chunk function in e2e chained test
learning-chip Apr 20, 2026
6a68912
correctly calculate perf summary table
learning-chip Apr 20, 2026
53eae71
note on reusing npu stream
learning-chip Apr 20, 2026
419e0b2
finish GDN megakernel impl, test, and benchmark
learning-chip Apr 21, 2026
cc07a1b
rename torch emulation dir
learning-chip Apr 21, 2026
21cf836
torch emulation of pto kernel dataflow
learning-chip Apr 21, 2026
cd13f74
explicitly emulate C-V data passing via workspace
learning-chip Apr 21, 2026
693e767
more comments on index/offset calculations
learning-chip Apr 21, 2026
23711f8
more unified emulation APIs
learning-chip Apr 21, 2026
f2be42d
fix printed nan
learning-chip Apr 21, 2026
343fd95
avoid expensive sync and stream query inside kernel call
learning-chip Apr 25, 2026
7c29bcf
Test new tilelang varlen kernel (#138)
zouzias Apr 28, 2026
4d02825
finish grouped_value version of chunk_h kernel
learning-chip Apr 28, 2026
69d795c
porting learnings
learning-chip Apr 28, 2026
3257292
chunk_o supports grouped heads
learning-chip Apr 28, 2026
2f04341
wy_fast support group heads
learning-chip Apr 28, 2026
79b3d4e
scaled_dot_kkt now supports group head
learning-chip Apr 28, 2026
d399b7f
consolidate verify and benchmark scripts
learning-chip Apr 28, 2026
069f4ec
verify e2e chained groupvalue kernels
learning-chip Apr 28, 2026
49fab3b
add Megakernel for groupvalue shape
learning-chip Apr 28, 2026
180 changes: 180 additions & 0 deletions .skills/npu_kernel_general/skills.md
@@ -0,0 +1,180 @@
# General knowledge about writing, compiling, and executing kernels on NPU


## Mandatory requirements for NPU kernel tasks

These rules apply whenever you (the agent) **develop, port, or optimize NPU kernels**. They are **not optional** guidance.

**Definition of done (all are required):**

1. **Compile** the kernel with `bisheng`, following the patterns in `examples/jit_cpp` in this repo.
2. **Execute** it on a real NPU via torch-npu (PyTorch with `device="npu"`).
3. **Verify** numerical correctness against a PyTorch or NumPy reference.

Until all three succeed, the task is **not finished**. Do not treat "code written" or "compiles only" as completion.
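As a minimal sketch of step 3 (the binding name `my_kernel` and the cumsum reference below are hypothetical placeholders, not part of this repo):

```python
import torch
import torch_npu  # Ascend adapter; registers the "npu" device backend

def reference(x: torch.Tensor) -> torch.Tensor:
    # PyTorch reference in fp32 on CPU; replace with the real reference computation.
    return torch.cumsum(x.float(), dim=-1)

x = torch.randn(4, 64, 128, dtype=torch.bfloat16)
out_npu = my_kernel(x.to("npu"))  # hypothetical Python binding of the compiled kernel
torch.testing.assert_close(out_npu.cpu().float(), reference(x), rtol=1e-2, atol=1e-5)
```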

**You MUST:**

- Run the compile and NPU execution yourself and fix compile errors, runtime errors, and test failures by iterating until the kernel and its test scripts pass.
- Record the exact reproducing commands in that subdirectory’s `README.md` when the work is done so the user can re-run and confirm.

**You MUST NOT:**

- Ask the user to manually compile, run, or verify your new, still-untested code as a substitute for doing it yourself.

The environment is assumed to be capable of compiling and running on NPU; lack of access is not a reason to skip the steps above. Surface the failure and whatever blocked you instead of delegating execution to the user.

---

## Highly recommended practices

> **Highly recommended — not mandatory:** The subsections below are **strong default guidance** for NPU kernels (resources, PTO-ISA layout, buffer limits, core topology, synchronization, performance, and timing). They are **not** part of the mandatory definition of done in **Mandatory requirements for NPU kernel tasks**; follow them when they apply unless you have a documented reason to diverge.

### Pick free NPUs for execution

`npu-smi info` prints NPU availability like:

```
+---------------------------+---------------+----------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|
| Chip | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |
+===========================+===============+====================================================+
| 0 910B2 | OK | 103.6 50 0 / 0 |
| 0 | 0000:C1:00.0 | 0 0 / 0 3441 / 65536 |
+===========================+===============+====================================================+
...
+---------------------------+---------------+----------------------------------------------------+
| NPU Chip | Process id | Process name | Process memory(MB) |
+===========================+===============+====================================================+
| No running processes found in NPU 0 |
+===========================+===============+====================================================+
| No running processes found in NPU 1 |
+===========================+===============+====================================================+
...
```

Pick an NPU id with "No running processes", and avoid ids that already have other processes running on them, to prevent resource contention. For example, to switch to NPU id 7, set `torch.npu.set_device("npu:7")` at the very beginning of the Python test script.

When all NPUs are free, prefer the higher ids such as `npu:4`, `npu:5`, `npu:6`, or `npu:7`, because they are less likely to see contention. Avoid heavy use of `npu:0`, since many other users pick it by default.
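A minimal sketch of automating this choice by scanning `npu-smi info` for the "No running processes" marker (it assumes the output format shown above):

```python
import re
import subprocess
import torch
import torch_npu  # registers the torch.npu backend

def pick_free_npu(prefer_from: int = 4) -> int:
    """Pick an idle NPU id reported by npu-smi, preferring the higher ids."""
    text = subprocess.run(["npu-smi", "info"], capture_output=True, text=True).stdout
    free = [int(m) for m in re.findall(r"No running processes found in NPU (\d+)", text)]
    if not free:
        raise RuntimeError("no idle NPU found; check `npu-smi info` manually")
    preferred = [i for i in free if i >= prefer_from]
    return max(preferred or free)

torch.npu.set_device(f"npu:{pick_free_npu()}")  # must run before any NPU work
```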

### Find pto-isa doc, implementation, and unit tests

The kernels should be implemented using APIs from the "PTO-ISA" C++ library, just like the other existing kernel samples under `examples/jit_cpp` or `csrc/kernel` in this repo.

The "PTO-ISA" library source code is usually located in `/workdir/pto-isa-master` or `/sources/pto-isa` path. Prompt the user to check if those directories do not exist in your environment. The most important subdirectories under `pto-isa` / `pto-isa-master` are:
- ISA documentation: `docs/isa`
- C++ header implementation: `include/pto/npu/a2a3`
- Unit tests: `tests/npu/a2a3/src/st/testcase`

(The `a2a3` subdirectory name refers to the current `910B` hardware; the future `950` hardware uses the `a5` subdirectory.)


### Plan buffer space usage

`Tile` variables live in local SRAM buffers, which have limited size.

The hardware spec can be queried with `grep -A 20 "AICoreSpec" ${ASCEND_HOME_PATH}/arm64-linux/data/platform_config/Ascend910B2.ini`, which gives:

```ini
[AICoreSpec]
cube_freq=1800
cube_m_size=16
cube_n_size=16
cube_k_size=16
vec_calc_size=128
l0_a_size=65536
l0_b_size=65536
l0_c_size=131072
l1_size=524288
fb0_size=2048
fb1_size=1024
fb2_size=2048
fb3_size=2048
bt_size=1024
smask_buffer=0
ub_size=196608
ubblock_size=32
ubbank_size=4096
ubbank_num=64
ubburst_in_one_block=32
```

The most important pieces of information are:
- ub_size=192 KiB, for `Tile<TileType::Vec, ...>`
- l1_size=512 KiB, for `Tile<TileType::Mat, ...>`
- l0_a_size=l0_b_size=64 KiB, for `TileLeft` and `TileRight`
- l0_c_size=128 KiB, for `TileAcc`

Make effective use of those SRAM buffers. Too little usage leads to low hardware utilization, while too much usage leads to overflow errors.
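A minimal budget-check sketch against the limits quoted above (the helper and the example tile shapes are hypothetical, for illustrating the arithmetic only):

```python
# Buffer limits from Ascend910B2.ini, in bytes.
UB_SIZE = 192 * 1024    # ub_size, for Tile<TileType::Vec, ...>
L1_SIZE = 512 * 1024    # l1_size, for Tile<TileType::Mat, ...>
L0_AB_SIZE = 64 * 1024  # l0_a_size / l0_b_size, for TileLeft / TileRight
L0_C_SIZE = 128 * 1024  # l0_c_size, for TileAcc

def tile_bytes(rows, cols, dtype_bytes, num_buffers=2):
    """Bytes needed for one logical tile; doubled when double buffering."""
    return rows * cols * dtype_bytes * num_buffers

# Example: a 64x128 fp16 input tile plus a 64x64 fp32 scratch tile in UB,
# both double buffered; make sure the plan fits before writing the kernel.
planned = tile_bytes(64, 128, 2) + tile_bytes(64, 64, 4)
assert planned <= UB_SIZE, f"UB overflow: {planned} > {UB_SIZE} bytes"
```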

### Number of Cube and Vector cores

The `910B2` hardware contains 24 "Cube cores" for matrix multiplications, and 48 "Vector cores" for all other vector operations.

Confirm with `grep -A 8 "SoCInfo" ${ASCEND_HOME_PATH}/arm64-linux/data/platform_config/Ascend910B2.ini`:

```
[SoCInfo]
ai_core_cnt=24
cube_core_cnt=24
vector_core_cnt=48
ai_cpu_cnt=6
memory_type=
memory_size=68719476736
l2_type=0
l2_size=201326592
```

For complex "mix" kernels that use both Cube cores and Vector cores, one cube core is coordinated with two vector cores. `get_block_idx()` gives the logical id of Cube cores, while Vector core id is usually given by `const uint32_t vid = get_block_idx() * get_subblockdim() + get_subblockid();`

For the `block_dim` parameter needed by the kernel launch `<<< >>>`, set it to the number of cores, e.g. `BLOCK_DIM = int(getattr(torch.npu.get_device_properties("npu:0"), "cube_core_num", 20))`, so that one "block" is bound to one physical core. Avoid a large data-size-dependent `block_dim` as in normal CUDA kernels. An NPU kernel launch is closer to a "persistent kernel" in CUDA/triton: it uses `block_dim=num_cores` and manually loops over the dynamic-sized input data inside the kernel using for loops.
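A sketch of this mapping: the host-side `BLOCK_DIM` choice, plus a small Python model of the per-core loop a persistent kernel would run (the real loop lives in the C++ kernel body; `tasks_for_block` is only an illustration of the index math):

```python
import torch
import torch_npu  # registers the torch.npu backend

# Host side: one block per physical Cube core (910B2 has 24 of them;
# the two paired Vector cores per Cube core follow implicitly).
props = torch.npu.get_device_properties("npu:0")
BLOCK_DIM = int(getattr(props, "cube_core_num", 24))

def tasks_for_block(bid, num_tasks, num_blocks):
    """Python model of the in-kernel loop: block bid handles tasks bid, bid+num_blocks, ..."""
    return list(range(bid, num_tasks, num_blocks))

# e.g. 100 chunks spread over 24 Cube cores; block 0 handles chunks 0, 24, 48, 72, 96.
print(tasks_for_block(0, num_tasks=100, num_blocks=BLOCK_DIM))
```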


### Synchronization for concurrent executions

Data movement instructions (e.g. `TLOAD`/`TSTORE`/`TMOV`) and compute instructions (e.g. `TADD`, `TMATMUL`) are asynchronous. To avoid data hazards during software pipelining, insert `SetFlag` & `WaitFlag` instructions in between. Check the existing kernel samples under `examples/jit_cpp` or `csrc/kernel` of this repo for typical synchronization patterns.

Insufficient synchronization can lead to **nondeterministic bugs** that are hard to locate. Typical error patterns:
- The same kernel sometimes deadlocks or crashes, and sometimes runs through.
- The same kernel sometimes passes the numerical check, and sometimes does not.

These are due to the asynchronous nature of the execution units in hardware.

Good practices (see the sketch after this list):
- Always run the same verification script 3~5 times, not just once.
- Be prepared for a test script to hang: time out after waiting 60~90 seconds, so the agent session is not stuck forever.
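A minimal sketch of both practices, assuming the verification script is called `verify_kernel.py` (a hypothetical name):

```python
import subprocess
import sys

# Run the same verification script several times with a hard timeout,
# so a nondeterministic deadlock does not stall the agent session forever.
for attempt in range(5):
    try:
        result = subprocess.run(
            [sys.executable, "verify_kernel.py"],  # hypothetical script name
            timeout=90,  # kill a hung run after 90 seconds
            capture_output=True,
            text=True,
        )
        print(f"run {attempt}: returncode={result.returncode}")
    except subprocess.TimeoutExpired:
        print(f"run {attempt}: timed out (possible missing SetFlag/WaitFlag)")
```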


### Performance optimization practices

- Avoid heavy use of scalar computations and scalar for loops, as they run on the very slow "Scalar core" in the NPU. Use SIMD instructions like `TLOAD`, `TADD`.
- General rule of thumb: use wide SIMD lengths, and use "double buffers" (with two sync event ids) to overlap compute with data movement.
- Check against the ideal roofline peak. For the `910B2` device, the hardware roofline is about 1.5 TB/s of global memory bandwidth and ~300 TFLOP/s of matmul throughput.
- A kernel at less than 10% of roofline is concerning: it might be bottlenecked by the scalar cores, or the benchmark timer settings might be wrong.
- A kernel that appears to go far beyond roofline usually means the async kernel launch is not timed correctly, or there is L2 cache reuse across iterations (if it exceeds the bandwidth peak but not the FLOP peak). A roofline-check sketch follows this list.
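A minimal sketch of that roofline comparison using the 910B2 numbers above (the per-kernel byte and FLOP counts are estimates you supply; the example values are hypothetical):

```python
# 910B2 roofline numbers quoted above.
PEAK_BW_BYTES = 1.5e12  # ~1.5 TB/s global memory bandwidth
PEAK_FLOPS = 300e12     # ~300 TFLOP/s matmul throughput

def roofline_fraction(elapsed_ms, bytes_moved, flops):
    """Return the achieved fraction of the bandwidth and compute peaks."""
    t = elapsed_ms / 1e3
    return (bytes_moved / t) / PEAK_BW_BYTES, (flops / t) / PEAK_FLOPS

# Hypothetical example: 0.5 ms kernel moving 64 MB and doing 2 GFLOP.
bw_frac, fl_frac = roofline_fraction(elapsed_ms=0.5, bytes_moved=64e6, flops=2e9)
if max(bw_frac, fl_frac) < 0.10:
    print("warning: <10% of roofline; check scalar bottlenecks or timer settings")
if bw_frac > 1.0 and fl_frac < 1.0:
    print("warning: above bandwidth peak; likely L2 reuse or mistimed async launch")
```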

### NPU benchmark timer settings and caveats

Typical timing code using `torch.npu.Event` (similar to `torch.cuda.Event`) looks like:

```python
repeats = 20
samples_ms = []
for _ in range(repeats):
    torch.npu.synchronize()
    start = torch.npu.Event(enable_timing=True)
    end = torch.npu.Event(enable_timing=True)
    # can optionally clear the L2 cache here
    start.record()
    custom_kernel_launch()  # placeholder for the kernel under test
    end.record()
    end.synchronize()
    samples_ms.append(start.elapsed_time(end))
```

In most cases `torch.npu.synchronize()` can be used in place of the `end.synchronize()` line. But triton kernel launches (sometimes needed for perf comparison) do not appear to be synchronized by `torch.npu.synchronize()`, so `end.synchronize()` is used here instead.

Querying `torch.npu.current_stream()._as_parameter_` is relatively expensive. Reuse the stream pointer across timing loops.

### Choosing error threshold in numerical correctness check

Definitely avoid `atol=1e-2` in correctness checks. Intermediate activation values are often on the order of `1e-2`, so an assert with `atol=1e-2` can pass despite 100% relative error, which makes the check meaningless. Keep `atol` very small, e.g. `1e-5`. In comparison, `rtol=1e-2` is fine for the bfloat16 dtype; see the [`torch.testing.assert_close` defaults](https://docs.pytorch.org/docs/main/testing.html#torch.testing.assert_close).

If a few outliers break `rtol`, also check the `rmse` against the average output magnitude (`rmse` should be 1~2 orders of magnitude smaller than the output values themselves), and check the R2 score between kernel output and reference output (R2 should still reach 0.99 even with a few outliers).
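A minimal sketch of these checks (`out` and `ref` are the kernel and reference outputs; the 0.1 rmse ratio is one possible reading of "1~2 orders of magnitude smaller"):

```python
import torch

def check_outputs(out: torch.Tensor, ref: torch.Tensor) -> None:
    out, ref = out.float(), ref.float()
    try:
        # Strict elementwise check: tiny atol, bf16-appropriate rtol.
        torch.testing.assert_close(out, ref, rtol=1e-2, atol=1e-5)
        return
    except AssertionError as err:
        print(f"assert_close failed, falling back to rmse / R2 checks:\n{err}")
    # Fallback when a few outliers break rtol.
    rmse = torch.sqrt(torch.mean((out - ref) ** 2)).item()
    mean_mag = ref.abs().mean().item()
    ss_res = torch.sum((out - ref) ** 2)
    ss_tot = torch.sum((ref - ref.mean()) ** 2)
    r2 = (1.0 - ss_res / ss_tot).item()
    assert rmse < 0.1 * mean_mag, f"rmse {rmse:.3e} too large vs mean |ref| {mean_mag:.3e}"
    assert r2 > 0.99, f"R2 {r2:.4f} below 0.99"
    print(f"rmse={rmse:.3e}, mean|ref|={mean_mag:.3e}, R2={r2:.4f}")
```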