
compare host vs device-side chunk metadata computation #84

Draft

learning-chip wants to merge 14 commits into main from inverse_varlen_hostmeta

Conversation

Collaborator

@learning-chip learning-chip commented Apr 1, 2026

See the file metadata_overhead.md for a summary. Using the host CPU is not faster, so this change is not folded into #79.

(Benchmark plots: `bench_results_bsnd_fast_inverse_bw_128`, `bench_results_bsnd_fast_inverse_bw_64`)


### Layout conventions

In general, the input to the `fast_inverse` kernels is a set of `D × D` triangular matrices. Depending on how these matrices are stored in memory, we get either the `contiguous` layout or the so-called `BSND` layout. The main input is a batch of sequences; each sequence is split into chunks of length `chunk_size`, and this `chunk_size` equals the matrix size `D`.
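As a rough sketch of the two conventions (the exact shapes below are assumptions for illustration, not taken from the kernels): in a flat "contiguous" layout all per-chunk `D × D` matrices are packed back-to-back, while a BSND-style layout keeps the batch and chunk axes separate, so only the indexing differs.

```python
import numpy as np

# Hypothetical sizes (not from the PR): a batch of 2 sequences,
# each split into 3 chunks of length chunk_size == D == 4.
B, num_chunks, D = 2, 3, 4
chunk_size = D  # the chunk length equals the matrix size D

# Assumed "contiguous" layout: all D x D matrices packed back-to-back.
contiguous = np.zeros((B * num_chunks, D, D))

# Assumed BSND-style layout: batch and chunk axes kept separate,
# so the matrix for (batch b, chunk c) lives at bsnd[b, c].
bsnd = np.zeros((B, num_chunks, D, D))

# The kernel inputs are triangular, so fill each matrix with a
# lower-triangular pattern (broadcast over the leading axes).
tril = np.tril(np.ones((D, D)))
contiguous[:] = tril
bsnd[:] = tril

# Both layouts hold the same matrices; only the indexing differs:
for b in range(B):
    for c in range(num_chunks):
        assert np.array_equal(contiguous[b * num_chunks + c], bsnd[b, c])
```

Under these assumptions the contiguous layout is just the BSND layout with the leading two axes flattened, which is why the choice mainly affects address computation rather than the math itself.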

In the FLA convention, `D` is the `hidden_dim`, and `chunk_size` should probably be named `C`...

Base automatically changed from inverse_varlen to main on April 14, 2026 13:23

2 participants