
compare host vs device-side chunk metadata computation #84

Draft

learning-chip wants to merge 14 commits into main from inverse_varlen_hostmeta

Conversation

Collaborator

@learning-chip learning-chip commented Apr 1, 2026

See the file metadata_overhead.md for a summary. Using the host CPU is not faster, so this change is not folded into #79.

(Benchmark plots: `bench_results_bsnd_fast_inverse_bw_128`, `bench_results_bsnd_fast_inverse_bw_64`)


### Layout conventions

In general, the input to the `fast_inverse` kernels is a set of `D × D` triangular matrices. Depending on how these matrices are stored in memory, we get either the `contiguous` layout or the so-called `BSND` layout. The main input is a batch of sequences; each sequence is split into chunks of length `chunk_size`, and this `chunk_size` equals the matrix size `D`.
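As a rough sketch of the two conventions (the exact shapes below are assumptions for illustration, not taken from the kernels): in a flat "contiguous" layout all per-chunk `D × D` matrices are packed back-to-back, while a BSND-style layout keeps the batch and chunk axes separate, so only the indexing differs.

```python
import numpy as np

# Hypothetical sizes (not from the PR): a batch of 2 sequences,
# each split into 3 chunks of length chunk_size == D == 4.
B, num_chunks, D = 2, 3, 4
chunk_size = D  # the chunk length equals the matrix size D

# Assumed "contiguous" layout: all D x D matrices packed back-to-back.
contiguous = np.zeros((B * num_chunks, D, D))

# Assumed BSND-style layout: batch and chunk axes kept separate,
# so the matrix for (batch b, chunk c) lives at bsnd[b, c].
bsnd = np.zeros((B, num_chunks, D, D))

# The kernel inputs are triangular, so fill each matrix with a
# lower-triangular pattern (broadcast over the leading axes).
tril = np.tril(np.ones((D, D)))
contiguous[:] = tril
bsnd[:] = tril

# Both layouts hold the same matrices; only the indexing differs:
for b in range(B):
    for c in range(num_chunks):
        assert np.array_equal(contiguous[b * num_chunks + c], bsnd[b, c])
```

Under these assumptions the contiguous layout is just the BSND layout with the leading two axes flattened, which is why the choice mainly affects address computation rather than the math itself.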

In the FLA convention, `D` is the `hidden_dim`, and `chunk_size` should probably be named `C`...

Base automatically changed from inverse_varlen to main on April 14, 2026 13:23

2 participants