Conversation


@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team commented Nov 13, 2025

Description

This follows the latest paper: where previous 1-bit schemes packed each weight into 2 bits, we show that sub-1-bit quantization (0.7 ~ 0.8 bits per weight on average) can also produce very good results, and only 1 bit is needed to pack each weight.
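As a rough illustration of the packing claim (a NumPy sketch with hypothetical helper names, not the PR's Metal kernel), sign weights can be stored at 1 bit each; the sub-1-bit average in the paper comes from codebook/LUT compression on top of this:

```python
import numpy as np

def pack_binary_weights(w_sign):
    """Pack {-1, +1} weights into bits: 8 weights per byte (1 bit each)."""
    bits = (w_sign > 0).astype(np.uint8)   # map {-1, +1} -> {0, 1}
    return np.packbits(bits)

def unpack_binary_weights(packed, n):
    """Recover the first n sign weights from the packed bytes."""
    bits = np.unpackbits(packed)[:n]
    return (bits.astype(np.int8) * 2 - 1)  # map {0, 1} -> {-1, +1}

w = np.array([1, -1, -1, 1, 1, 1, -1, 1], dtype=np.int8)
packed = pack_binary_weights(w)
assert packed.nbytes == 1                  # 8 weights fit in one byte
assert np.array_equal(unpack_binary_weights(packed, len(w)), w)
```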

Config:

MLX: 0.29.3
PyTorch: 2.9
System:

  • M4 Pro (20 Metal 3 GPU cores + 16 ANE cores),
  • M3 Ultra (80 Metal 3 GPU cores + 32 ANE cores),

Test

Simple Case
[Screenshot 2025-11-14, 8:42 PM]

We found that Apple's fp16 AccT has precision problems, so we fall back to the fp32 datatype for accFrag.

Complex Case
  • Llama 3 classic layout M x K x N = 4096 x 16384 x 4096
    • baseline :
      • M4 Pro :
        [Screenshot 2025-11-15, 6:47 PM]
    • metal kernel vectorized loads
      • M4 Pro :
        [Screenshot 2025-11-16, 2:20 PM]
      • M3 Ultra (thanks to the Apple team for their support):
        [Image: mlx_test_img_2]
TO DO LIST

[ ] Double buffer
[ ] WASP
[ ] Prefetch
[ ] Add a simple tuner
[x] stream-k
[x] split-k with atomic_fetch_add
[ ] SIMD_SUM_ACC, SIMD_ADD_MULTIPLY_ACC (to improve the flops/bytes ratio)
[x] Faster unpack
[ ] Faster SIMD unpack
[ ] Memory interleave preprocessing
[x] Vector load (64-bit)
[WIP] Generalized N-bank, M-bit swizzle strategy (guess: 32 banks, 4 bytes per bank; 64-bit vectorization size on the Apple platform). E.g. with BLOCK_SIZE_K = 128 x 16-bit data and a 4 x 8 thread-group configuration, 32x128 / (8x64) = 8 rows; with BLOCK_SIZE_M at least 16, this gives 2-way conflicts.
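The swizzle item above can be illustrated with a small model of the assumed 32-bank, 4-bytes-per-bank shared memory (a hypothetical XOR-swizzle sketch in Python, not the kernel's actual mapping):

```python
# Toy model: 32 banks, one 4-byte word per bank per cycle.
BANKS = 32

def bank_of(addr_words):
    """Bank index of a word-aligned shared-memory address."""
    return addr_words % BANKS

def swizzled_col(row, col, cols_per_row=32):
    """XOR the column with the row index so that a column of a tile
    maps to distinct banks instead of piling onto one."""
    return col ^ (row % cols_per_row)

# Without swizzle: column 0 of every row of a 32-wide tile hits bank 0
# -> a 32-way conflict when 32 threads read down a column.
naive_banks = {bank_of(r * 32 + 0) for r in range(32)}
assert naive_banks == {0}

# With the XOR swizzle, column 0 spreads across all 32 banks.
swizzled_banks = {bank_of(r * 32 + swizzled_col(r, 0)) for r in range(32)}
assert len(swizzled_banks) == BANKS
```

The same accounting explains the 2-way conflict noted above: wider (64-bit) vector accesses occupy two banks per thread, halving the number of distinct banks available per access.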

Apple Metal Performance Ablation Study

Pending

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team marked this pull request as draft November 13, 2025 11:17
@yiakwy-xpu-ml-framework-team
Author

@awni Could you have a look at it ? Thanks !

@awni
Member

awni commented Nov 13, 2025

I'm a bit confused about this. Is there a specific model it should be used with? Or what is the intended usage?

@yiakwy-xpu-ml-framework-team
Author

yiakwy-xpu-ml-framework-team commented Nov 14, 2025

> I'm a bit confused about this. Is there a specific model it should be used with? Or what is the intended usage?

Yes, there is a paper on this topic, and a codebook LUT kernel will be added later for the Metal platform.

https://openreview.net/pdf?id=yBDBCpEzsO


Just realized that Apple's notion of a grid is different from a CUDA grid.

Ref :

[1] https://www.shashankshekhar.com/blog/apple-metal-vs-nvidia-cuda
[2] https://ml-explore.github.io/mlx/build/html/dev/custom_metal_kernels.html#custom-metal-kernels

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team marked this pull request as ready for review November 15, 2025 09:13
@yiakwy-xpu-ml-framework-team
Author

Hi @awni, it is ready to be merged.

Optimization will continue in parallel, and I would be very happy to get any input from you. Thank you!

@awni
Member

awni commented Nov 15, 2025

Hi @yiakwy-xpu-ml-framework-team appreciate the contribution, but I don't think it makes sense to merge this. We'd need some evidence that this is useful: is there a corresponding model? Is it fast? Does it work? etc.

@yiakwy-xpu-ml-framework-team
Author

yiakwy-xpu-ml-framework-team commented Nov 16, 2025

@awni

> is there a corresponding model? Does it work?

I think a model should be provided; let me check.

> Is it fast?

This is a follow-up to the 1.58-bit model in #219, where end-to-end performance is evaluated.

As for the kernel part, I believe the file is self-explanatory; it is an extension to https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/models/bitlinear_layers.py by @Blaizzy.

Benefits of sub-1-bit packing:

  1. BitNet (1.58-bit) actually uses 2 bits per weight, while sub-1-bit models use less than 1 bit per weight on average, hence less memory I/O from global memory to on-chip SRAM.

  2. We have extensively benchmarked on various platforms how 1-bit matmul should be accelerated; low-bit kernels, including 4-bit ones (Marlin, BitNet), are very hard-pressed to beat SOTA fp16 matmul due to the extensive shift operations.

Here we do better by moving the m loop to the innermost position, so the values obtained from shifting are cached and reused.
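That loop-order idea can be sketched in NumPy (a toy model with a hypothetical helper name, not the Metal kernel): each packed weight is unpacked (shifted) once, and the m loop sits innermost so the unpacked row is reused across all rows of A.

```python
import numpy as np

def matmul_bitpacked(a, packed_b, K, N):
    """Toy GEMM against a 1-bit-packed B: unpack each weight once, then
    keep m as the inner loop so the unpacked row is reused M times."""
    M = a.shape[0]
    out = np.zeros((M, N), dtype=np.float32)
    bits = np.unpackbits(packed_b)[:K * N].reshape(K, N)
    b = bits.astype(np.float32) * 2 - 1    # unpack/shift done once per weight
    for k in range(K):
        bk = b[k]                          # cached unpacked row of B
        for m in range(M):                 # m is the inner loop: bk reused M times
            out[m] += a[m, k] * bk
    return out
```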

More techniques will be added soon:

  1. Stream-k added: various older implementations are built on split-k variants, especially on the Metal platform. Perhaps this is the first stream-k low-bit GEMM (or even the first stream-k GEMM) on Metal? With stream-k, workloads are distributed more evenly across the 80 GPU cores of the M3 Ultra. This is a practical contribution.

  2. Reducing model size can be orthogonal to accelerating inference. For example, MoE models are large but activate only a few of their parameters; with model size reduction, models like gpt-oss-120b-mxfp4 become affordable to run on a single H100 GPU (of an 8x H100 80 GB DGX).
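The stream-k scheduling in point 1 can be sketched as an even split of the total K-iteration work across GPU cores (a toy Python model with hypothetical names, not the Metal implementation): a core's range may cross tile boundaries, with partial tiles combined afterwards (e.g. via atomic_fetch_add).

```python
def streamk_partition(num_tiles, iters_per_tile, num_cores):
    """Split the total K-loop work of all output tiles evenly over cores.
    Returns one (start, end) global-iteration range per active core."""
    total = num_tiles * iters_per_tile
    per_core = (total + num_cores - 1) // num_cores  # ceil division
    ranges = []
    for c in range(num_cores):
        start = c * per_core
        end = min(start + per_core, total)
        if start < end:
            ranges.append((start, end))
    return ranges

# 7 tiles x 10 K-iters over 4 cores: each core gets ~18 iterations, crossing
# tile boundaries, instead of the 2/2/2/1-tile imbalance of tile-per-core
# scheduling.
parts = streamk_partition(7, 10, 4)
assert sum(e - s for s, e in parts) == 70
assert max(e - s for s, e in parts) - min(e - s for s, e in parts) <= 2
```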

One drawback of BitNet is that it is based on a dense model; previous work (gpt-oss mxfp4) has shown that large MoE models with MoE FFN layers are resilient to quantization error.

When this change is adapted to low-bit MoE, it will be very useful for model size reduction (for large MoE models).
