Add sub 1bit streamk gemm #609
base: main
Conversation
@awni Could you have a look at it? Thanks!
I'm a bit confused about this. Is there a specific model it should be used with? Or what is the intended usage?
Yes, there is a paper on this topic, and a codebook LUT kernel will later be added for the Metal platform: https://openreview.net/pdf?id=yBDBCpEzsO Also, I just realized that Apple's notion of a grid differs from a CUDA grid. Ref: [1] https://www.shashankshekhar.com/blog/apple-metal-vs-nvidia-cuda
Hi @awni, it is ready to be merged. Optimization will be done in parallel, and I would be very happy to get any input from you. Thank you!
Hi @yiakwy-xpu-ml-framework-team, appreciate the contribution, but I don't think it makes sense to merge this. We'd need some evidence that this is useful: is there a corresponding model? Is it fast? Does it work? etc.
I guess the model should be provided, let me check.
This is a follow-up to the 1.58-bit model in #219, where end-to-end performance was evaluated. As for the kernel part, I believe the file is self-explanatory, as an extension to https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/models/bitlinear_layers.py by Blaizzy @Blaizzy Benefits from sub-1-bit:
Here we make it better by moving the m loop to the inner position, so the values obtained from bit shifting are cached and reused. More techniques will be added soon.
One drawback of BitNet is that it is based on dense models; previous work (gpt-oss mxfp4) has shown that large MoE models with MoE FFN layers are resistant to quantization errors. When this change is adapted to low-bit MoE, it will be very useful for model size reduction (for large MoE).
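To illustrate the loop reordering mentioned above, here is a Python sketch (not the actual Metal kernel; the data layout and function names are assumptions for the example): with the m loop made the inner loop, each packed weight byte is unpacked by bit shifts once and the cached values are reused for all M activation rows, instead of being re-shifted for every row.

```python
def unpack_bits(byte):
    """Map each of the 8 packed bits in one byte to a +1/-1 weight."""
    return [1 if (byte >> i) & 1 else -1 for i in range(8)]

def gemm_packed(packed_w, x, M):
    """packed_w[n][kb]: one byte packs 8 {+1, -1} weights along K for
    output column n; x[m] is an activation row of length 8 * KB.
    Because m is the *inner* loop, each weight byte is unpacked
    (bit-shifted) exactly once and reused for all M rows."""
    N = len(packed_w)
    KB = len(packed_w[0])
    out = [[0.0] * N for _ in range(M)]
    for n in range(N):
        for kb in range(KB):
            w = unpack_bits(packed_w[n][kb])   # shift once, cache
            for m in range(M):                 # inner m loop reuses w
                for i in range(8):
                    out[m][n] += w[i] * x[m][kb * 8 + i]
    return out
```

With packed_w = [[0b00000001]] (one +1 followed by seven -1 weights) and two all-ones/all-twos activation rows, the outputs are -6 and -12, and the byte is unpacked once rather than twice.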
Description
This is based on the latest paper. Previously, 1-bit weights were packed with 2 bits; we show that sub-1-bit (0.7 ~ 0.8 bit) weights can also produce very good results, and only 1 bit is needed to pack each weight.
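As a toy illustration of how a sub-1-bit rate can arise (the codebook below is invented for this example and is not the one from the paper): if each group of 4 weights is encoded as a 3-bit index into a shared codebook, the storage cost is 3/4 = 0.75 bits per weight.

```python
# Hypothetical 8-entry codebook: a 3-bit index selects a group of 4
# binary weights, i.e. 0.75 bits per weight.  Invented for illustration.
CODEBOOK = [
    (+1, +1, +1, +1), (+1, +1, -1, -1),
    (+1, -1, +1, -1), (+1, -1, -1, +1),
    (-1, +1, +1, -1), (-1, +1, -1, +1),
    (-1, -1, +1, +1), (-1, -1, -1, -1),
]

def encode(group):
    """Pick the index of the codeword closest (by inner product) to a
    4-weight group."""
    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))
    return max(range(len(CODEBOOK)), key=lambda i: dot(CODEBOOK[i], group))

def bits_per_weight(index_bits=3, group_size=4):
    return index_bits / group_size  # 3 / 4 = 0.75 bit per weight
```

At decode time the index is a cheap lookup (the codebook LUT kernel mentioned in the conversation), and larger groups or smaller indices move the rate further below 1 bit.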
Config:
MLX: 0.29.3
PyTorch: 2.9
System:
Test
Simple Case
We found that Apple's fp16 AccT has precision problems, hence we fall back to the fp32 data type for accFrag.
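A stdlib-only Python sketch of why an fp16 accumulator loses precision (a toy model of rounding, not the kernel code): once the running sum grows large enough, an addend smaller than half an ulp rounds away entirely, so the fp16 accumulator stalls while a wider accumulator keeps growing.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE 754 binary16, emulating an fp16 AccT."""
    return struct.unpack('e', struct.pack('e', x))[0]

def accumulate(n, step, half_precision):
    acc = 0.0
    for _ in range(n):
        if half_precision:
            acc = to_fp16(acc + to_fp16(step))  # round after every add
        else:
            acc += step
    return acc

acc16 = accumulate(4096, 0.1, True)    # stalls at 256.0: 0.1 < ulp(256)/2
acc64 = accumulate(4096, 0.1, False)   # ~409.6, the expected sum
```

This is why keeping accFrag in fp32 (and only converting the final result to fp16) avoids the accuracy loss.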
Complex Case
TO DO LIST
[ ] Double buffer
[ ] WASP
[ ] Prefetch
[ ] Add a simple tuner
[x] streamk
[x] splitk with atomic_fetch_add
[ ] SIMD_SUM_ACC, SIMD_ADD_MULTIPLY_ACC (to improve the flops/bytes ratio)
[x] Faster unpack
[ ] Faster SIMD unpack
[ ] Memory-interleave preprocessing
[x] Vector load (64-bit)
[WIP] Generalized N-bank, M-bit swizzle strategy (guess: 32 banks, 4 bytes per bank; 64-bit vectorization size on the Apple platform), e.g. for BLOCK_SIZE_K = 128 x 16-bit data and a 4 x 8 thread-group configuration, 32x128 / (8x64) = 8 rows; with at least BLOCK_SIZE_M = 16 this gives 2-way bank conflicts.
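The "streamk" and "splitk with atomic_fetch_add" items can be sketched in Python (illustrative only; function names are assumptions, not the kernel's): stream-k flattens the (tile, k-iteration) space and splits it evenly across threadgroups, so a tile whose K loop straddles two groups receives partial sums from both, which the GPU kernel merges with atomic_fetch_add (or a fixup pass).

```python
def streamk_ranges(num_tiles, iters_per_tile, num_groups):
    """Split the flattened (tile, k_iter) iteration space into one
    contiguous [start, end) range per threadgroup, balanced to within
    one iteration."""
    total = num_tiles * iters_per_tile
    base, rem = divmod(total, num_groups)
    ranges, start = [], 0
    for g in range(num_groups):
        n = base + (1 if g < rem else 0)
        ranges.append((start, start + n))
        start += n
    return ranges

def tiles_touched(rng, iters_per_tile):
    """Output tiles a group contributes partial sums to; any tile
    shared with another group needs an atomic (or fixup) reduction."""
    start, end = rng
    return list(range(start // iters_per_tile,
                      (end - 1) // iters_per_tile + 1))
```

For example, 3 tiles of 8 k-iterations split across 4 groups gives each group 6 iterations, and the second group's range (6, 12) spans tiles 0 and 1, so both of those tiles are reduced atomically.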
Apple Metal Performance Ablation Study
Pending