Add ane op matmul #617
Conversation
|
My understanding is that the ANE is about 5x faster, from benchmarks that run small compute kernels against it. It is limited to very light data sets, typically 1 GB to 3 GB total size. I can see an application where the KV cache or something else small is processed/stored there. Please help clarify this for me. I can probably dig up the links where I found the original data specifying the ANE's limits. |
On the M3 Ultra (2 dies with 80 GPU cores total), the Metal GPU is faster than the ANE. That is why the MLX team won't support the ANE. However, we can still use it to accelerate our computing. |
|
Compute is not faster on the M3U GPU. Memory bandwidth is. |
|
28 TFLOPs for GPU and 36 TFLOPs on ANE for M3U |
Hi Anemll, I just noticed your work. Great job. Simply put, in our comprehensive benchmark (not published yet), MPS can achieve 25 TFLOPS against the claimed 28 TFLOPS [1]. Note the ANE is claimed to have 36 TOPS (integer) [1], not TFLOPS. You can verify the ANE's performance with an a16w16 GEMM using this test script: the ANE is 5 times slower than MPS. Our assumption is that the ANE is still a capable computing unit, so we want to utilize it by dispatching part of the main workload from MPS to the ANE. |
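The achieved-TFLOPS comparison above can be sketched with a simple timing harness. This is an illustrative stand-in only: it runs the GEMM on CPU with NumPy, since the commenter's actual a16w16 test script is not shown; to measure MPS or MLX you would swap the `a @ b` call for the corresponding backend's matmul (and force evaluation, e.g. `mx.eval` in MLX).

```python
# Hedged sketch: measure achieved GEMM throughput in TFLOPS.
# A GEMM of shape (m, k) @ (k, n) performs 2*m*n*k floating-point ops.
import time
import numpy as np

def gemm_tflops(m, n, k, dtype=np.float16, iters=5):
    """Return achieved TFLOPS for an (m, k) @ (k, n) GEMM on CPU."""
    a = np.random.rand(m, k).astype(dtype)
    b = np.random.rand(k, n).astype(dtype)
    a @ b  # warm-up so one-time costs are excluded from timing
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    dt = (time.perf_counter() - t0) / iters
    return (2.0 * m * n * k) / dt / 1e12

print(f"{gemm_tflops(512, 512, 512):.4f} TFLOPS")
```

Running the same harness against each backend at the same shapes is what makes the 25-vs-claimed-28 style comparison meaningful.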
|
For the M3U, the 36 figure is FP16 due to the dual cluster; for the M4 Max it is int8/int4. The M3 was the last ANE without accelerated int8. To reach the full 30+ TFLOPS you need batch 64 in the last tensor dimension.
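The batch-64 requirement mentioned above suggests zero-padding the last tensor dimension up to a multiple of 64 before handing work to the ANE. The helper below is a hypothetical illustration of that padding step, not part of this PR's actual ANE path.

```python
import numpy as np

def pad_last_dim(x, multiple=64):
    """Zero-pad the last dimension of `x` up to the next multiple
    (64 by default), mirroring the batching the ANE reportedly needs
    for peak throughput. Illustrative sketch only."""
    rem = (-x.shape[-1]) % multiple
    if rem == 0:
        return x
    pad = [(0, 0)] * (x.ndim - 1) + [(0, rem)]
    return np.pad(x, pad)

print(pad_last_dim(np.zeros((4, 100))).shape)  # (4, 128)
```

The padded columns contribute zeros to a GEMM, so results in the original columns are unchanged; the cost is the extra memory traffic for the pad.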
|
Thanks. Do you have an example that can run with that output? Thx! |
Description
Both the MLX and PyTorch Metal backends leave the Apple ANE unused: a capable 32-core computing unit.
Unlike on NVIDIA GPUs, unified memory means activations visible to the GPU can be fetched by the ANE engine; it only requires copying between unified memory and the ANE back and forth.
For an M x N matrix, if we can dispatch k of the M rows to the ANE while the GPU is fully occupied, we should expect to see some computing improvements.
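The row-dispatch idea above can be sketched as a split matmul. This is a CPU-only NumPy illustration of the partitioning, with hypothetical names; in the PR's setting one chunk would run on the ANE while the other stays on the Metal GPU.

```python
import numpy as np

def split_matmul(a, b, k):
    """Compute a @ b by splitting `a` row-wise at index k and running
    the two halves separately, then concatenating the results.
    Sketch of the dispatch scheme; both halves run on CPU here."""
    top = a[:k] @ b      # the k rows that would go to the ANE
    bottom = a[k:] @ b   # the remaining rows stay on the GPU
    return np.concatenate([top, bottom], axis=0)

a = np.random.rand(8, 4)
b = np.random.rand(4, 6)
assert np.allclose(split_matmul(a, b, 3), a @ b)
```

Row-wise splitting is safe for GEMM because each output row depends only on the corresponding row of `a`, so no cross-device reduction is needed, only the final concatenation.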
Methodology
We first added an ANE calling engine, then verified how the ANE can work together with different backends.
To the best of our knowledge, the ANE is roughly 1/3 to 1/5 the speed of the Metal GPU on the M3 Ultra:
And the PyTorch MPS backend does not work very well with unified memory.
Hence we are tuning around this range with the MLX backend to get even better performance.
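The tuning range above has a simple first-order model: if both units run in parallel and finish together, the ANE's share of the rows should be proportional to its throughput. The helper below is a hypothetical sketch that assumes perfect overlap and ignores the unified-memory copy cost.

```python
def ane_fraction(gpu_tflops, ane_tflops):
    """Fraction of rows to dispatch to the ANE so that the ANE and
    GPU finish at the same time, assuming perfect overlap and no
    copy overhead. Illustrative model only."""
    return ane_tflops / (gpu_tflops + ane_tflops)

# With the roughly 5x gap quoted earlier in the thread, the ANE
# would take about 1/6 of the rows:
print(ane_fraction(25.0, 5.0))  # 5/30 ~ 0.1667
```

In practice the copy between unified memory and the ANE shifts the optimum toward a smaller ANE share, which is why empirical tuning around this range is still needed.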