
Conversation

@yiakwy-xpu-ml-framework-team

Description

Both the MLX and PyTorch Metal backends do not utilize the Apple Neural Engine (ANE): a capable 32-core compute unit.

Unlike NVIDIA GPUs, unified memory means that activations visible to the GPU can be made available to the ANE engine with only a copy back and forth between unified memory and the ANE.

For an M x N matrix, if we can dispatch k of the M rows to the ANE while the GPU is fully occupied, we should expect to see some computing improvement.
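The row-splitting idea can be sketched with a toy host-side dispatcher. This is a minimal sketch, not the project's actual code: `ane_matmul` and `gpu_matmul` are hypothetical stand-ins (here both are plain NumPy matmuls), and real dispatch would go through Core ML and Metal respectively.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def split_gemm(a, b, k):
    """Compute a @ b by sending the first k rows of `a` to one compute
    unit (stand-in for the ANE) and the remaining M - k rows to another
    (stand-in for the Metal GPU), then concatenating the results."""
    # Hypothetical per-device kernels; both are plain matmuls here.
    ane_matmul = gpu_matmul = np.matmul
    with ThreadPoolExecutor(max_workers=2) as pool:
        top = pool.submit(ane_matmul, a[:k], b)     # k rows -> "ANE"
        bottom = pool.submit(gpu_matmul, a[k:], b)  # M - k rows -> "GPU"
        return np.concatenate([top.result(), bottom.result()], axis=0)

a = np.random.rand(8, 4)
b = np.random.rand(4, 3)
assert np.allclose(split_gemm(a, b, 3), a @ b)
```

Since rows of the output are independent in a GEMM, the split ratio k/M can be tuned to match the relative throughput of the two units.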

Methodology

We first added an ANE calling engine, then we verified how the ANE can work together with different backends.

To the best of our knowledge, the ANE runs at roughly 1/3 to 1/5 of the speed of the Metal 3 GPU on the M3 Ultra:

(ANE benchmark screenshot)

And the PyTorch MPS backend does not work very well with unified memory:

(tensor.data_ptr() != tensor.cpu().data_ptr())

Hence we are tuning within this range with the MLX backend to get even better performance.

@IntuitIntelLLC

My understanding is that the ANE is 5x faster, from benchmarks that run small compute against it. It's limited to very light data sets, typically 1 GB to 3 GB total size. I can see an application where the KV cache or something else small is processed/stored there. Please help clarify this for me. I can probably dig up the links where I found the original data specifying the limits of the ANE.

@yiakwy-xpu-ml-framework-team
Author

> my understanding is that the ANE is 5X faster, from benchmarks that run small compute against it. it's limited to very light data sets typically 1GB to 3GB total size. I can see an application where KV cache or something else small was processed/stored. please help clarify this for me. I can probably dig up the links where I found this original data specifying the limits of the ANE.

On the M3 Ultra (2 dies with 80 GPU cores in total), the Metal GPU is faster than the ANE. That is why the MLX team won't support the ANE. However, we can still use it to accelerate our computing.

@Anemll

Anemll commented Nov 21, 2025

Compute is not faster on the M3U GPU. Memory bandwidth is.

@Anemll

Anemll commented Nov 21, 2025

28 TFLOPs for GPU and 36 TFLOPs on ANE for M3U

@yiakwy-xpu-ml-framework-team
Author

yiakwy-xpu-ml-framework-team commented Nov 21, 2025

> 28 TFLOPs for GPU and 36 TFLOPs on ANE for M3U

Hi Anemll, I just noticed your work. Great job.

Simply put, in our comprehensive benchmark (not yet published), MPS can achieve 25 TFLOPS against the claimed 28 TFLOPS [1]. Note that the ANE is claimed to reach 36 TOPS (integer) [1], not TFLOPS.

You can verify the ANE's performance with an a16w16 GEMM using this test script: the ANE is 5 times slower than MPS.
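As a rough sketch of how the TFLOPS figures above can be measured: a dense M x N x K GEMM performs about 2*M*N*K floating-point operations, so achieved throughput is that count divided by wall time. This is a host-side illustration only; the `matmul` parameter stands in for whatever backend kernel (MPS, MLX, or an ANE call) is being benchmarked.

```python
import time
import numpy as np

def measured_tflops(m, n, k, matmul=np.matmul, iters=10):
    """Estimate achieved TFLOPS of a matmul backend.

    A dense GEMM performs roughly 2*m*n*k floating-point operations
    (one multiply and one add per inner-product term).
    """
    a = np.random.rand(m, k).astype(np.float32)
    b = np.random.rand(k, n).astype(np.float32)
    matmul(a, b)  # warm-up, excluded from timing
    t0 = time.perf_counter()
    for _ in range(iters):
        matmul(a, b)
    elapsed = (time.perf_counter() - t0) / iters
    return 2 * m * n * k / elapsed / 1e12

print(f"{measured_tflops(512, 512, 512):.3f} TFLOPS")
```

For GPU or ANE backends, remember to synchronize before reading the timer, since kernel launches are asynchronous.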

Our assumption is that the ANE is still a capable compute unit, so we want to utilize it by dispatching part of the main workload from MPS to the ANE.

=====
Notably, even in private LLM inference we prefer floating point over integer, since integer quantization methods cost more energy than floating point, especially when you employ some smoothing technique. And since integer multiplication is essentially polynomial multiplication, and shift operations are very expensive in circuits, float multiplication is cheaper than integer multiplication (the exponent is computed with an add, and the mantissa is short), which is the opposite of what some research papers claim, as they did not account for the scaling operations.
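The "scaling operations" point can be made concrete with a minimal symmetric int8-quantized GEMM sketch (an illustration, not any particular quantization library): beyond the integer multiply-accumulates, each output still needs a floating-point rescale by the product of the two tensor scales, so the scales never drop out of the cost model.

```python
import numpy as np

def quantize_sym(x, bits=8):
    """Symmetric per-tensor quantization: x ~= scale * q, q in int8."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

np.random.seed(0)
a = np.random.randn(4, 8).astype(np.float32)
b = np.random.randn(8, 3).astype(np.float32)
qa, sa = quantize_sym(a)
qb, sb = quantize_sym(b)

# Integer accumulation in int32, then the extra floating-point rescale
# the comment above refers to: (sa * sb) multiplies every output.
y = (qa.astype(np.int32) @ qb.astype(np.int32)) * (sa * sb)
assert np.allclose(y, a @ b, atol=0.5)
```

The result approximates the float GEMM only up to quantization error, and the final `* (sa * sb)` is a real floating-point multiply per output element that a flops-only comparison of int8 vs. float kernels would miss.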

[1] https://en.wikipedia.org/wiki/Apple_silicon

@Anemll

Anemll commented Nov 21, 2025 via email

@yiakwy-xpu-ml-framework-team
Author

> For the M3U, 36 is FP16 due to the dual cluster; for the M4 Max that is int8/int4. The M3 was the last ANE without accelerated int8. To reach the full 30+ TFLOPS you need batch 64 in the last tensor dimension.


Thanks. Do you have an example that can run with that output? Thx!

