
Conversation

@yiakwy-xpu-ml-framework-team

Description

Both the MLX and PyTorch Metal backends do not utilize the Apple Neural Engine (ANE): a capable 32-core compute unit.

Unlike NVIDIA GPUs, unified memory means that activations visible to the GPU can be made available to the ANE engine with only a copy back and forth between unified memory and the ANE.

For an M x N matrix, if we can dispatch k of the M rows to the ANE while the GPU is fully occupied, we should expect to see some computing improvement.
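The row-splitting idea can be sketched with a toy host-side dispatcher. This is a minimal sketch, not the project's actual code: `ane_matmul` and `gpu_matmul` are hypothetical stand-ins (here both are plain NumPy matmuls), and real dispatch would go through Core ML and Metal respectively.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def split_gemm(a, b, k):
    """Compute a @ b by sending the first k rows of `a` to one compute
    unit (stand-in for the ANE) and the remaining M - k rows to another
    (stand-in for the Metal GPU), then concatenating the results."""
    # Hypothetical per-device kernels; both are plain matmuls here.
    ane_matmul = gpu_matmul = np.matmul
    with ThreadPoolExecutor(max_workers=2) as pool:
        top = pool.submit(ane_matmul, a[:k], b)     # k rows -> "ANE"
        bottom = pool.submit(gpu_matmul, a[k:], b)  # M - k rows -> "GPU"
        return np.concatenate([top.result(), bottom.result()], axis=0)

a = np.random.rand(8, 4)
b = np.random.rand(4, 3)
assert np.allclose(split_gemm(a, b, 3), a @ b)
```

Since rows of the output are independent in a GEMM, the split ratio k/M can be tuned to match the relative throughput of the two units.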

Methodology

We first added an ANE calling engine, then we verified how the ANE can work together with different backends.

To the best of our knowledge, the ANE runs at roughly 1/3 to 1/5 of the speed of the Metal 3 GPU on the M3 Ultra:

(ANE benchmark screenshot)

And the PyTorch MPS backend does not work very well with unified memory:

(tensor.data_ptr() != tensor.cpu().data_ptr())

Hence we are tuning within this range with the MLX backend to get even better performance.

@IntuitIntelLLC

My understanding is that the ANE is 5x faster, from benchmarks that run small compute against it. It's limited to very light data sets, typically 1 GB to 3 GB total size. I can see an application where the KV cache or something else small is processed/stored there. Please help clarify this for me. I can probably dig up the links where I found the original data specifying the limits of the ANE.

@yiakwy-xpu-ml-framework-team
Author

> my understanding is that the ANE is 5X faster, from benchmarks that run small compute against it. it's limited to very light data sets typically 1GB to 3GB total size. I can see an application where KV cache or something else small was processed/stored. please help clarify this for me. I can probably dig up the links where I found this original data specifying the limits of the ANE.

On the M3 Ultra (2 dies with 80 GPU cores in total), the Metal GPU is faster than the ANE. That is why the MLX team won't support the ANE. However, we can still use it to accelerate our computing.

@Anemll

Anemll commented Nov 21, 2025

Compute is not faster on the M3U GPU. Memory bandwidth is.

@Anemll

Anemll commented Nov 21, 2025

28 TFLOPs for GPU and 36 TFLOPs on ANE for M3U

@yiakwy-xpu-ml-framework-team
Author

yiakwy-xpu-ml-framework-team commented Nov 21, 2025

> 28 TFLOPs for GPU and 36 TFLOPs on ANE for M3U

Hi Anemll, I just noticed your work. Great job.

Simply put, in our comprehensive benchmark (not yet published), MPS can achieve 25 TFLOPS against the claimed 28 TFLOPS [1]. Note that the ANE is claimed to reach 36 TOPS (integer) [1], not TFLOPS.

You can verify the ANE's performance with an a16w16 GEMM using this test script: the ANE is 5 times slower than MPS.
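As a rough sketch of how the TFLOPS figures above can be measured: a dense M x N x K GEMM performs about 2*M*N*K floating-point operations, so achieved throughput is that count divided by wall time. This is a host-side illustration only; the `matmul` parameter stands in for whatever backend kernel (MPS, MLX, or an ANE call) is being benchmarked.

```python
import time
import numpy as np

def measured_tflops(m, n, k, matmul=np.matmul, iters=10):
    """Estimate achieved TFLOPS of a matmul backend.

    A dense GEMM performs roughly 2*m*n*k floating-point operations
    (one multiply and one add per inner-product term).
    """
    a = np.random.rand(m, k).astype(np.float32)
    b = np.random.rand(k, n).astype(np.float32)
    matmul(a, b)  # warm-up, excluded from timing
    t0 = time.perf_counter()
    for _ in range(iters):
        matmul(a, b)
    elapsed = (time.perf_counter() - t0) / iters
    return 2 * m * n * k / elapsed / 1e12

print(f"{measured_tflops(512, 512, 512):.3f} TFLOPS")
```

For GPU or ANE backends, remember to synchronize before reading the timer, since kernel launches are asynchronous.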

Our assumption is that the ANE is still a capable compute unit, so we want to utilize it by dispatching part of the main workload from MPS to the ANE.

=====
Notably, even in private LLM inference we prefer floating point over integer, since integer quantization methods cost more energy than floating point, especially when you employ some smoothing technique. And since integer multiplication is essentially polynomial multiplication, and shift operations are very expensive in circuits, float multiplication is cheaper than integer multiplication (the exponent is computed with an add, and the mantissa is short), which is the opposite of what some research papers claim, as they did not account for the scaling operations.
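The "scaling operations" point can be made concrete with a minimal symmetric int8-quantized GEMM sketch (an illustration, not any particular quantization library): beyond the integer multiply-accumulates, each output still needs a floating-point rescale by the product of the two tensor scales, so the scales never drop out of the cost model.

```python
import numpy as np

def quantize_sym(x, bits=8):
    """Symmetric per-tensor quantization: x ~= scale * q, q in int8."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

np.random.seed(0)
a = np.random.randn(4, 8).astype(np.float32)
b = np.random.randn(8, 3).astype(np.float32)
qa, sa = quantize_sym(a)
qb, sb = quantize_sym(b)

# Integer accumulation in int32, then the extra floating-point rescale
# the comment above refers to: (sa * sb) multiplies every output.
y = (qa.astype(np.int32) @ qb.astype(np.int32)) * (sa * sb)
assert np.allclose(y, a @ b, atol=0.5)
```

The result approximates the float GEMM only up to quantization error, and the final `* (sa * sb)` is a real floating-point multiply per output element that a flops-only comparison of int8 vs. float kernels would miss.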

[1] https://en.wikipedia.org/wiki/Apple_silicon

@Anemll

Anemll commented Nov 21, 2025 via email

@yiakwy-xpu-ml-framework-team
Author

> For the M3U, 36 is FP16 due to the dual cluster; for the M4 Max that is int8/int4. The M3 was the last ANE without accelerated int8. To reach the full 30+ TFLOPS you need batch 64 in the last tensor dimension.


Thanks. Do you have an example that can run with that output? Thx!

