Replicating Apple's LLM benchmarking #634

meghsat · 2025-11-25T22:59:46Z


Hey all,

I came across this article: https://machinelearning.apple.com/research/exploring-llms-mlx-m5, where Apple reports a time to first token (TTFT) of 2.87 s on a MacBook Pro M5 (24 GB) for the GPT-OSS-20B-MXFP4-Q4 model using MLX. However, I can’t replicate that number: I’m getting a TTFT of ~8 s.
Note: None of the other models listed in the article reach the reported numbers on my machine either.
Here’s my benchmarking setup:

To measure TTFT, I had to modify the mlx_lm/generate.py script. Here’s the PR containing those changes: https://github.com/ml-explore/mlx-lm/pull/633/files
Once you add the TTFT logic, please run this code:

import mlx.core as mx
from mlx_lm import load, generate

model, tokenizer = load(
    "mlx-community/gpt-oss-20b-MXFP4-Q4", # You can replace with any model ID mentioned in the article.
    tokenizer_config={"trust_remote_code": True}
)

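# Make sure the weights are fully loaded before timing anything.
mx.eval(model.parameters())

vocab_size = tokenizer.vocab_size
prompt_length = 4096

# Build a fixed random prompt of prompt_length token IDs.
mx.random.seed(0)
dummy_tokens = mx.random.randint(0, vocab_size, (prompt_length,)).tolist()

# Clear the EOS token IDs so a randomly sampled EOS does not end generation early.
tokenizer._eos_token_ids = set()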

# Warmup run: absorbs one-time setup costs so they do not skew the timing.
response = generate(
    model,
    tokenizer,
    prompt=dummy_tokens,
    max_tokens=128,
    verbose=True,
    prefill_step_size=4096,
)

# Measured run: read TTFT from this run's output.
response = generate(
    model,
    tokenizer,
    prompt=dummy_tokens,
    max_tokens=128,
    verbose=True,
    prefill_step_size=4096,
)

It would be great if anyone has observed similar or different results and could share their setup here. Thanks in advance.
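
For anyone who would rather not patch mlx_lm/generate.py, TTFT can also be timed from the outside by wrapping any token stream. The following is only a sketch of that idea, not the logic from the PR above: `measure_ttft`, `approx_prefill_tps`, and `fake_stream` are hypothetical names, and the fake stream merely stands in for a real model so the snippet runs without MLX.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttft(stream: Iterable) -> Tuple[float, float, int]:
    """Consume a token stream and return (ttft_s, total_s, item_count).

    The clock starts before the first item is requested, so for a lazy
    generator the prompt-processing time is included in ttft_s.
    """
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first item arrived
        count += 1
    total = time.perf_counter() - start
    if ttft is None:
        raise ValueError("stream produced no items")
    return ttft, total, count


def approx_prefill_tps(prompt_tokens: int, ttft_s: float) -> float:
    """Rough prompt throughput; TTFT also includes producing the first token."""
    return prompt_tokens / ttft_s


def fake_stream(n_items: int, first_delay_s: float) -> Iterator[int]:
    """Hypothetical stand-in for a model: waits once, then yields n_items."""
    time.sleep(first_delay_s)  # pretend this is prompt processing
    for i in range(n_items):
        yield i


if __name__ == "__main__":
    ttft, total, count = measure_ttft(fake_stream(5, 0.1))
    print(f"TTFT {ttft:.3f} s, total {total:.3f} s, {count} items")
```

With mlx_lm, and assuming the installed version exposes `stream_generate` with the same arguments as `generate`, the stream passed in would be something like `stream_generate(model, tokenizer, prompt=dummy_tokens, max_tokens=128, prefill_step_size=4096)`. For the 4096-token prompt used here, `approx_prefill_tps` gives roughly 1427 tokens/s at 2.87 s and 512 tokens/s at 8 s.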

awni · 2025-11-25T23:05:44Z

awni
Nov 25, 2025
Maintainer

What's your hardware / OS? To reproduce those numbers you need the latest MLX (0.30.0) on macOS 26.2 (beta release) on the M5.
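
A quick way to check both versions before benchmarking might look like the sketch below. The helper names are hypothetical, it assumes MLX was installed from the pip distribution named `mlx`, and the default thresholds are simply the versions mentioned above. Note that `platform.mac_ver()` may report a different number on Python builds linked against an older SDK, so treat its result as a hint.

```python
import platform
from importlib import metadata
from typing import Dict, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '26.2' or '0.30.0' into a tuple of ints; stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(found: str, required: str) -> bool:
    """True if `found` is at least `required`, padding the shorter version with zeros."""
    a, b = parse_version(found), parse_version(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def report_environment(min_mlx: str = "0.30.0", min_macos: str = "26.2") -> Dict[str, object]:
    """Collect the installed MLX and macOS versions and compare them to the minimums."""
    try:
        mlx_version = metadata.version("mlx")  # assumed distribution name
    except metadata.PackageNotFoundError:
        mlx_version = ""
    macos_version = platform.mac_ver()[0]  # empty string when not on macOS
    return {
        "mlx": mlx_version or "not installed",
        "macos": macos_version or "not macOS",
        "mlx_ok": meets_minimum(mlx_version, min_mlx),
        "macos_ok": meets_minimum(macos_version, min_macos),
    }


if __name__ == "__main__":
    for key, value in report_environment().items():
        print(f"{key}: {value}")
```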

2 replies

meghsat Nov 25, 2025
Author

Thank you for pointing it out. I'm on macOS 26.1.

meghsat Nov 26, 2025
Author

I was able to reproduce the numbers on macOS 26.2. Thanks!
