
Conversation

@Sohailm25

Proposed changes

  • Wire an end-to-end continuous batching runtime behind --enable-continuous-batching: scheduler+runtime split, paged slot KV cache, prefix cache (APC) stats, paged attention backend shim, and slot generator updates so mid-decode admission plus server-lifetime KV reuse work for the OpenAI-compatible server.
  • Add the array prefill runner + LLaMA graph overlay path so paged Metal kernels can overlap prefill/decode, plus the new attn backend shim that falls back cleanly when paged kernels/geometries aren’t available.
  • Extend the bench harness (bench/bench_continuous_vs_static.py, bench/prefill_profile.py) and check in the evidence bundle under benchmarks/ referenced by PR_README.md, showing dense vs paged baselines, continuous batching TTFT/throughput, and memory telemetry.
  • Cover the new behavior with focused unit tests: paged slot KV cache, paged attention patch, LLaMA prefill graph, model-runner metrics, array runner overlays, scheduler/runtimes, and the new paged model cache tests.
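The admission and KV-reuse mechanics described above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: `PAGE_SIZE`, `Request`, `PagedSlotKVCache`, and `ContinuousBatchScheduler` are all hypothetical names chosen for the example, and prefill, attention, and sampling are elided.

```python
from dataclasses import dataclass
from collections import deque

PAGE_SIZE = 16  # tokens per KV page (assumed for this sketch)

@dataclass
class Request:
    rid: int
    prompt_len: int
    max_new_tokens: int
    generated: int = 0

class PagedSlotKVCache:
    """Hands out fixed-size KV pages to request slots and reclaims them."""
    def __init__(self, num_pages: int):
        self.free = list(range(num_pages))
        self.pages = {}  # rid -> list of page ids

    def alloc(self, rid: int, tokens: int) -> bool:
        need = -(-tokens // PAGE_SIZE)  # ceil division
        if need > len(self.free):
            return False
        self.pages[rid] = [self.free.pop() for _ in range(need)]
        return True

    def release(self, rid: int):
        self.free.extend(self.pages.pop(rid, []))

class ContinuousBatchScheduler:
    """Admits waiting requests into the running batch between decode steps."""
    def __init__(self, cache: PagedSlotKVCache, max_batch: int):
        self.cache = cache
        self.max_batch = max_batch
        self.waiting = deque()
        self.running = []

    def submit(self, req: Request):
        self.waiting.append(req)

    def step(self):
        # Mid-decode admission: pull from the wait queue while batch slots
        # and KV pages remain; a request that cannot get pages waits.
        while self.waiting and len(self.running) < self.max_batch:
            req = self.waiting[0]
            total = req.prompt_len + req.max_new_tokens
            if not self.cache.alloc(req.rid, total):
                break
            self.running.append(self.waiting.popleft())
        # One decode step for every running request; finished requests
        # free their slot and pages immediately, not at batch end.
        finished = [r for r in self.running if r.generated + 1 >= r.max_new_tokens]
        for req in self.running:
            req.generated += 1
        for req in finished:
            self.running.remove(req)
            self.cache.release(req.rid)
        return finished
```

The key property is that `step()` both retires finished requests and admits new ones, so a late arrival never waits for the whole batch to drain.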

This implements the goals captured in agent-reference/prd.md.

Checklist

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files (ran pre-commit on the staged files; --all-files would rewrite untouched legacy files and benches)
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (PR_README.md already describes the feature set; no user-facing README changes needed)
  • Tests: PYTHONPATH=../mlx/python:. .venv/bin/python -m pytest tests/server_batched -vv

DEPENDS ON CHANGES IN MLX HERE: https://github.com/ml-explore/mlx/pull/2760

@Sohailm25
Author

Addresses these issues:

#499
#548
#178
#259
#480

@Sohailm25 force-pushed the codex/connect-integration branch from 7e2b378 to 0d3ec31 on November 13, 2025 at 21:53
@awni
Member

awni commented Nov 15, 2025

See corresponding comment in ml-explore/mlx#2760 (copied below):

Thanks for the contribution! At the moment this looks a bit early stage for a PR. If you want to keep running with it, I'd consider refining in your own fork. Then you could start an issue or discussion about if it makes sense to contribute this back to MLX.

Generally an interesting direction and I'd be curious to see more benchmarks around where something like this could help and by how much.

@awni awni closed this Nov 15, 2025
@Sohailm25
Author


@awni I broke this down into a smaller first PR, ran a benchmark, and measured a 6x TTFT reduction with continuous batching:
#629
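The intuition behind a TTFT reduction of this size fits in a toy latency model. This is a sketch, not the PR's bench harness; `STEP_MS` and both functions are hypothetical, and prefill cost is ignored. Under static batching a new request waits for the in-flight batch to drain before its first token; under continuous batching it is admitted at the next decode-step boundary.

```python
STEP_MS = 10.0  # assumed per-decode-step latency for this toy model

def ttft_static(batch_remaining_steps: int) -> float:
    # Static batching: wait for the current batch to finish,
    # then one more step produces the first token.
    return batch_remaining_steps * STEP_MS + STEP_MS

def ttft_continuous() -> float:
    # Continuous batching: admitted at the next step boundary.
    return STEP_MS
```

With five decode steps left in the in-flight batch, the toy model gives a 6x gap; real numbers depend on batch depth, prefill cost, and kernel overheads, which is what the checked-in bench evidence measures.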
