
Conversation

@Sohailm25

Proposed changes

  • Wire an end-to-end continuous batching runtime behind --enable-continuous-batching: scheduler+runtime split, paged slot KV cache, prefix cache (APC) stats, paged attention backend shim, and slot generator updates so mid-decode admission plus server-lifetime KV reuse work for the OpenAI-compatible server.
  • Add the array prefill runner + LLaMA graph overlay path so paged Metal kernels can overlap prefill/decode, plus the new attn backend shim that falls back cleanly when paged kernels/geometries aren’t available.
  • Extend the bench harness (bench/bench_continuous_vs_static.py, bench/prefill_profile.py) and check in the evidence bundle under benchmarks/ referenced by PR_README.md, showing dense vs paged baselines, continuous batching TTFT/throughput, and memory telemetry.
  • Cover the new behavior with focused unit tests: paged slot KV cache, paged attention patch, LLaMA prefill graph, model-runner metrics, array runner overlays, scheduler/runtimes, and the new paged model cache tests.
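The admission and KV-reuse mechanics described above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: `PAGE_SIZE`, `Request`, `PagedSlotKVCache`, and `ContinuousBatchScheduler` are all hypothetical names chosen for the example, and prefill, attention, and sampling are elided.

```python
from dataclasses import dataclass
from collections import deque

PAGE_SIZE = 16  # tokens per KV page (assumed for this sketch)

@dataclass
class Request:
    rid: int
    prompt_len: int
    max_new_tokens: int
    generated: int = 0

class PagedSlotKVCache:
    """Hands out fixed-size KV pages to request slots and reclaims them."""
    def __init__(self, num_pages: int):
        self.free = list(range(num_pages))
        self.pages = {}  # rid -> list of page ids

    def alloc(self, rid: int, tokens: int) -> bool:
        need = -(-tokens // PAGE_SIZE)  # ceil division
        if need > len(self.free):
            return False
        self.pages[rid] = [self.free.pop() for _ in range(need)]
        return True

    def release(self, rid: int):
        self.free.extend(self.pages.pop(rid, []))

class ContinuousBatchScheduler:
    """Admits waiting requests into the running batch between decode steps."""
    def __init__(self, cache: PagedSlotKVCache, max_batch: int):
        self.cache = cache
        self.max_batch = max_batch
        self.waiting = deque()
        self.running = []

    def submit(self, req: Request):
        self.waiting.append(req)

    def step(self):
        # Mid-decode admission: pull from the wait queue while batch slots
        # and KV pages remain; a request that cannot get pages waits.
        while self.waiting and len(self.running) < self.max_batch:
            req = self.waiting[0]
            total = req.prompt_len + req.max_new_tokens
            if not self.cache.alloc(req.rid, total):
                break
            self.running.append(self.waiting.popleft())
        # One decode step for every running request; finished requests
        # free their slot and pages immediately, not at batch end.
        finished = [r for r in self.running if r.generated + 1 >= r.max_new_tokens]
        for req in self.running:
            req.generated += 1
        for req in finished:
            self.running.remove(req)
            self.cache.release(req.rid)
        return finished
```

The key property is that `step()` both retires finished requests and admits new ones, so a late arrival never waits for the whole batch to drain.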

This implements the goals captured in agent-reference/prd.md.

Checklist

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files (ran pre-commit on the staged files; --all-files would rewrite untouched legacy files and benches)
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (PR_README.md already describes the feature set; no user-facing README changes needed)
  • Tests: PYTHONPATH=../mlx/python:. .venv/bin/python -m pytest tests/server_batched -vv

DEPENDS ON CHANGES IN MLX HERE: https://github.com/ml-explore/mlx/pull/2760

@Sohailm25
Author

Addresses these issues:

#499
#548
#178
#259
#480

@Sohailm25 force-pushed the codex/connect-integration branch from 7e2b378 to 0d3ec31 on November 13, 2025 at 21:53
@awni
Member

awni commented Nov 15, 2025

See corresponding comment in ml-explore/mlx#2760 (copied below):

Thanks for the contribution! At the moment this looks a bit early stage for a PR. If you want to keep running with it, I'd consider refining in your own fork. Then you could start an issue or discussion about if it makes sense to contribute this back to MLX.

Generally an interesting direction and I'd be curious to see more benchmarks around where something like this could help and by how much.

@awni awni closed this Nov 15, 2025
@Sohailm25
Author


@awni I broke this down into a smaller first PR, ran a benchmark, and measured a 6x TTFT reduction with continuous batching:
#629
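The intuition behind a TTFT reduction of this size fits in a toy latency model. This is a sketch, not the PR's bench harness; `STEP_MS` and both functions are hypothetical, and prefill cost is ignored. Under static batching a new request waits for the in-flight batch to drain before its first token; under continuous batching it is admitted at the next decode-step boundary.

```python
STEP_MS = 10.0  # assumed per-decode-step latency for this toy model

def ttft_static(batch_remaining_steps: int) -> float:
    # Static batching: wait for the current batch to finish,
    # then one more step produces the first token.
    return batch_remaining_steps * STEP_MS + STEP_MS

def ttft_continuous() -> float:
    # Continuous batching: admitted at the next step boundary.
    return STEP_MS
```

With five decode steps left in the in-flight batch, the toy model gives a 6x gap; real numbers depend on batch depth, prefill cost, and kernel overheads, which is what the checked-in bench evidence measures.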
