Add continuous batching runtime with paged KV, APC, and bench evidence #610
Proposed changes
`--enable-continuous-batching`: scheduler/runtime split, paged slot KV cache, prefix cache (APC) stats, a paged attention backend shim, and slot generator updates, so that mid-decode admission plus server-lifetime KV reuse work for the OpenAI-compatible server. Adds benchmark scripts (`bench/bench_continuous_vs_static.py`, `bench/prefill_profile.py`) and checks in the evidence bundle under `benchmarks/`, referenced by `PR_README.md`, showing dense vs. paged baselines, continuous-batching TTFT/throughput, and memory telemetry. This implements the goals captured in `agent-reference/prd.md`.

Checklist

- `pre-commit run --all-files` (ran pre-commit on the staged files; `--all-files` would rewrite untouched legacy files and benches)
- `PYTHONPATH=../mlx/python:. .venv/bin/python -m pytest tests/server_batched -vv`

DEPENDS ON CHANGES IN MLX HERE: https://github.com/ml-explore/mlx/pull/2760
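To make the paged-KV-plus-APC idea concrete, here is a minimal sketch of a paged slot KV cache with prefix-reuse hit/miss stats. All names (`PagedKVCache`, `allocate`, `prefix_index`) and the block size are illustrative assumptions, not the PR's actual API; real implementations also need reference counting and eviction, which this sketch omits.

```python
# Hypothetical sketch: paged KV block allocation with automatic prefix
# caching (APC). Full blocks of a prompt are indexed by their token
# prefix so a later request with the same prefix reuses those blocks.
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV page (assumed, not the PR's value)


@dataclass
class PagedKVCache:
    num_blocks: int
    free: list = field(default_factory=list)          # free block ids
    prefix_index: dict = field(default_factory=dict)  # prefix -> block id
    hits: int = 0
    misses: int = 0

    def __post_init__(self):
        self.free = list(range(self.num_blocks))

    def allocate(self, prompt_tokens):
        """Return block ids covering the prompt, reusing cached blocks."""
        blocks = []
        n_full = len(prompt_tokens) // BLOCK_SIZE
        reused = 0
        # Walk the longest chain of already-cached full blocks.
        for i in range(n_full):
            key = tuple(prompt_tokens[: (i + 1) * BLOCK_SIZE])
            if key in self.prefix_index:
                blocks.append(self.prefix_index[key])
                self.hits += 1
                reused += 1
            else:
                break
        # Allocate fresh blocks for the uncached tail (ceil division).
        tail_tokens = len(prompt_tokens) - reused * BLOCK_SIZE
        for _ in range(-(-tail_tokens // BLOCK_SIZE)):
            self.misses += 1
            blocks.append(self.free.pop())
        # Register newly completed full blocks for future prefix reuse.
        for i in range(reused, n_full):
            key = tuple(prompt_tokens[: (i + 1) * BLOCK_SIZE])
            self.prefix_index[key] = blocks[i]
        return blocks
```

A second request with an identical prompt then hits the prefix index for every full block, which is the "server-lifetime KV reuse" the description refers to: only the uncached tail pays a prefill cost.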