
Conversation

@Sohailm25

Summary

  • Introduces the dense continuous-batching runtime: scheduler + ModelRunner + server wiring + bench harness (bench/bench_continuous_vs_static.py), enabled via the existing CLI flag. Static batch_generate
    remains unchanged as the default path.
  • Adds robustness fixes to the new path:
    • BatchKVCache rebatching to prevent shape/broadcast crashes when batch size changes mid-decode.
    • ModelRunner idle retirement to avoid hangs when the generator returns no responses.
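
To make the rebatching fix concrete, here is a minimal sketch of shrinking a dense KV slab when sequences retire mid-decode. It is in the spirit of, but not copied from, the PR's BatchKVCache; the class and method names are illustrative only.

```python
# Illustrative sketch (not this PR's BatchKVCache): shrink a dense KV slab so
# its batch dimension matches the sequences that are still decoding.
import mlx.core as mx

class SlotKV:
    """Toy dense KV slab: keys/values shaped [batch, n_heads, seq_len, head_dim]."""

    def __init__(self, keys: mx.array, values: mx.array):
        self.keys = keys
        self.values = values

    def rebatch(self, keep_rows: list[int]) -> None:
        # Drop rows for retired sequences so the next decode step sees a batch
        # dimension that matches its inputs, avoiding shape/broadcast errors.
        idx = mx.array(keep_rows)
        self.keys = mx.take(self.keys, idx, axis=0)
        self.values = mx.take(self.values, idx, axis=0)
```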

What’s included

  • Feature: Continuous batching runtime (dense SlotKV slabs, single-token ticks) integrated with server and bench script (see the tick-loop sketch after this list).
  • Fixes: rebatching + idle retire as above.
  • Tests:
    • tests/test_batch_kv_cache_batch_mismatch.py
    • tests/server_batched/test_model_runner_idle_retire.py
    • Parity scaffold tests/server_batched/test_continuous_parity_small_model.py (skips without HF token).
  • Bench artifacts: bench/logs/llama3p1_8b_dense.log, bench/logs/sweep_llama3p1_8b.log.
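
As referenced in the feature bullet above, the single-token-tick idea can be sketched roughly as follows. This is a hypothetical loop, not this PR's Scheduler/ModelRunner API; `Request` and `step_one_token` are placeholder names.

```python
# Hypothetical single-token tick loop: every active request advances by exactly
# one token per tick, and finished requests retire so new arrivals can join.
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt_tokens: list[int]
    max_tokens: int
    generated: list[int] = field(default_factory=list)

def run_ticks(active: list[Request], step_one_token, eos_id: int) -> None:
    while active:
        next_tokens = step_one_token(active)    # one batched forward pass per tick
        still_running = []
        for req, tok in zip(active, next_tokens):
            req.generated.append(tok)
            if tok != eos_id and len(req.generated) < req.max_tokens:
                still_running.append(req)
        active = still_running                  # retired requests free their KV rows
```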

Results (meta-llama/Llama-3.1-8B, prompt_len=128, max_tokens=64, n=8)

  • TTFT continuous vs static:
    • c=1: 0.29s vs 3.04s
    • c=2: 0.34s vs 3.03s
    • c=4: 0.49s vs 3.04s
    • c=8: 0.52–0.55s vs ~3.04s
  • Throughput (aggregate tok/s):
    • static ~168–169; continuous 90–152 depending on concurrency (TTFT win, 10–45% throughput gap).

Parity

  • Manual parity (logged) on Llama-3.1-8B: continuous tokens matched batch_generate exactly for prompts “Hello world” / “Goodbye moon” (tokens: [[0,358,4344,264],[11,24748,21725,198]]).
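
The check itself is straightforward; a hypothetical version looks like the snippet below (helper names are placeholders, and the real scaffold lives in tests/server_batched/test_continuous_parity_small_model.py).

```python
# Hypothetical parity assertion: the continuous path must emit exactly the
# same token ids as the static batch_generate path under greedy decoding.
def assert_parity(continuous_generate, static_generate, prompts, max_tokens=4):
    cont = continuous_generate(prompts, max_tokens=max_tokens)    # list[list[int]]
    static = static_generate(prompts, max_tokens=max_tokens)      # list[list[int]]
    assert cont == static, f"token mismatch: {cont} != {static}"
```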

How to validate

  • Unit/regression: PYTHONPATH=. python -m pytest tests/test_batch_kv_cache_batch_mismatch.py tests/server_batched/test_model_runner_idle_retire.py
  • Full suite: PYTHONPATH=. python -m pytest tests/server_batched
  • Bench (needs HF token + transfer):
    HF_HUB_ENABLE_HF_TRANSFER=1 HUGGINGFACE_HUB_TOKEN= PYTHONPATH=. python bench/bench_continuous_vs_static.py --repo meta-llama/Llama-3.1-8B --concurrency 4 --n 8 --max_tokens 64 --prompt_len 128

Sweep example in bench/logs/sweep_llama3p1_8b.log.

Scope / safety

  • Continuous batching is opt-in via flag; static batching remains default and untouched.
  • No API breakage; changes are additive and limited to the new runtime.

@Sohailm25
Author

Fix for:

#499
#548
#178

@Sohailm25
Author

@awni this is a much more scoped PR than the previous huge implementation. It proves that continuous batching can be done, and with a significant TTFT improvement.

Future improvements will target overall throughput; as of right now, it's smarter to use the standard path at low concurrency.

@otarkhan

Thanks for this PR! I was testing this locally and noticed that even with continuous batching enabled, requests are still being handled serially (blocking each other) because the standard HTTPServer is single-threaded. You should probably update this as well.

@Sohailm25
Author

> Thanks for this PR! I was testing this locally and noticed that even with continuous batching enabled, requests are still being handled serially (blocking each other) because the standard HTTPServer is single-threaded. You should probably update this as well.

Appreciate the heads up; I'll take a look at the PR and update. I'm not sure whether the HTTP server changes got left out of the commit.

@otarkhan

> Thanks for this PR! I was testing this locally and noticed that even with continuous batching enabled, requests are still being handled serially (blocking each other) because the standard HTTPServer is single-threaded. You should probably update this as well.
>
> Appreciate the heads up; I'll take a look at the PR and update. I'm not sure whether the HTTP server changes got left out of the commit.

Also, I like the extra logs the server shows now; they could be updated to include more info, such as t/s for each request, something like what llama.cpp or mistral.rs do.

@Sohailm25
Author

> Thanks for this PR! I was testing this locally and noticed that even with continuous batching enabled, requests are still being handled serially (blocking each other) because the standard HTTPServer is single-threaded. You should probably update this as well.

Thanks for this comment @otarkhan, I just pushed changes that include:

  • Threaded HTTP server (daemon threads) + locks for safe concurrency
  • Request-level throughput/latency logging

Some more detailed info on the changes:

  • The HTTP server is now threaded by default (a ThreadingHTTPServer subclass with daemon_threads=True): requests no longer serialize on a single accept loop, and locks were added around model loading and the prompt cache to keep them thread-safe
  • Request-level INFO logs include request_id, prompt_tps, generation_tps, total_tps, total_tokens, latency_ms, and finish_reason
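
For reference, here is a minimal sketch of the threading-plus-locks-plus-logging setup using only the standard library. The actual handler, lock granularity, and log fields in the server are more complete; the class and lock names below are illustrative.

```python
# Minimal sketch of threaded request handling with locks and per-request logging.
# Standard library only; names are illustrative, not the PR's actual code.
import logging
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

logging.basicConfig(level=logging.INFO)

model_lock = threading.Lock()         # guards lazy model loading
prompt_cache_lock = threading.Lock()  # guards the shared prompt cache

class APIServer(ThreadingHTTPServer):
    daemon_threads = True             # worker threads exit with the main process

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        start = time.perf_counter()
        length = int(self.headers.get("Content-Length", 0))
        _body = self.rfile.read(length)   # each request runs on its own thread
        with model_lock:
            pass                          # shared state (e.g. model load) taken under lock
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(b"{}")
        latency_ms = (time.perf_counter() - start) * 1000
        logging.info("request path=%s latency_ms=%.1f", self.path, latency_ms)

if __name__ == "__main__":
    APIServer(("127.0.0.1", 8080), Handler).serve_forever()
```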

Evidence

  • We ran a concurrency probe (model: mlx-community/Llama-3.1-8B-Instruct-4bit, continuous batching enabled): four parallel /v1/completions calls returned in 0.65–1.80s with distinct outputs, showing non-blocking handling (a reproduction sketch follows this list)

  • Bench on the same 8B model (n=8, concurrency=4, max_tokens=64, prompt_len=128):

    • Static: 179.25 tok/s, TTFT ≈ 2.86s
    • Continuous: 170.08 tok/s, TTFT ≈ 0.33s (≈8.5× faster first-token)
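
As mentioned above, the concurrency probe can be reproduced with something like the following, assuming a server listening on 127.0.0.1:8080 and the OpenAI-style /v1/completions route; the port, prompt, and payload fields are placeholders.

```python
# Hypothetical concurrency probe: fire four parallel /v1/completions requests
# and report per-request wall time. Port, model, and prompt are placeholders.
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://127.0.0.1:8080/v1/completions"

def one_call(i: int) -> float:
    body = json.dumps({"prompt": f"Request {i}: say hello.", "max_tokens": 32}).encode()
    req = urllib.request.Request(URL, data=body,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=4) as pool:
    for i, dt in enumerate(pool.map(one_call, range(4))):
        print(f"request {i}: {dt:.2f}s")
```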

