
Conversation

@Sohailm25

Summary

  • Introduces the dense continuous-batching runtime: scheduler + ModelRunner + server wiring + bench harness (bench/bench_continuous_vs_static.py), enabled via the existing CLI flag. Static batch_generate
    remains unchanged as the default path.
  • Adds robustness fixes to the new path:
    • BatchKVCache rebatching to prevent shape/broadcast crashes when batch size changes mid-decode.
    • ModelRunner idle retirement to avoid hangs when the generator returns no responses.
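
To make the rebatching fix concrete, here is a minimal sketch of shrinking a dense KV slab when sequences retire mid-decode. It is in the spirit of, but not copied from, the PR's BatchKVCache; the class and method names are illustrative only.

```python
# Illustrative sketch (not this PR's BatchKVCache): shrink a dense KV slab so
# its batch dimension matches the sequences that are still decoding.
import mlx.core as mx

class SlotKV:
    """Toy dense KV slab: keys/values shaped [batch, n_heads, seq_len, head_dim]."""

    def __init__(self, keys: mx.array, values: mx.array):
        self.keys = keys
        self.values = values

    def rebatch(self, keep_rows: list[int]) -> None:
        # Drop rows for retired sequences so the next decode step sees a batch
        # dimension that matches its inputs, avoiding shape/broadcast errors.
        idx = mx.array(keep_rows)
        self.keys = mx.take(self.keys, idx, axis=0)
        self.values = mx.take(self.values, idx, axis=0)
```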

What’s included

  • Feature: Continuous batching runtime (dense SlotKV slabs, single-token ticks) integrated with server and bench script (see the tick-loop sketch after this list).
  • Fixes: rebatching + idle retire as above.
  • Tests:
    • tests/test_batch_kv_cache_batch_mismatch.py
    • tests/server_batched/test_model_runner_idle_retire.py
    • Parity scaffold tests/server_batched/test_continuous_parity_small_model.py (skips without HF token).
  • Bench artifacts: bench/logs/llama3p1_8b_dense.log, bench/logs/sweep_llama3p1_8b.log.
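
As referenced in the feature bullet above, the single-token-tick idea can be sketched roughly as follows. This is a hypothetical loop, not this PR's Scheduler/ModelRunner API; `Request` and `step_one_token` are placeholder names.

```python
# Hypothetical single-token tick loop: every active request advances by exactly
# one token per tick, and finished requests retire so new arrivals can join.
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt_tokens: list[int]
    max_tokens: int
    generated: list[int] = field(default_factory=list)

def run_ticks(active: list[Request], step_one_token, eos_id: int) -> None:
    while active:
        next_tokens = step_one_token(active)    # one batched forward pass per tick
        still_running = []
        for req, tok in zip(active, next_tokens):
            req.generated.append(tok)
            if tok != eos_id and len(req.generated) < req.max_tokens:
                still_running.append(req)
        active = still_running                  # retired requests free their KV rows
```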

Results (meta-llama/Llama-3.1-8B, prompt_len=128, max_tokens=64, n=8)

  • TTFT continuous vs static:
    • c=1: 0.29s vs 3.04s
    • c=2: 0.34s vs 3.03s
    • c=4: 0.49s vs 3.04s
    • c=8: 0.52–0.55s vs ~3.04s
  • Throughput (aggregate tok/s):
    • static ~168–169; continuous 90–152 depending on concurrency (TTFT win, 10–45% throughput gap).

Parity

  • Manual parity (logged) on Llama-3.1-8B: continuous tokens matched batch_generate exactly for prompts “Hello world” / “Goodbye moon” (tokens: [[0,358,4344,264],[11,24748,21725,198]]).
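
The check itself is straightforward; a hypothetical version looks like the snippet below (helper names are placeholders, and the real scaffold lives in tests/server_batched/test_continuous_parity_small_model.py).

```python
# Hypothetical parity assertion: the continuous path must emit exactly the
# same token ids as the static batch_generate path under greedy decoding.
def assert_parity(continuous_generate, static_generate, prompts, max_tokens=4):
    cont = continuous_generate(prompts, max_tokens=max_tokens)    # list[list[int]]
    static = static_generate(prompts, max_tokens=max_tokens)      # list[list[int]]
    assert cont == static, f"token mismatch: {cont} != {static}"
```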

How to validate

  • Unit/regression: PYTHONPATH=. python -m pytest tests/test_batch_kv_cache_batch_mismatch.py tests/server_batched/test_model_runner_idle_retire.py
  • Full suite: PYTHONPATH=. python -m pytest tests/server_batched
  • Bench (needs HF token + transfer):
    HF_HUB_ENABLE_HF_TRANSFER=1 HUGGINGFACE_HUB_TOKEN= PYTHONPATH=. python bench/bench_continuous_vs_static.py --repo meta-llama/Llama-3.1-8B --concurrency 4 --n 8 --max_tokens 64 --prompt_len 128

Sweep example in bench/logs/sweep_llama3p1_8b.log.

Scope / safety

  • Continuous batching is opt-in via flag; static batching remains default and untouched.
  • No API breakage; changes are additive and limited to the new runtime.

@Sohailm25
Author

Fix for:

#499
#548
#178

@Sohailm25
Author

@awni this is a much more scoped PR than the previous huge implementation. It proves that continuous batching can be done, and with a significant TTFT improvement.

Future improvements will target overall throughput; as of right now, it's smarter to use the standard path at low concurrency.

@otarkhan

Thanks for this PR! I was testing this locally and noticed that even with continuous batching enabled, requests are still being handled serially (blocking each other) because the standard HTTPServer is single-threaded. You should probably update this as well.

@Sohailm25
Author

> Thanks for this PR! I was testing this locally and noticed that even with continuous batching enabled, requests are still being handled serially (blocking each other) because the standard HTTPServer is single-threaded. You should probably update this as well.

Appreciate the heads up; I'll take a look at the PR and update. I'm not sure whether the HTTP server changes got left out of the commit.

@otarkhan

> Thanks for this PR! I was testing this locally and noticed that even with continuous batching enabled, requests are still being handled serially (blocking each other) because the standard HTTPServer is single-threaded. You should probably update this as well.
>
> Appreciate the heads up; I'll take a look at the PR and update. I'm not sure whether the HTTP server changes got left out of the commit.

Also, I like the extra logs the server shows now; they could be updated to include more info, such as t/s for each request, something like what llama.cpp or mistral.rs do.

@Sohailm25
Author

> Thanks for this PR! I was testing this locally and noticed that even with continuous batching enabled, requests are still being handled serially (blocking each other) because the standard HTTPServer is single-threaded. You should probably update this as well.

Thanks for this comment @otarkhan, I just pushed changes that include:

  • Threaded HTTP server (daemon threads) + locks for safe concurrency
  • Request-level throughput/latency logging

Some more detailed info on the changes:

  • The HTTP server is now threaded by default (a ThreadingHTTPServer subclass with daemon_threads=True): requests no longer serialize on a single accept loop, and locks were added around model loading and the prompt cache to keep them thread-safe
  • Request-level INFO logs include request_id, prompt_tps, generation_tps, total_tps, total_tokens, latency_ms, and finish_reason
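
For reference, here is a minimal sketch of the threading-plus-locks-plus-logging setup using only the standard library. The actual handler, lock granularity, and log fields in the server are more complete; the class and lock names below are illustrative.

```python
# Minimal sketch of threaded request handling with locks and per-request logging.
# Standard library only; names are illustrative, not the PR's actual code.
import logging
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

logging.basicConfig(level=logging.INFO)

model_lock = threading.Lock()         # guards lazy model loading
prompt_cache_lock = threading.Lock()  # guards the shared prompt cache

class APIServer(ThreadingHTTPServer):
    daemon_threads = True             # worker threads exit with the main process

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        start = time.perf_counter()
        length = int(self.headers.get("Content-Length", 0))
        _body = self.rfile.read(length)   # each request runs on its own thread
        with model_lock:
            pass                          # shared state (e.g. model load) taken under lock
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(b"{}")
        latency_ms = (time.perf_counter() - start) * 1000
        logging.info("request path=%s latency_ms=%.1f", self.path, latency_ms)

if __name__ == "__main__":
    APIServer(("127.0.0.1", 8080), Handler).serve_forever()
```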

Evidence

  • We ran a concurrency probe (model: mlx-community/Llama-3.1-8B-Instruct-4bit, continuous batching enabled): four parallel /v1/completions calls returned in 0.65–1.80s with distinct outputs, showing non-blocking handling (a reproduction sketch follows this list)

  • Bench on the same 8B model (n=8, concurrency=4, max_tokens=64, prompt_len=128):

    • Static: 179.25 tok/s, TTFT ≈ 2.86s
    • Continuous: 170.08 tok/s, TTFT ≈ 0.33s (≈8.5× faster first-token)
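
As mentioned above, the concurrency probe can be reproduced with something like the following, assuming a server listening on 127.0.0.1:8080 and the OpenAI-style /v1/completions route; the port, prompt, and payload fields are placeholders.

```python
# Hypothetical concurrency probe: fire four parallel /v1/completions requests
# and report per-request wall time. Port, model, and prompt are placeholders.
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://127.0.0.1:8080/v1/completions"

def one_call(i: int) -> float:
    body = json.dumps({"prompt": f"Request {i}: say hello.", "max_tokens": 32}).encode()
    req = urllib.request.Request(URL, data=body,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=4) as pool:
    for i, dt in enumerate(pool.map(one_call, range(4))):
        print(f"request {i}: {dt:.2f}s")
```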

