Add Continuous Batching (flagged) with cache/idle robustness fixes (w/ reproducible tests and benchmarks) #629
Conversation
@awni this is a much more scoped PR compared to the previous huge implementation. It shows that continuous batching can be done, and at a significant TTFT improvement; future work will target overall throughput. As of right now, the standard path is still the smarter choice for low concurrency.
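For anyone skimming: the core mechanism is a decode loop that admits and evicts requests between steps, instead of running a fixed batch to completion. A minimal sketch of that loop, assuming hypothetical `Request` and `model_step` helpers (this is not the PR's actual code):

```python
# Minimal sketch of a continuous-batching decode loop (illustrative only;
# `Request` and `model_step` are hypothetical, not the PR's actual API).
# New requests join between decode steps instead of waiting for the whole
# batch to finish, which is what cuts TTFT under concurrency.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt_tokens: list
    max_tokens: int
    generated: list = field(default_factory=list)

def continuous_batch_loop(waiting: deque, model_step, eos_id: int, max_batch: int = 8):
    active = []  # requests currently being decoded
    while waiting or active:
        # Admit new requests up to the batch cap between steps.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        # One decode step across the whole batch; in this sketch,
        # model_step handles prefill for newly admitted requests too.
        next_tokens = model_step(active)
        for req, tok in zip(active, next_tokens):
            req.generated.append(tok)
        # Evict finished requests immediately so their slots free up.
        active = [r for r in active
                  if r.generated[-1] != eos_id and len(r.generated) < r.max_tokens]
```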
Thanks for this PR! I was testing this locally and noticed that even with continuous batching enabled, requests are still being handled serially (blocking each other), because the standard HTTPServer is single-threaded; you should probably update this as well.
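For reference, the usual stdlib fix is swapping `HTTPServer` for `ThreadingHTTPServer` (available since Python 3.7), which serves each request on its own thread. A minimal sketch with a placeholder handler, not the server's actual handler class:

```python
# Sketch: ThreadingHTTPServer handles each request in its own thread, so a
# slow generation no longer blocks other clients. (Handler is illustrative.)
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok\n")

if __name__ == "__main__":
    # ThreadingHTTPServer is effectively HTTPServer + ThreadingMixIn.
    ThreadingHTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
```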
Yo, appreciate the heads up. I'll take a look and update; the HTTP server changes may not have been included in the commit.
Also, I like the extra logs the server shows now. They could be updated to include more info, such as t/s for each request, similar to what llama.cpp or mistral.rs do.
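Something along these lines would do it: wrap the token generator and log TTFT plus per-request t/s when it finishes. A rough sketch; the timing points and log format are illustrative, not the server's actual code:

```python
# Sketch of llama.cpp-style per-request throughput logging (illustrative).
import logging
import time

log = logging.getLogger("server")

def timed_generate(generate_fn, prompt_tokens: int):
    """Wrap a token generator and log per-request stats when it finishes."""
    t0 = time.perf_counter()
    ttft = None
    n = 0
    for token in generate_fn():  # generate_fn yields one token per step
        n += 1
        if ttft is None:
            ttft = time.perf_counter() - t0  # time to first token
        yield token
    total = time.perf_counter() - t0
    gen_time = max(total - (ttft or total), 1e-9)
    log.info("prompt: %d tok | ttft: %.2fs | gen: %d tok @ %.1f t/s",
             prompt_tokens, ttft or 0.0, n, n / gen_time)
```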
Thanks for this comment @otarkhan, just pushed changes that include:
Some more detailed info on the changes:
Evidence
Summary
Continuous batching is added behind a flag; the standard generation path remains unchanged as the default path.
What’s included
Results (meta-llama/Llama-3.1-8B, prompt_len=128, max_tokens=64, n=8)
Parity
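One way to sanity-check parity is a greedy-decoding comparison between the two paths. An illustrative sketch, with hypothetical `generate_static`/`generate_batched` entry points rather than the PR's actual test:

```python
# Illustrative parity check (not the PR's actual test): with greedy decoding
# (temperature=0) the batched path should emit exactly the same tokens as
# the standard path. Entry points are hypothetical.
def check_parity(generate_static, generate_batched, prompts, max_tokens=64):
    for prompt in prompts:
        a = generate_static(prompt, max_tokens=max_tokens, temperature=0.0)
        b = generate_batched(prompt, max_tokens=max_tokens, temperature=0.0)
        assert a == b, f"token mismatch for prompt {prompt!r}"
```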
How to validate
```sh
HF_HUB_ENABLE_HF_TRANSFER=1 HUGGINGFACE_HUB_TOKEN= PYTHONPATH=. \
  python bench/bench_continuous_vs_static.py --repo meta-llama/Llama-3.1-8B \
  --concurrency 4 --n 8 --max_tokens 64 --prompt_len 128
```
Sweep example in bench/logs/sweep_llama3p1_8b.log.
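To reproduce the concurrency behavior against a running server, a minimal client can fire n requests in parallel and record TTFT and total time per request. A sketch, assuming an OpenAI-style streaming /v1/completions endpoint on localhost (endpoint, port, and payload shape are assumptions, not the bench script's actual code):

```python
# Minimal concurrent TTFT/latency probe against a running server.
import concurrent.futures
import json
import time
import urllib.request

URL = "http://127.0.0.1:8080/v1/completions"  # assumed endpoint

def one_request(i: int):
    body = json.dumps({"prompt": "hello " * 128, "max_tokens": 64,
                       "stream": True}).encode()
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"})
    t0 = time.perf_counter()
    first = None
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # streamed response; first non-blank line ~ TTFT
            if line.strip() and first is None:
                first = time.perf_counter() - t0
    return i, first, time.perf_counter() - t0

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for i, ttft, total in pool.map(one_request, range(8)):
        print(f"req {i}: ttft={ttft:.2f}s total={total:.2f}s")
```

With continuous batching on, the later requests' TTFT should stay close to the first request's rather than stacking up serially.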
Scope / safety