Releases · vllm-project/tpu-inference
v0.12.0
This release brings several new features and improvements for vLLM TPU Inference.
Highlights
**Async Scheduler**: Enabled the async scheduler in tpu-inference for improved performance on smaller models.
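A minimal sketch of turning this on, assuming the upstream vLLM `async_scheduling` engine argument is plumbed through to the TPU backend; the flag name and model below are illustrative:

```python
from vllm import LLM, SamplingParams

# Hypothetical example: enable async scheduling so request scheduling
# overlaps with model execution. The exact flag may vary by version.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    async_scheduling=True,
)
params = SamplingParams(temperature=0.0, max_tokens=64)
print(llm.generate(["Hello from TPU!"], params)[0].outputs[0].text)
```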
**Spec Decoding (EAGLE-3)**: Added support for the EAGLE-3 variant, with verified performance for Llama 3.1-8B.
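A hedged sketch of wiring up an EAGLE-3 draft model through vLLM's `speculative_config`; the draft checkpoint and config keys below are illustrative, so check the tpu-inference docs for the exact supported settings:

```python
from vllm import LLM, SamplingParams

# Hypothetical example: target model plus an EAGLE-3 draft model.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle3",                              # EAGLE-3 variant
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",  # example draft model
        "num_speculative_tokens": 3,                     # drafted tokens per step
    },
)
out = llm.generate(["Why is the sky blue?"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```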
**Out-of-Tree Model Support**: Load custom JAX models as plugins, enabling users to serve custom model architectures without forking or modifying vLLM internals.
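A minimal sketch of what such a plugin can look like, modeled on vLLM's general plugin mechanism; the exact registration hook tpu-inference exposes for JAX models may differ, and all names here are placeholders:

```python
# my_plugin/__init__.py  (hypothetical package layout)
def register():
    from vllm import ModelRegistry

    # Map the architecture string from the model's config.json to the
    # plugin's implementation class, given as an import path so the
    # model code is only loaded when actually needed.
    ModelRegistry.register_model(
        "MyJaxModelForCausalLM", "my_plugin.model:MyJaxModelForCausalLM"
    )
```

The plugin is then advertised through the `vllm.general_plugins` entry-point group in `pyproject.toml`, so vLLM discovers and calls `register()` at startup without any changes to vLLM or tpu-inference itself.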
**Automated CI/CD and Pre-merge Checks**: Improved the testing and validation pipeline with automated CI/CD and pre-merge checks to enhance stability and accelerate iteration. More improvements to come.
What's Changed
- [Bug Fix] Fix small bug in server-based profiling init by @jrplatin in #872
- [Disagg][Bugfix] add check for global devices in profiler start by @sixiang-google in #874
- [CI] Fix imports to catchup vLLM's recent update. by @hfan in #876
- [Kernel] Added a RPA V3 kernel variant optimized for head_dim=64 by @yaochengji in #875
- Update README.md by @bvrockwell in #880
- Remove convert_list_to_device_array to reduce latency between model forward pass by @Lumosis in #879
- fix mock device error in profile enabling by @sixiang-google in #878
- [RPA] Reduce VREG spill by optimize masking by @kyuyeunk in #818
- [Doc] Fixed the docker path for the quick start guide by @hosseinsarshar in #885
- Fix/docs links by @RobMulla in #873
- Light rewording of jax model development readme by @gpolovets1 in #871
- Revert "[CI] Fix imports to catchup vLLM's recent update." by @hfan in #887
- docs: Clarify support matrix messaging by @RobMulla in #886
- Docs rename reco page by @RobMulla in #888
- Update README.md by @bvrockwell in #897
- [Profiling] Pull Over the TPU Profiler from vLLM + add profiling docs by @jrplatin in #882
- [Misc] Fix various vLLM import issues by @jrplatin in #900
- Revert "[Misc] Fix various vLLM import issues" by @hfan in #902
- [Misc] Fix failing phased-based profiling test by @jrplatin in #905
- Added the docker login instructions by @hosseinsarshar in #891
- Unpin upstream vllm version by @jcyang43 in #904
- [Bug fix] Fix v7 HBM limit by @wenxindongwork in #903
- Enable spmd on lora by @vanbasten23 in #829
- Support --enforce-eager by @kyuyeunk in #907
- [CI] Fixes to catchup with vllm changes by @hfan in #912
- [Docker] Add V7X requirements and update Docker to accept option to build using it by @jrplatin in #916
- Fix the jax device ordering. by @wang2yn84 in #915
- Update the disagg multi host sh file to setup the disagg inference in… by @mrjunwan-lang in #922
- [Llama4/JAX] Refactor RoPE Scaling, QK Norm, and No-RoPE Layer Config Handling for Maverick by @sierraisland in #923
- [Bug fix + Qwix] Add JAX quantization YAMLs to WHL build + add fp8 quantization configs by @jrplatin in #929
- Enable multi-host P/D and adopt the vllm distributed executor changes by @mrjunwan-lang in #932
- fix the unittest to adopt vllm API changes by @mrjunwan-lang in #933
- [CI] Fix Qwen2.5 VL get_mrope_input_positions after vLLM change. by @kwang3939 in #934
- [Disagg] Use pathways resharding api to handle transfer by @sixiang-google in #935
- [Misc] Report TPU usage by @hfan in #925
- [CI] Use real vLLM ModelConfig object in init_device test by @hfan in #937
- update the ports to make the ports consistent in single host and multihost by @mrjunwan-lang in #938
- [Spec Decoding] Merge jitted helpers for eagle3 by @Lumosis in #920
- [GPT-OSS] JAX implementation of GPT-OSS by @bzgoogle in #861
- [Bug fixes] Update vLLM imports by @jrplatin in #947
- [Misc] Move numba installation to requirements.txt by @py4 in #948
- [Multi-host] Fix bugs in the deployment script by @Lumosis in #940
- Fix issues when running multiple LoRA tests on the v6e-8 machine. by @vanbasten23 in #926
- [Bug fixes] Fix a few more vLLM imports + Dockerfile typo by @jrplatin in #953
- Add the bgmv tests by @vanbasten23 in #942
- [MMLU] Add chat-template support for MMLU by @bzgoogle in #952
- [RPA] Add attention sink support to 64 dim variant of RPA kernel by @kyuyeunk in #958
- Revert "Add the bgmv tests" by @vanbasten23 in #963
- fix the vllm import issue for round_down by @mrjunwan-lang in #965
- Update docs to include installation guide with building from source. by @RobMulla in #949
- Reduce the host overhead for LoRA by @vanbasten23 in #930
- [GPT-OSS] uncomment sink related changes as the kernel_hd64.py was merged by @bzgoogle in #966
- Add bgmv test by @vanbasten23 in #964
- [CI] Skip build if only docs/icons changed by @boe20211 in #908
- [Spec Decoding] Fix precompilation by @Lumosis in #960
- fix the bug in kv transfer params is None by @mrjunwan-lang in #969
- [GPT-OSS] fix unstable sparse sum among different by @bzgoogle in #968
- fused Moe by @bythew3i in #973
- fix readme links to the docs by @RobMulla in #974
- [Feature] Code implementation of Async Scheduler by @cychiuak in #924
- [Misc] Fix observability config to prevent error from upstream by @py4 in #979
- add unit test for tpu_connector.py by @mrjunwan-lang in #980
- [Model] Add vision encoder and input embeddings merger warmup for Qwen2.5 VL model by @kwang3939 in #972
- Fix the test of multimodal manager by @kwang3939 in #986
- Fix the test of tpu_jax_runner by @kwang3939 in #989
- [Misc] Attempt to fix hash mismatch in CI if it's because of incomplete download by @py4 in #994
- [RPA] Update attention_sink to use prepare_inputs by @kyuyeunk in #993
- [Misc] Only run JAX unit tests and few e2e tests for each PR in CI. by @py4 in #995
- [Misc] Remove unused interfaces by @py4 in #990
- [Misc] Fix buildkite yaml format. by @py4 in #997
- Update README.md by @bvrockwell in #998
- Fix kv cache shape for head_dim=64 by @yaochengji in #976
- Add precommit hook for detecting missing init.py files by @jcyang43 in #1001
- Fix grid size calculation in qwen2.5-vl vision encoder warmup by @kwang3939 in #1004
- [Runner] Separate execute_model and sample_tokens to adapt upstream change. by @py4 in #1003
- [Misc] Change buildkite pipeline to run all steps but skip some through command by @py4 in https://github.com/vllm-project/tpu-i...