Releases · vllm-project/tpu-inference
v0.12.0
This release brings several new features and improvements for vLLM TPU Inference.
Highlights
**Async Scheduler**: Enabled the async scheduler in tpu-inference for improved performance on smaller models.
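A minimal sketch of turning this on, assuming the upstream vLLM `async_scheduling` engine argument is plumbed through to the TPU backend; the flag name and model below are illustrative:

```python
from vllm import LLM, SamplingParams

# Hypothetical example: enable async scheduling so request scheduling
# overlaps with model execution. The exact flag may vary by version.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    async_scheduling=True,
)
params = SamplingParams(temperature=0.0, max_tokens=64)
print(llm.generate(["Hello from TPU!"], params)[0].outputs[0].text)
```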
**Spec Decoding (EAGLE-3)**: Added support for the EAGLE-3 variant, with verified performance for Llama 3.1-8B.
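A hedged sketch of wiring up an EAGLE-3 draft model through vLLM's `speculative_config`; the draft checkpoint and config keys below are illustrative, so check the tpu-inference docs for the exact supported settings:

```python
from vllm import LLM, SamplingParams

# Hypothetical example: target model plus an EAGLE-3 draft model.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle3",                              # EAGLE-3 variant
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",  # example draft model
        "num_speculative_tokens": 3,                     # drafted tokens per step
    },
)
out = llm.generate(["Why is the sky blue?"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```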
**Out-of-Tree Model Support**: Load custom JAX models as plugins, enabling users to serve custom model architectures without forking or modifying vLLM internals.
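A minimal sketch of what such a plugin can look like, modeled on vLLM's general plugin mechanism; the exact registration hook tpu-inference exposes for JAX models may differ, and all names here are placeholders:

```python
# my_plugin/__init__.py  (hypothetical package layout)
def register():
    from vllm import ModelRegistry

    # Map the architecture string from the model's config.json to the
    # plugin's implementation class, given as an import path so the
    # model code is only loaded when actually needed.
    ModelRegistry.register_model(
        "MyJaxModelForCausalLM", "my_plugin.model:MyJaxModelForCausalLM"
    )
```

The plugin is then advertised through the `vllm.general_plugins` entry-point group in `pyproject.toml`, so vLLM discovers and calls `register()` at startup without any changes to vLLM or tpu-inference itself.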
**Automated CI/CD and Pre-merge Checks**: Improved the testing and validation pipeline with automated CI/CD and pre-merge checks to enhance stability and accelerate iteration. More improvements to come.
What's Changed
- [Bug Fix] Fix small bug in server-based profiling init by @jrplatin in #872
- [Disagg][Bugfix] add check for global devices in profiler start by @sixiang-google in #874
- [CI] Fix imports to catchup vLLM's recent update. by @hfan in #876
- [Kernel] Added a RPA V3 kernel variant optimized for head_dim=64 by @yaochengji in #875
- Update README.md by @bvrockwell in #880
- Remove convert_list_to_device_array to reduce latency between model forward pass by @Lumosis in #879
- fix mock device error in profile enabling by @sixiang-google in #878
- [RPA] Reduce VREG spill by optimize masking by @kyuyeunk in #818
- [Doc] Fixed the docker path for the quick start guide by @hosseinsarshar in #885
- Fix/docs links by @RobMulla in #873
- Light rewording of jax model development readme by @gpolovets1 in #871
- Revert "[CI] Fix imports to catchup vLLM's recent update." by @hfan in #887
- docs: Clarify support matrix messaging by @RobMulla in #886
- Docs rename reco page by @RobMulla in #888
- Update README.md by @bvrockwell in #897
- [Profiling] Pull Over the TPU Profiler from vLLM + add profiling docs by @jrplatin in #882
- [Misc] Fix various vLLM import issues by @jrplatin in #900
- Revert "[Misc] Fix various vLLM import issues" by @hfan in #902
- [Misc] Fix failing phased-based profiling test by @jrplatin in #905
- Added the docker login instructions by @hosseinsarshar in #891
- Unpin upstream vllm version by @jcyang43 in #904
- [Bug fix] Fix v7 HBM limit by @wenxindongwork in #903
- Enable spmd on lora by @vanbasten23 in #829
- Support --enforce-eager by @kyuyeunk in #907
- [CI] Fixes to catchup with vllm changes by @hfan in #912
- [Docker] Add V7X requirements and update Docker to accept option to build using it by @jrplatin in #916
- Fix the jax device ordering. by @wang2yn84 in #915
- Update the disagg multi host sh file to setup the disagg inference in… by @mrjunwan-lang in #922
- [Llama4/JAX] Refactor RoPE Scaling, QK Norm, and No-RoPE Layer Config Handling for Maverick by @sierraisland in #923
- [Bug fix + Qwix] Add JAX quantization YAMLs to WHL build + add fp8 quantization configs by @jrplatin in #929
- Enable multi-host P/D and adopt the vllm distributed executor changes by @mrjunwan-lang in #932
- fix the unittest to adopt vllm API changes by @mrjunwan-lang in #933
- [CI] Fix Qwen2.5 VL get_mrope_input_positions after vLLM change. by @kwang3939 in #934
- [Disagg] Use pathways resharding api to handle transfer by @sixiang-google in #935
- [Misc] Report TPU usage by @hfan in #925
- [CI] Use real vLLM ModelConfig object in init_device test by @hfan in #937
- update the ports to make the ports consistent in single host and multihost by @mrjunwan-lang in #938
- [Spec Decoding] Merge jitted helpers for eagle3 by @Lumosis in #920
- [GPT-OSS] JAX implementation of GPT-OSS by @bzgoogle in #861
- [Bug fixes] Update vLLM imports by @jrplatin in #947
- [Misc] Move numba installation to requirements.txt by @py4 in #948
- [Multi-host] Fix bugs in the deployment script by @Lumosis in #940
- Fix issues when running multiple LoRA tests on the v6e-8 machine. by @vanbasten23 in #926
- [Bug fixes] Fix a few more vLLM imports + Dockerfile typo by @jrplatin in #953
- Add the bgmv tests by @vanbasten23 in #942
- [MMLU] Add chat-template support for MMLU by @bzgoogle in #952
- [RPA] Add attention sink support to 64 dim variant of RPA kernel by @kyuyeunk in #958
- Revert "Add the bgmv tests" by @vanbasten23 in #963
- fix the vllm import issue for round_down by @mrjunwan-lang in #965
- Update docs to include installation guide with building from source. by @RobMulla in #949
- Reduce the host overhead for LoRA by @vanbasten23 in #930
- [GPT-OSS] uncomment sink related changes as the kernel_hd64.py was merged by @bzgoogle in #966
- Add bgmv test by @vanbasten23 in #964
- [CI] Skip build if only docs/icons changed by @boe20211 in #908
- [Spec Decoding] Fix precompilation by @Lumosis in #960
- fix the bug in kv transfer params is None by @mrjunwan-lang in #969
- [GPT-OSS] fix unstable sparse sum among different by @bzgoogle in #968
- fused Moe by @bythew3i in #973
- fix readme links to the docs by @RobMulla in #974
- [Feature] Code implementation of Async Scheduler by @cychiuak in #924
- [Misc] Fix observability config to prevent error from upstream by @py4 in #979
- add unit test for tpu_connector.py by @mrjunwan-lang in #980
- [Model] Add vision encoder and input embeddings merger warmup for Qwen2.5 VL model by @kwang3939 in #972
- Fix the test of multimodal manager by @kwang3939 in #986
- Fix the test of tpu_jax_runner by @kwang3939 in #989
- [Misc] Attempt to fix hash mismatch in CI if it's because of incomplete download by @py4 in #994
- [RPA] Update attention_sink to use prepare_inputs by @kyuyeunk in #993
- [Misc] Only run JAX unit tests and few e2e tests for each PR in CI. by @py4 in #995
- [Misc] Remove unused interfaces by @py4 in #990
- [Misc] Fix buildkite yaml format. by @py4 in #997
- Update README.md by @bvrockwell in #998
- Fix kv cache shape for head_dim=64 by @yaochengji in #976
- Add precommit hook for detecting missing init.py files by @jcyang43 in #1001
- Fix grid size calculation in qwen2.5-vl vision encoder warmup by @kwang3939 in #1004
- [Runner] Separate execute_model and sample_tokens to adapt upstream change. by @py4 in #1003
- [Misc] Change buildkite pipeline to run all steps but skip some through command by @py4 in https://github.com/vllm-project/tpu-i...