v0.19.0 #4192
kaiyux announced in Announcements
TensorRT-LLM Release 0.19.0
Key Features and Enhancements
- … examples/deepseek_v3/README.md, also to the blog docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md.
- … PyExecutor.
- … PeftCacheManager support.
- … AutoTuner to both Fused MoE and NVFP4 Linear operators.
- … UserBuffers allocator.
- … examples/deepseek_v3/README.md.
- … tensorrt_llm._torch.auto_deploy. Refer to examples/auto_deploy/README.md for more details.
- … get_stats support.
- … examples/llm-api/llm_mgmn_*.sh.
- … examples/multimodal/README.md.
- … examples/mixtral/README.md.
- … examples/qwen2audio/README.md.
- … examples/language_adapter/README.md.
- … examples/stdit/README.md.
- … examples/vit/README.md.
- … examples/exaone/README.md.
- … examples/gemma/README.md.
- … examples/mmlu_llmapi.py.
- Added the --quantize_lm_head option in examples/quantization/quantize.py to support lm_head quantization.
- Added a /metrics endpoint for trtllm-serve to log iteration statistics.
- … trtllm-serve.
- … disaggServerBenchmark.
- … trtllm-bench.
- fp8_blockscale_gemm is now open-sourced.
- Added ENABLE_MULTI_DEVICE and ENABLE_UCX as CMake options.
- … PyExecutor inference flow to estimate max_num_tokens for kv_cache_manager.
- Added the TLLM_OVERRIDE_LAYER_NUM and TLLM_TRACE_MODEL_FORWARD environment variables for debugging.
- … __init__.py.
API Changes
- … kv_cache_retention_config from the C++ executor API to the LLM API.
- … BuildConfig arguments to LlmArgs.
- … DecoderState via bindings and integrated it in the decoder.
- … LlmArgs with Pydantic and migrated the remaining pybinding configurations to Python.
- Added numNodes to ParallelConfig.
Fixed Issues
- Fixed incorrect batch slot usage in the addCumLogProbs kernel. Thanks to the contribution from @aotman in Fix Incorrect Batch Slot Usage in addCumLogProbs Kernel #2787.
- … --extra-index-url https://pypi.nvidia.com when running pip install tensorrt-llm.
Infrastructure Changes
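The ENABLE_MULTI_DEVICE and ENABLE_UCX CMake options named above can be toggled at configure time. The fragment below is only a sketch: the source and build directory paths (and the chosen ON values) are assumptions for illustration, not taken from the release notes.

```shell
# Hypothetical configure step; -S/-B paths are placeholders.
cmake -S cpp -B build \
  -DENABLE_MULTI_DEVICE=ON \
  -DENABLE_UCX=ON
```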
Known Issues
This discussion was created from the release v0.19.0.
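As an illustration of the TLLM_OVERRIDE_LAYER_NUM and TLLM_TRACE_MODEL_FORWARD debugging variables listed in the notes above, the sketch below shows one way such variables could be consumed. The integer-override and "0"/"1" conventions are assumptions for the example, not TensorRT-LLM's actual implementation.

```python
import os

# Sketch only: variable names come from the release notes; the parsing
# conventions below are illustrative assumptions.
def effective_num_layers(model_num_layers: int) -> int:
    """Return the layer count, honoring an override env var if set."""
    override = os.environ.get("TLLM_OVERRIDE_LAYER_NUM")
    if override is not None:
        # Clamp to the model's real depth so an oversized override is safe.
        return min(int(override), model_num_layers)
    return model_num_layers

def trace_forward_enabled() -> bool:
    """True when per-forward tracing is requested (assumed "1" convention)."""
    return os.environ.get("TLLM_TRACE_MODEL_FORWARD", "0") == "1"

os.environ["TLLM_OVERRIDE_LAYER_NUM"] = "2"
print(effective_num_layers(32))  # prints 2
```

Truncating the layer count this way is a common trick for quickly reproducing runtime issues on a smaller model without rebuilding checkpoints.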