TensorRT-LLM Release 0.16.0 #2614
kaiyux announced in Announcements
Hi,
We are very pleased to announce the 0.16.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Added quantization support for RecurrentGemma. Refer to `examples/recurrentgemma/README.md`.
- Added ulysses context parallel support. Refer to `examples/llama/README.md`.
- Added the runtime `max_num_tokens` dynamic tuning feature, which can be enabled by setting `--enable_max_num_tokens_tuning` to `gptManagerBenchmark`.
- Added the `max_num_tokens` and `max_batch_size` arguments to control the runtime parameters.
- Added the `extended_runtime_perf_knob_config` to enable various performance configurations.
- Added `AutoAWQ` checkpoints support for Qwen. Refer to the "INT4-AWQ" section in `examples/qwen/README.md`.
- Added `AutoAWQ` and `AutoGPTQ` Hugging Face checkpoints support for LLaMA. (Is it possible load quantized model from huggingface? #2458)
- Added `allottedTimeMs` to the C++ `Request` class to support per-request timeout (see the sketch after this list).
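For the per-request timeout, here is a minimal sketch using the Python executor bindings. The engine path is a placeholder, and `allotted_time_ms` is assumed to be the snake_case binding of the C++ `allottedTimeMs` field; exact names may differ from your version.

```python
import tensorrt_llm.bindings.executor as trtllm

# Create an executor from a prebuilt engine (path is a placeholder).
executor = trtllm.Executor(
    "/path/to/engine_dir",
    trtllm.ModelType.DECODER_ONLY,
    trtllm.ExecutorConfig(1),  # max_beam_width = 1
)

# Assumption: the binding exposes the C++ allottedTimeMs field as
# allotted_time_ms; the request is terminated if it does not finish
# within the allotted 500 ms.
request = trtllm.Request(
    input_token_ids=[1, 2, 3, 4],
    max_tokens=32,
    allotted_time_ms=500,
)
request_id = executor.enqueue_request(request)
responses = executor.await_responses(request_id)
```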
API Changes
- Removed the `enable_xqa` argument from `trtllm-build`.
- Removed the `--use_embedding_sharing` flag from the convert checkpoints scripts.
- The `if __name__ == "__main__"` entry point is required for both single-GPU and multi-GPU cases when using the `LLM` API (a sketch follows this list).
- Added the `enable_chunked_prefill` flag to the `LlmArgs` of the `LLM` API.
- Integrated BERT and RoBERTa models into the `trtllm-build` command.
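As a reference, a minimal LLM API sketch showing the required entry-point guard; the model name is illustrative, and passing `enable_chunked_prefill` through the `LLM` constructor (which forwards keyword arguments to `LlmArgs`) is an assumption of this sketch.

```python
from tensorrt_llm import LLM, SamplingParams

def main():
    # Assumption of this sketch: keyword arguments to LLM are forwarded
    # to LlmArgs, so chunked prefill can be enabled here.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
              enable_chunked_prefill=True)
    sampling_params = SamplingParams(max_tokens=32)
    for output in llm.generate(["The capital of France is"], sampling_params):
        print(output.outputs[0].text)

# The guard is required because the LLM API may spawn worker processes,
# which re-import this module; unguarded top-level code would re-execute.
if __name__ == "__main__":
    main()
```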
Model Updates
- Added Qwen2-VL support. Refer to the "Qwen2-VL" section of `examples/multimodal/README.md`.
- Added multimodal evaluation examples. Refer to `examples/multimodal`.
- Added Stable Diffusion XL support. Refer to `examples/sdxl/README.md`. Thanks for the contribution from @Zars19 in "Support SDXL and its distributed inference" (#1514).
Fixed Issues
- Fixed `sampling_params` to only be set up if `end_id` is None and `tokenizer` is not None in the `LLM` API. Thanks to the contribution from @mfuntowicz in "[LLM] sampling_params should be setup only if end_id is None and tokenizer is not None" (#2573).
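To illustrate the behavior the fix preserves, a short sketch: an explicitly supplied `end_id` should be respected rather than re-derived from the tokenizer (the model name here is illustrative).

```python
from tensorrt_llm import LLM, SamplingParams

def main():
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # illustrative model
    # An explicit end_id must survive as-is; omit it to let the API derive
    # the default from the tokenizer (when a tokenizer is available).
    explicit = SamplingParams(max_tokens=16, end_id=2)
    outputs = llm.generate(["Hello, world"], explicit)
    print(outputs[0].outputs[0].text)

if __name__ == "__main__":
    main()
```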
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.11-py3`.
- The base Docker image for the TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:24.11-py3`.
Known Issues
- There is a known AllReduce performance issue on AMD-based CPU platforms with NCCL 2.23.4, which can be worked around by setting `export NCCL_P2P_LEVEL=SYS`.
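The documented workaround is the shell export above; as a convenience, the variable can also be set from Python before TensorRT-LLM initializes NCCL, assuming no communicator has been created yet in the process.

```python
import os

# Equivalent to `export NCCL_P2P_LEVEL=SYS`; must run before any NCCL
# communicator is created in this process.
os.environ["NCCL_P2P_LEVEL"] = "SYS"

from tensorrt_llm import LLM  # import after the environment is set
```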
We are updating the `main` branch regularly with new features, bug fixes and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequencies depend on your feedback.

Thanks,
The TensorRT-LLM Engineering Team