TensorRT-LLM 0.15.0 Release #2531
Shixiaowei02 announced in Announcements
Hi,
We are very pleased to announce the 0.15.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Added support for EAGLE. Refer to `examples/eagle/README.md`.
- Added a `trtllm-serve` command to start a FastAPI based server.
- Added FP8 support for Nemotron NAS. Refer to `examples/nemotron_nas/README.md`.
- Added quantization support for the EXAONE model. Refer to `examples/exaone/README.md`.
- Enabled Medusa for Qwen2 models. Refer to `examples/medusa/README.md`.
- Added support for the `Qwen2ForSequenceClassification` model architecture.
- Added Python plugin support to simplify plugin development. Refer to `examples/python_plugin/README.md`.
- Enabled embedding sharing by default. Refer to `docs/source/performance/perf-best-practices.md` for information about the required conditions for embedding sharing.
- Extended the maximum supported `beam_width` to `256`.
- Added support for the LLaVA-OneVision model in the multimodal examples. Refer to `examples/multimodal/README.md`.
- Added support for prompt-lookup speculative decoding. Refer to `examples/prompt_lookup/README.md`.
- Added W4A8 quantization support for Llama models. Refer to `examples/llama/README.md`.
- Added a C++ example of the fast logits feature with the `executor` API. Refer to the “executorExampleFastLogits” section in `examples/cpp/executor/README.md`.
- Moved initialization work from `LLM.generate` to `LLM.__init__` for better generation performance without warmup.
- Added `n` and `best_of` arguments to the `SamplingParams` class. These arguments enable returning multiple generations for a single request (see the sketch after this list).
- Added `ignore_eos`, `detokenize`, `skip_special_tokens`, `spaces_between_special_tokens`, and `truncate_prompt_tokens` arguments to the `SamplingParams` class. These arguments enable more control over the tokenizer behavior.
- Added the `enable_prompt_adapter` argument to the `LLM` class and the `prompt_adapter_request` argument for the `LLM.generate` method. These arguments enable prompt tuning.
- Added support for a `gpt_variant` argument to the `examples/gpt/convert_checkpoint.py` file. This enhancement enables checkpoint conversion with more GPT model variants. Thanks to the contribution from @tonylek in Passing gpt_variant to model conversion (#2352).
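As a quick illustration of the new `SamplingParams` arguments, here is a minimal sketch against the LLM API. The checkpoint path is a placeholder, and the loop assumes the `RequestOutput`/`CompletionOutput` result shape returned by `LLM.generate`; adapt both to your setup.

```python
# Minimal sketch of the new SamplingParams arguments (placeholder model path).
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder checkpoint

params = SamplingParams(
    n=2,                   # return two generations per request
    best_of=4,             # sample four candidates, keep the best two
    ignore_eos=False,      # new tokenizer-behavior controls in 0.15.0
    skip_special_tokens=True,
)

for request_output in llm.generate(["Hello, my name is"], params):
    for candidate in request_output.outputs:
        print(candidate.text)
```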
API Changes
- Moved the `builder_force_num_profiles` option in the `trtllm-build` command to the `BUILDER_FORCE_NUM_PROFILES` environment variable.
- Modified the default values of the `BuildConfig` class so that they are aligned with the `trtllm-build` command.
- Removed the Python bindings of `GptManager`.
- `auto` is used as the default value for the `--dtype` option in quantize and checkpoint conversion scripts.
- Deprecated the `gptManager` API path in `gptManagerBenchmark`.
- Deprecated the `beam_width` and `num_return_sequences` arguments to the `SamplingParams` class in the LLM API. Use the `n`, `best_of` and `use_beam_search` arguments instead (see the migration sketch after this list).
- Exposed the `--trust_remote_code` argument to the OpenAI API server. (openai_server error #2357)
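A minimal migration sketch for the deprecated sampling arguments follows; the old call appears only as a comment, and the argument values are illustrative rather than prescriptive.

```python
# Migration sketch for the deprecated SamplingParams arguments in the LLM API.
from tensorrt_llm import SamplingParams

# Before 0.15.0 (now deprecated):
#   params = SamplingParams(beam_width=4, num_return_sequences=2)

# From 0.15.0 on, express the same request with n / best_of / use_beam_search:
params = SamplingParams(n=2, best_of=4, use_beam_search=True)
```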
Model Updates
- Added support for the Llama 3.2 model. Refer to `examples/mllama/README.md` for more details on the Llama 3.2-Vision model.
- Added support for the DeepSeek-V2 model. Refer to `examples/deepseek_v2/README.md`.
- Added support for the Cohere Command R models. Refer to `examples/commandr/README.md`.
- Added support for the Falcon 2 model. Refer to `examples/falcon/README.md`, thanks to the contribution from @puneeshkhanna in Add support for falcon2 (#1926).
- Added support for the InternVL2 model. Refer to `examples/multimodal/README.md`.
- Added support for additional models in the Nemotron family. Refer to `examples/nemotron`.
- Added support for more GPT model variants. Refer to `examples/gpt/README.md`.
- Added support for additional multimodal models. Refer to `examples/multimodal/README.md`.
Fixed Issues
- Fixed a bug where `moeTopK()` cannot find the correct expert when the number of experts is not a power of two. Thanks @dongjiyingdjy for reporting this bug.
- Fixed an assertion failure on `crossKvCacheFraction` for encoder-decoder models. (Assertion failed: Must set crossKvCacheFraction for encoder-decoder model #2419)
- Fixed typos in `docs/source/performance/perf-benchmarking.md`, thanks @MARD1NO for pointing it out in Small Typo (#2425).
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.10-py3`.
- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:24.10-py3`.
Documentation
We are updating the `main` branch regularly with new features, bug fixes and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequencies depend on your feedback.

Thanks,
The TensorRT-LLM Engineering Team
This discussion was created from the release TensorRT-LLM 0.15.0 Release.