[None][chore] Weekly mass integration of release/1.1 #9343
base: main
Conversation
📝 Walkthrough

This PR encompasses kernel optimization tuning (FMHA v2, Cutlass heuristics), KV cache memory management improvements with a new persistent connector, PyTorch 2.9+ Dynamo compatibility fixes, memory profiling infrastructure, sampler enhancements, documentation updates, model support matrix expansions, and broad test coverage adjustments.

Changes
Sequence Diagram(s)

sequenceDiagram
participant App as Application
participant Leader as PersistentKvCacheConnectorLeader
participant Worker as PersistentKvCacheConnectorWorker
participant Disk as Disk Storage
participant GPU as GPU Memory
App->>Leader: Request KV cache connector
Leader->>Leader: Compute block hashes
rect rgb(220, 240, 255)
Note over Leader: Generation 1
App->>Worker: Register KV cache tensor
Worker->>GPU: Hold tensor reference
App->>Leader: New blocks to load
Leader->>Disk: Query cached blocks
Disk-->>Leader: Block data
Leader->>Worker: Load blocks command
Worker->>GPU: Load from disk → GPU
App->>Leader: Blocks to save
Leader->>Worker: Save blocks command
Worker->>Disk: Write blocks to disk
end
rect rgb(240, 255, 220)
Note over Leader: Generation 2 (cross-instance)
App->>Worker: Register KV cache tensor
App->>Leader: Load same prompt blocks
Leader->>Disk: Query cached blocks
Disk-->>Leader: Block data
Leader->>Worker: Load blocks command
Worker->>GPU: Load from disk → GPU
Worker-->>App: Cache hit - fast reuse
end
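The first diagram boils down to content-addressed block storage: hash each block of prompt tokens, persist the corresponding KV tensors to disk under that hash, and reload them when a later instance sees the same prefix. The sketch below is only a conceptual illustration of that hash/save/load cycle; BlockStore and block_hash are made-up names for this example, not the PersistentKvCacheConnector API.

```python
# Conceptual sketch of content-addressed KV block caching, as in the diagram above.
# BlockStore and block_hash are illustrative names, not the connector's real API.
import hashlib
from pathlib import Path

import torch


def block_hash(token_ids: list[int], parent: str = "") -> str:
    """Hash one block of prompt tokens, chained to the parent block's hash."""
    data = parent + "," + ",".join(map(str, token_ids))
    return hashlib.sha256(data.encode()).hexdigest()


class BlockStore:
    """Save/load per-block KV tensors keyed by their prompt-block hash."""

    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, h: str, kv_block: torch.Tensor) -> None:
        torch.save(kv_block.cpu(), self.root / f"{h}.pt")

    def load(self, h: str, device: str = "cpu") -> torch.Tensor | None:
        path = self.root / f"{h}.pt"
        return torch.load(path).to(device) if path.exists() else None


# Generation 1: save blocks after prefill; Generation 2: a new instance reloads
# the same blocks from disk instead of recomputing them.
store = BlockStore("/tmp/kv_blocks")
tokens = list(range(32))                  # one block worth of prompt tokens
h = block_hash(tokens)
store.save(h, torch.zeros(2, 8, 32, 64))  # stand-in for a real KV block
reused = store.load(h)                    # cache hit on the second generation
```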
sequenceDiagram
participant User as PyTorch Code
participant Export as torch.onnx.export
participant Dynamo as PyTorch Dynamo (≥2.9.0)
participant ONNX as ONNX Exporter
User->>Export: Call with dynamo=False, opset_version=17
Export->>Dynamo: Dynamo disabled (default skip)
Export->>ONNX: Use standard exporter
ONNX-->>User: ✓ Successful export
rect rgb(255, 240, 220)
Note over User,ONNX: Previous behavior (issue)
User->>Export: Call with opset_version=17 (no dynamo arg)
Export->>Dynamo: Dynamo enabled (default in 2.9+)
Dynamo-->>Export: ✗ Opset 17 incompatibility
end
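The second diagram corresponds to passing dynamo=False (together with the legacy opset) explicitly when exporting under PyTorch 2.9+, where the Dynamo-based exporter became the default. A minimal hedged sketch with a toy module (the real change touches the example ONNX export scripts in this PR, not this model):

```python
# Minimal sketch of the dynamo=False fallback described above; the toy model is
# an assumption for illustration, not the actual export code changed in this PR.
import torch


class TinyModel(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x) * 2.0


model = TinyModel().eval()
dummy = torch.randn(1, 8)

# On PyTorch >= 2.9 the Dynamo-based exporter is the default and rejects older
# opsets such as 17, so the fix is to request the standard exporter explicitly.
torch.onnx.export(
    model,
    (dummy,),
    "tiny_model.onnx",
    opset_version=17,
    dynamo=False,  # skip the Dynamo path, use the legacy TorchScript exporter
)
```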
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

Areas requiring extra attention:
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tensorrt_llm/llmapi/mpi_session.py (1)
562-603: Remove duplicate MPI variable prefixes.

The list of MPI variable prefixes contains duplicates: 'OMPI_', 'PMIX_', 'PMI_', and 'MPI_' each appear twice. This is inefficient and likely a copy-paste error. Apply this diff to remove the duplicates:

  mpi_vars = set(
      itertools.chain([
          var for var in current_env if var.startswith((
-             'MPI_',
              'OMPI_',
              'PMIX_',
              'PMI_',
-             'OMPI_',
-             'PMIX_',
-             'PMI_',
              'SLURM_',
              'MPI_',
              'UCX_',
              'I_MPI_',
              'HYDRA_',
              'KMP_',
              'MPICH_',
              'MV2_',
              'CRAY_',
          ))
      ], mpi_env_keys or []))

tests/unittest/_torch/sampler/test_trtllm_sampler.py (1)
15-32: Fix duplicate sampler_type keyword and align helper docstring.

LLM(...) currently passes sampler_type twice:

    cuda_graph_config=CudaGraphConfig(),
    sampler_type=sampler_type,
    kv_cache_config=trt_kv_cache_config,
    sampler_type="TRTLLMSampler",

This is invalid Python syntax and prevents the test module from even importing. It also hard-codes the TRTLLM sampler, defeating the new configurability.

You only need the dynamic sampler_type argument. While here, it's clearer if create_llm's docstring reflects that it's about the sampler, not overlap scheduling.

 def _create_llm_base(model_dir, enable_trtllm_sampler):
     """Base LLM creation with configurable sampler."""
     sampler_type = "TRTLLMSampler" if enable_trtllm_sampler else "TorchSampler"
 @@
     return LLM(
         model=str(model_dir),
         tensor_parallel_size=1,
         trust_remote_code=True,
         enable_chunked_prefill=True,
         cuda_graph_config=CudaGraphConfig(),
-        sampler_type=sampler_type,
-        kv_cache_config=trt_kv_cache_config,
-        sampler_type="TRTLLMSampler",
+        sampler_type=sampler_type,
+        kv_cache_config=trt_kv_cache_config,
         max_num_tokens=128  # Only one request longer than max_num_tokens is required to test chunked prefill
     )
 @@
-def create_llm(model_dir):
-    """Create LLM with specific overlap scheduler setting"""
-    return _create_llm_base(model_dir, enable_trtllm_sampler=True)
+def create_llm(model_dir):
+    """Create LLM with TRTLLM sampler enabled."""
+    return _create_llm_base(model_dir, enable_trtllm_sampler=True)

Also applies to: 35-42
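As a side note on the "invalid Python syntax" point above, the failure happens at compile time, before any test runs; a tiny self-contained demonstration (the call name f is arbitrary):

```python
# Duplicated keyword arguments are rejected when the module is compiled,
# which is why the test file cannot even be imported.
try:
    compile("f(sampler_type='a', sampler_type='b')", "<snippet>", "eval")
except SyntaxError as err:
    print(f"rejected as expected: {err.msg}")
```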
cpp/tensorrt_llm/common/opUtils.cpp (1)
213-236: Fix race in observer cleanup before dereferencing.
mObservers is nulled inside PerCudaCtxPerThreadSingletonCreator::~PerCudaCtxPerThreadSingletonCreator() while holding mMutex. In the deleter we check mObservers before locking, but if the destructor wins the race after that check yet before we acquire the lock, mObservers becomes nullptr and the subsequent mObservers->find(key) dereferences a null pointer, crashing during process teardown. Re-check inside the critical section (and bail early) before touching the map.

-    std::lock_guard<std::mutex> lk{mMutex};
-    // Must check observer again because another thread may created new instance for this ctx and this
-    // thread just before we lock mMutex. We can't infer that the observer is stale from the fact that
-    // obj is destroyed, because shared_ptr ref-count checking and observer removing are not in one
-    // atomic operation, and the observer may be changed to observe another instance.
-    auto it = mObservers->find(key);
+    std::lock_guard<std::mutex> lk{mMutex};
+    if (mObservers == nullptr)
+    {
+        return;
+    }
+    // Must check observer again because another thread may created new instance for this ctx and this
+    // thread just before we lock mMutex. We can't infer that the observer is stale from the fact that
+    // obj is destroyed, because shared_ptr ref-count checking and observer removing are not in one
+    // atomic operation, and the observer may be changed to observe another instance.
+    auto it = mObservers->find(key);
🧹 Nitpick comments (10)
tests/integration/test_lists/test-db/l0_sanity_check.yml (1)
28-28: Explicit timeout annotation added for speculative decoding test.

The change adds a 90-minute timeout annotation to test_llmapi_speculative_decoding_mtp. This is appropriate for tests that perform heavy computational work (e.g., speculative decoding with multi-token prediction).

Observation: The related speculative decoding tests on lines 29–30 (eagle3, ngram) do not have explicit timeout annotations. If these tests are similarly compute-intensive, consider adding timeouts to them as well for consistency.

tensorrt_llm/llmapi/mpi_session.py (1)
544-548: Remove redundant imports and consider platform compatibility.

The function has the following issues:

- os is already imported at line 3, so the import inside the function is redundant. The tempfile and uuid imports should be moved to the top of the file for consistency.
- Despite the name suggesting it finds a "free" address, the function only generates a UUID-based path without verifying availability or checking platform compatibility (ZMQ IPC is Unix-specific and may not work on Windows).

Apply this diff to remove redundant imports:

+import tempfile
+import uuid
+
 import zmq
 ...
 def find_free_ipc_addr() -> str:
-    import os
-    import tempfile
-    import uuid
     return f'ipc://{os.path.join(tempfile.gettempdir(), "rpc_" + str(uuid.uuid4()))}'

Consider adding a docstring that clarifies this function generates a unique IPC address path but doesn't verify platform compatibility or actual availability.
tensorrt_llm/commands/serve.py (1)
702-706: Update or remove outdated TODO comment.

The TODO comment on line 702 mentions "Make the port allocation atomic," but the code has migrated from TCP ports to IPC addresses. UUID-based IPC address generation is already effectively atomic (collision probability is negligible), making this TODO comment obsolete.

Apply this diff to remove the outdated comment:

-    # This mimics the behavior of trtllm-llmapi-launch
-    # TODO: Make the port allocation atomic
+    # This mimics the behavior of trtllm-llmapi-launch with IPC-based communication
     free_ipc_addr = find_free_ipc_addr()

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
953-978: Consider documenting the output_dtype override behavior.

The forward_fake method accepts an output_dtype parameter but always overrides it with torch.bfloat16 when calling the superclass. This mirrors the pattern in TRTLLMGenFusedMoE (see relevant_code_snippets lines 834-864), suggesting it's a backend constraint, but it could be confusing for callers who pass a different dtype expecting it to be respected.

Consider either:

- Documenting this override behavior in a docstring or comment
- Removing the parameter from the signature if it's never used (though this would break API compatibility)
- Adding an assertion or warning when output_dtype is provided and differs from torch.bfloat16

Based on learnings

If this override is intentional due to backend limitations, you can add a brief comment:

 def forward_fake(
     self,
     x: Union[torch.Tensor, Fp4QuantizedTensor],
     router_logits: torch.Tensor,
     *,
     do_finalize: bool = True,
     output_dtype: Optional[torch.dtype] = None,
     all_rank_num_tokens: Optional[List[int]] = None,
     use_dp_padding: Optional[bool] = None,
     **kwargs,
 ) -> Union[torch.Tensor, List[torch.Tensor]]:
+    # WideEPMoE only supports bfloat16 output in forward_fake
     moe_output = super().forward_fake(
         x,
         router_logits,
         do_finalize=do_finalize,
         output_dtype=torch.bfloat16,
         all_rank_num_tokens=all_rank_num_tokens,
         use_dp_padding=use_dp_padding,
         **kwargs)

tests/unittest/_torch/modules/test_fused_moe.py (1)
501-504: Consider adding strict=True to zip for defensive programming.

The zip() call on line 503 unpacks tuples to map arguments to the executor. While the list comprehension ensures all tuples have the same length, adding strict=True (Python 3.10+) makes the intent explicit and provides early detection if the pattern changes in the future.

Apply this diff if the codebase targets Python 3.10+:

 results = executor.map(
     per_rank_test_fused_moe_alltoall,
-    *zip(*[(i, weights_world[i], x_list_world[i])
-           for i in range(world_size)]))
+    *zip(*[(i, weights_world[i], x_list_world[i])
+           for i in range(world_size)], strict=True))

tensorrt_llm/_torch/pyexecutor/sampler.py (1)
714-716: Consider using a set for faster membership check.

The fast path uses new_token in stop_words_list, which performs an O(n) linear search. Converting stop_words_list to a set before the check would provide O(1) lookup, especially beneficial when there are many single-token stop words.

 # Fast path: all stop words are single tokens
 if max_stop_word_length == 1:
-    return new_token in stop_words_list
+    return new_token in set(stop_words_list)
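Since that fast path runs once per generated token, the conversion can also be hoisted so the set is built once per request rather than on every check. A small illustrative sketch of that variant (the names here are assumptions for the example, not the sampler's actual fields):

```python
# Illustrative variant of the suggestion above: build the stop-token set once per
# request, then do O(1) membership checks per generated token.
stop_words_list = [13, 50256, 2]            # hypothetical single-token stop ids
single_token_stops = set(stop_words_list)   # built once, outside the per-token loop


def hits_single_token_stop(new_token: int) -> bool:
    # Average O(1) lookup instead of an O(n) list scan on every token.
    return new_token in single_token_stops


assert hits_single_token_stop(13) and not hits_single_token_stop(7)
```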
tests/unittest/_torch/sampler/test_trtllm_sampler.py (2)

84-118: Clarify which sampler this stop-token test is exercising.

test_trtllm_sampler_with_stop_token_ids calls create_llm_with_torch_sampler, so it actually runs with TorchSampler, despite the test name and the emphasis on a "fast path optimization" (which sounds like TRTLLM-specific behavior).

If the intent is to validate the TRTLLM sampler's stop-token fast path, consider switching to create_llm(model_path). If instead you meant this to be a TorchSampler regression test, renaming the test (and optionally its docstring) would avoid confusion and better document coverage.

-    llm = create_llm_with_torch_sampler(model_path)
+    llm = create_llm(model_path)

Alternatively, keep the implementation but rename to something like test_torch_sampler_with_stop_token_ids if TorchSampler is the intended target.
120-149: Multi-token stop-word test looks good; consider stronger assertion if needed.

The TorchSampler multi-token stop-word test is well-structured: it explicitly verifies that stop_string tokenizes to multiple tokens and asserts that the returned text is non-empty and does not contain the stop string.

One limitation is that this doesn't guarantee the stop condition was actually triggered (the model might simply never emit "\n\n"). If you need stronger coverage, you could additionally inspect token sequences or log/debug internal stop-word hits, but that's optional given current scope.

tests/integration/test_lists/qa/llm_function_nim.txt (1)
401-404: Updated multimodal quickstart node ids look sane; verify against test_e2e.py.

The four test_ptp_quickstart_multimodal_* entries now use the ...-image variants without the older numeric suffixes, which aligns with the surrounding naming patterns.

Please double-check that test_e2e.py defines these exact parametrized ids so the scheduler won't point at stale node names.

tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
719-730: Logic correctly enforces memory limits during estimation.

The condition at line 721 now applies the min constraint when either free_gpu_memory_fraction is set OR enforce_memory_limit is True. This ensures that during KV cache estimation, computed memory limits are respected even if free_gpu_memory_fraction is None.

The warning message at lines 723-725 might be slightly misleading when enforce_memory_limit=True but free_gpu_memory_fraction=None, though in practice free_gpu_memory_fraction is typically set during estimation. Consider updating the warning message to reflect the new enforcement condition.
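Put differently, the estimation path now clamps to the computed cap whenever either knob requests it. A condensed, illustrative sketch of that decision (simplified names, not the actual resource_manager code):

```python
# Illustrative sketch of the enforcement condition described above; names and the
# surrounding logic are simplified assumptions, not the real resource_manager code.
from typing import Optional


def resolve_max_kv_tokens(
    requested_max_tokens: int,
    computed_limit_tokens: int,
    free_gpu_memory_fraction: Optional[float],
    enforce_memory_limit: bool,
) -> int:
    """Clamp the token budget to the computed memory limit when asked to."""
    if free_gpu_memory_fraction is not None or enforce_memory_limit:
        # During KV-cache estimation the computed limit wins even when
        # free_gpu_memory_fraction is None, as long as enforcement is requested.
        return min(requested_max_tokens, computed_limit_tokens)
    return requested_max_tokens


assert resolve_max_kv_tokens(8192, 4096, None, enforce_memory_limit=True) == 4096
assert resolve_max_kv_tokens(8192, 4096, None, enforce_memory_limit=False) == 8192
```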
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (49)
- README.md (1 hunks)
- cpp/kernels/fmha_v2/setup.py (1 hunks)
- cpp/tensorrt_llm/common/opUtils.cpp (5 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1 hunks)
- cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp (1 hunks)
- docs/source/blogs/H100vsA100.md (1 hunks)
- docs/source/blogs/H200launch.md (1 hunks)
- docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md (1 hunks)
- docs/source/features/disagg-serving.md (1 hunks)
- docs/source/index.rst (1 hunks)
- docs/source/legacy/reference/multimodal-feature-support-matrix.md (1 hunks)
- docs/source/models/supported-models.md (1 hunks)
- docs/source/overview.md (1 hunks)
- docs/source/quick-start-guide.md (1 hunks)
- examples/llm-api/extra-llm-api-config.yml (1 hunks)
- examples/llm-api/llm_kv_cache_connector.py (5 hunks)
- examples/llm-api/llm_mgmn_llm_distributed.sh (1 hunks)
- examples/models/contrib/dit/vae_decoder_trt.py (1 hunks)
- examples/models/core/multimodal/README.md (1 hunks)
- examples/models/core/qwenvl/vit_onnx_trt.py (1 hunks)
- examples/sample_weight_stripping/README.md (1 hunks)
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (4 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (2 hunks)
- tensorrt_llm/commands/serve.py (2 hunks)
- tensorrt_llm/llmapi/mpi_session.py (1 hunks)
- tensorrt_llm/tools/multimodal_builder.py (1 hunks)
- tests/integration/defs/accuracy/references/mmmu.yaml (1 hunks)
- tests/integration/defs/accuracy/test_disaggregated_serving.py (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (4 hunks)
- tests/integration/defs/disaggregated/test_disaggregated.py (4 hunks)
- tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (1 hunks)
- tests/integration/defs/test_e2e.py (10 hunks)
- tests/integration/test_lists/qa/llm_function_core.txt (4 hunks)
- tests/integration/test_lists/qa/llm_function_l20.txt (1 hunks)
- tests/integration/test_lists/qa/llm_function_multinode.txt (1 hunks)
- tests/integration/test_lists/qa/llm_function_nim.txt (2 hunks)
- tests/integration/test_lists/qa/llm_function_rtx6k.txt (1 hunks)
- tests/integration/test_lists/test-db/l0_a10.yml (1 hunks)
- tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml (0 hunks)
- tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
- tests/integration/test_lists/test-db/l0_sanity_check.yml (1 hunks)
- tests/integration/test_lists/waives.txt (0 hunks)
- tests/unittest/_torch/modules/test_fused_moe.py (5 hunks)
- tests/unittest/_torch/sampler/test_trtllm_sampler.py (3 hunks)
- tests/unittest/llmapi/apps/openai_server.py (1 hunks)
💤 Files with no reviewable changes (2)
- tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml
- tests/integration/test_lists/waives.txt
🧰 Additional context used
🧠 Learnings (49)
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
Repo: NVIDIA/TensorRT-LLM PR: 7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
Applied to files:
docs/source/features/disagg-serving.md, docs/source/blogs/H100vsA100.md, docs/source/quick-start-guide.md, docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md, README.md, docs/source/blogs/H200launch.md
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
docs/source/features/disagg-serving.md, docs/source/overview.md, README.md
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
docs/source/blogs/H100vsA100.md, docs/source/overview.md, README.md, tests/integration/test_lists/test-db/l0_a10.yml, tests/integration/test_lists/qa/llm_function_l20.txt, tests/integration/test_lists/test-db/l0_sanity_check.yml, tests/integration/test_lists/qa/llm_function_core.txt, tests/integration/test_lists/qa/llm_function_rtx6k.txt, tests/integration/test_lists/test-db/l0_h100.yml, tests/integration/test_lists/qa/llm_function_multinode.txt, tests/integration/defs/accuracy/test_llm_api_pytorch.py, tests/unittest/_torch/sampler/test_trtllm_sampler.py, tests/integration/test_lists/qa/llm_function_nim.txt, tests/integration/defs/test_e2e.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
docs/source/blogs/H100vsA100.md, docs/source/overview.md, docs/source/quick-start-guide.md, README.md, tests/integration/test_lists/test-db/l0_a10.yml, tests/integration/test_lists/qa/llm_function_l20.txt, tests/integration/test_lists/test-db/l0_sanity_check.yml, tests/integration/test_lists/qa/llm_function_core.txt, tests/integration/test_lists/qa/llm_function_rtx6k.txt, tests/integration/test_lists/test-db/l0_h100.yml, tests/integration/test_lists/qa/llm_function_multinode.txt, tests/integration/defs/accuracy/test_llm_api_pytorch.py, tests/unittest/_torch/sampler/test_trtllm_sampler.py, tests/integration/test_lists/qa/llm_function_nim.txt, tests/integration/defs/test_e2e.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
docs/source/blogs/H100vsA100.md, docs/source/overview.md, README.md, tests/integration/test_lists/test-db/l0_a10.yml, docs/source/blogs/H200launch.md, tests/integration/test_lists/qa/llm_function_rtx6k.txt, tests/integration/test_lists/qa/llm_function_multinode.txt, tests/unittest/_torch/sampler/test_trtllm_sampler.py, tests/integration/test_lists/qa/llm_function_nim.txt
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
Repo: NVIDIA/TensorRT-LLM PR: 6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.
Applied to files:
docs/source/blogs/H100vsA100.md, docs/source/overview.md, README.md, docs/source/blogs/H200launch.md
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.
Applied to files:
docs/source/blogs/H100vsA100.md, cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
Repo: NVIDIA/TensorRT-LLM PR: 7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.
Applied to files:
docs/source/blogs/H100vsA100.md
📚 Learning: 2025-08-27T14:23:55.566Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.566Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.
Applied to files:
docs/source/overview.md, README.md
📚 Learning: 2025-09-18T05:41:45.847Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7120
File: tensorrt_llm/llmapi/llm.py:690-697
Timestamp: 2025-09-18T05:41:45.847Z
Learning: Kimi model support is currently focused on the PyTorch backend path, with TRT path support potentially coming later.
Applied to files:
docs/source/overview.md
📚 Learning: 2025-08-21T21:48:35.135Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.135Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.
Applied to files:
docs/source/overview.md, README.md, cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Applied to files:
docs/source/overview.md, README.md
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.
Applied to files:
docs/source/overview.md, README.md
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md
📚 Learning: 2025-08-27T17:50:13.264Z
Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.
Applied to files:
README.md
📚 Learning: 2025-08-14T15:43:23.107Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.
Applied to files:
README.md
📚 Learning: 2025-09-16T09:30:09.716Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7763
File: cpp/tensorrt_llm/CMakeLists.txt:297-301
Timestamp: 2025-09-16T09:30:09.716Z
Learning: In the TensorRT-LLM project, NCCL libraries are loaded earlier by PyTorch libraries or the bindings library, so the main shared library doesn't need NCCL paths in its RPATH - the libraries will already be available in the process address space when needed.
Applied to files:
README.md
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Applied to files:
examples/llm-api/llm_mgmn_llm_distributed.sh, tensorrt_llm/_torch/pyexecutor/_util.py, tensorrt_llm/_torch/pyexecutor/resource_manager.py, cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp
📚 Learning: 2025-09-17T02:48:52.732Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7781
File: tests/integration/test_lists/waives.txt:313-313
Timestamp: 2025-09-17T02:48:52.732Z
Learning: In TensorRT-LLM, `tests/integration/test_lists/waives.txt` is specifically for waiving/skipping tests, while other test list files like those in `test-db/` and `qa/` directories are for different test execution contexts (pre-merge, post-merge, QA tests). The same test appearing in both waives.txt and execution list files is intentional - the test is part of test suites but will be skipped due to the waiver.
Applied to files:
tests/integration/test_lists/test-db/l0_a10.yml, tests/integration/test_lists/test-db/l0_sanity_check.yml, tests/integration/test_lists/qa/llm_function_core.txt, tests/integration/test_lists/qa/llm_function_rtx6k.txt, tests/integration/test_lists/test-db/l0_h100.yml, tests/integration/test_lists/qa/llm_function_multinode.txt, tests/integration/test_lists/qa/llm_function_nim.txt
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").
Applied to files:
tests/integration/test_lists/test-db/l0_a10.yml, tests/integration/test_lists/qa/llm_function_core.txt, tests/integration/test_lists/qa/llm_function_rtx6k.txt, tests/integration/test_lists/test-db/l0_h100.yml, tests/integration/test_lists/qa/llm_function_multinode.txt, tests/integration/defs/accuracy/test_llm_api_pytorch.py, tests/unittest/_torch/sampler/test_trtllm_sampler.py, tests/integration/defs/test_e2e.py
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
Repo: NVIDIA/TensorRT-LLM PR: 7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.
Applied to files:
tests/integration/test_lists/test-db/l0_a10.yml, tests/integration/defs/test_e2e.py
📚 Learning: 2025-08-09T02:04:49.623Z
Learnt from: Fridah-nv
Repo: NVIDIA/TensorRT-LLM PR: 6760
File: tensorrt_llm/_torch/auto_deploy/models/quant_config_reader.py:81-98
Timestamp: 2025-08-09T02:04:49.623Z
Learning: In TensorRT-LLM's auto_deploy module, torch.dtype values in configuration dictionaries must be stored as string representations (e.g., "float16" instead of torch.float16) because OmegaConf.merge does not support torch.dtype types. These string representations are converted to actual torch.dtype objects in downstream code.
Applied to files:
tests/integration/test_lists/qa/llm_function_l20.txt, tests/integration/test_lists/qa/llm_function_core.txt
📚 Learning: 2025-10-20T17:09:21.560Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py:180-182
Timestamp: 2025-10-20T17:09:21.560Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py, the _gated_rmsnorm_replacement function does not need to cast the output of torch.ops.auto_deploy.torch_rmsnorm_gated back to the input dtype, even though the custom op returns fp32. The dtype handling is managed elsewhere or the fp32 output is acceptable for downstream consumers.
Applied to files:
tests/integration/test_lists/qa/llm_function_l20.txt, tests/integration/test_lists/qa/llm_function_core.txt
📚 Learning: 2025-08-13T11:07:11.772Z
Learnt from: Funatiq
Repo: NVIDIA/TensorRT-LLM PR: 6754
File: tests/integration/test_lists/test-db/l0_a30.yml:41-47
Timestamp: 2025-08-13T11:07:11.772Z
Learning: In TensorRT-LLM test configuration files like tests/integration/test_lists/test-db/l0_a30.yml, TIMEOUT values are specified in minutes, not seconds.
Applied to files:
tests/integration/test_lists/test-db/l0_sanity_check.yml, tests/integration/test_lists/qa/llm_function_core.txt, tests/integration/defs/accuracy/test_disaggregated_serving.py, tests/integration/test_lists/qa/llm_function_multinode.txt
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py, tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-08-06T03:47:16.802Z
Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid and distinct model family from Mistral AI, separate from their regular Mistral models. Ministral 8B is specifically designed for edge computing and on-device applications, released in October 2024. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".
Applied to files:
docs/source/legacy/reference/multimodal-feature-support-matrix.md
📚 Learning: 2025-10-22T06:53:47.017Z
Learnt from: xinhe-nv
Repo: NVIDIA/TensorRT-LLM PR: 8534
File: scripts/format_test_list.py:1-6
Timestamp: 2025-10-22T06:53:47.017Z
Learning: The file `scripts/format_test_list.py` in the TensorRT-LLM repository does not require the NVIDIA Apache-2.0 copyright header.
Applied to files:
tests/integration/test_lists/qa/llm_function_rtx6k.txt
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Applied to files:
tests/integration/test_lists/qa/llm_function_rtx6k.txt, cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.
Applied to files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-09-17T06:01:01.836Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7785
File: tests/integration/defs/perf/utils.py:321-333
Timestamp: 2025-09-17T06:01:01.836Z
Learning: In test infrastructure code for disaggregated serving tests, prefer logging errors and continuing execution rather than raising exceptions on timeout, to avoid disrupting test cleanup and causing cascading failures.
Applied to files:
tests/integration/defs/accuracy/test_disaggregated_serving.py
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.
Applied to files:
cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp, examples/llm-api/extra-llm-api-config.yml
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
Applied to files:
cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp, cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp, cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.
Applied to files:
cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py, tests/unittest/_torch/modules/test_fused_moe.py, cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py, tests/unittest/_torch/modules/test_fused_moe.py
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
Repo: NVIDIA/TensorRT-LLM PR: 6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.
Applied to files:
cpp/tensorrt_llm/common/opUtils.cpp
📚 Learning: 2025-08-25T00:03:39.294Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1185-1189
Timestamp: 2025-08-25T00:03:39.294Z
Learning: TLLM_CHECK_WITH_INFO is a host-side utility function and cannot be called from CUDA device functions (those marked with __device__ or __global__). In device code, assert() is the primary mechanism for handling "should never happen" conditions, and like standard C++ assert, CUDA's assert only works in debug builds and is compiled out in release builds.
Applied to files:
cpp/tensorrt_llm/common/opUtils.cpp
📚 Learning: 2025-10-13T19:45:03.518Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: tests/unittest/_torch/multi_gpu/test_nccl_device.py:138-149
Timestamp: 2025-10-13T19:45:03.518Z
Learning: In test_nccl_device.py, the NCCL device AllReduce implementation compares the entire residual tensor on each rank, unlike the UB implementation which compares per-rank chunks. The residual chunking calculations in the test are intentionally overridden to reflect this design difference.
Applied to files:
tests/unittest/_torch/modules/test_fused_moe.py
📚 Learning: 2025-08-18T08:42:02.640Z
Learnt from: samuellees
Repo: NVIDIA/TensorRT-LLM PR: 6974
File: tensorrt_llm/serve/scripts/benchmark_dataset.py:558-566
Timestamp: 2025-08-18T08:42:02.640Z
Learning: In TensorRT-LLM's RandomDataset (tensorrt_llm/serve/scripts/benchmark_dataset.py), when using --random-token-ids option, sequence length accuracy is prioritized over semantic correctness for benchmarking purposes. The encode/decode operations should use skip_special_tokens=True and add_special_tokens=False to ensure exact target token lengths.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py, examples/llm-api/extra-llm-api-config.yml
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py, examples/llm-api/extra-llm-api-config.yml
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
📚 Learning: 2025-08-27T15:03:57.149Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:368-392
Timestamp: 2025-08-27T15:03:57.149Z
Learning: In TensorRT-LLM's sampler.py, int32 usage for softmax_indices and related tensor indexing is intentional and should not be changed to int64. The torch.IntTensor type hint is correct for the sample() function's softmax_indices parameter.
Applied to files:
tests/unittest/_torch/sampler/test_trtllm_sampler.py
📚 Learning: 2025-08-22T01:54:35.850Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h:999-1000
Timestamp: 2025-08-22T01:54:35.850Z
Learning: The `internal_cutlass_kernels` directory in TensorRT-LLM is a mirror of an internal NVIDIA repository and maintains its own implementation and API that may diverge from the public `cutlass_kernels` version. API inconsistencies between these two directories are intentional and by design, not bugs to be fixed.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
📚 Learning: 2025-08-08T04:10:19.038Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:19.038Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.
Applied to files:
examples/llm-api/extra-llm-api-config.yml
🧬 Code graph analysis (9)
cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp (2)
tests/unittest/utils/util.py (1)
isSM100Family (98-100)
cpp/include/tensorrt_llm/common/cudaUtils.h (1)
isSM100Family(321-325)
tensorrt_llm/commands/serve.py (2)
tensorrt_llm/llmapi/mpi_session.py (1)
find_free_ipc_addr (544-548)
tensorrt_llm/executor/utils.py (1)
LlmLauncherEnvs(22-29)
examples/llm-api/llm_kv_cache_connector.py (1)
tensorrt_llm/llmapi/llm.py (2)
LLM (1101-1117)
generate (259-341)
tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (2)
tensorrt_llm/llmapi/llm_args.py (1)
KvCacheConfig (1426-1570)
cpp/tensorrt_llm/executor/kvCacheConfig.cpp (1)
KvCacheConfig(24-73)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (6)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
forward_fake (749-768)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)
forward_fake (835-865)
tensorrt_llm/_torch/modules/fused_moe/interface.py (2)
forward_fake (503-520)
AlltoallMethodType (26-34)
tensorrt_llm/_torch/utils.py (2)
Fp4QuantizedTensor (125-132)
shape (131-132)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (2)
do_finalize (304-305)
top_k (275-275)
tensorrt_llm/_torch/models/modeling_qwen3_moe.py (1)
routing_method(67-77)
tests/unittest/_torch/modules/test_fused_moe.py (4)
tensorrt_llm/_torch/modules/fused_moe/routing.py (1)
DefaultMoeRoutingMethod (184-214)
tensorrt_llm/mapping.py (1)
Mapping (351-510)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
forward_fake (953-978)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
forward_fake(503-520)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
LlmRequest(437-662)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)
tensorrt_llm/llmapi/llm_args.py (4)
KvCacheConfig (1426-1570)
CudaGraphConfig (102-159)
MoeConfig (373-407)
MTPDecodingConfig (974-1025)
tensorrt_llm/quantization/mode.py (1)
QuantAlgo (23-47)
tests/integration/defs/accuracy/accuracy_core.py (3)
evaluate (184-247)
evaluate (868-878)
MMMU (386-403)
tests/unittest/_torch/sampler/test_trtllm_sampler.py (3)
tests/unittest/_torch/executor/test_overlap_scheduler.py (1)
create_llm (24-41)
tensorrt_llm/sampling_params.py (1)
SamplingParams (113-540)
tensorrt_llm/_torch/auto_deploy/shim/demollm.py (2)
shutdown (331-333)
stop (48-51)
🪛 LanguageTool
examples/sample_weight_stripping/README.md
[style] ~244-~244: Try using a synonym here to elevate your writing.
Context: ...hitecture/checkpoint.html). Since these make up the vast majority of weights, the prune...
(CONSTITUTE_COMPRISE)
🪛 Ruff (0.14.5)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
86-86: Unused method argument: kv_cache_config
(ARG002)
87-87: Unused method argument: head_dim
(ARG002)
88-88: Unused method argument: tokens_per_block
(ARG002)
89-89: Unused method argument: mapping
(ARG002)
90-90: Unused method argument: dtype
(ARG002)
91-91: Unused method argument: kv_factor
(ARG002)
92-92: Unused method argument: enforce_memory_limit
(ARG002)
tensorrt_llm/_torch/pyexecutor/resource_manager.py
705-705: Unused method argument: kv_factor
(ARG002)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
959-959: Unused method argument: output_dtype
(ARG002)
tests/unittest/_torch/modules/test_fused_moe.py
503-504: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
cpp/kernels/fmha_v2/setup.py
6408-6408: Avoid equality comparisons to False; use not kspec.cross_mha: for false checks
Replace with not kspec.cross_mha
(E712)
6409-6409: Avoid equality comparisons to True; use kspec.flash_attention: for truth checks
Replace with kspec.flash_attention
(E712)
tests/integration/defs/accuracy/test_llm_api_pytorch.py
4410-4410: Undefined name MMMU
(F821)
4420-4420: Undefined name MMMU
(F821)
tests/unittest/_torch/sampler/test_trtllm_sampler.py
29-29: Duplicate keyword argument "sampler_type"
(invalid-syntax)
bc921f9 to 29c1a0e Compare

/bot run --disable-fail-fast

195125d to 4e908f2 Compare

/bot run --disable-fail-fast

PR_Github #25249 [ run ] triggered by Bot. Commit:

PR_Github #25249 [ run ] completed with state
thorjohnsen left a comment
LGTM
4e908f2 to 0160972 Compare

/bot run --disable-fail-fast

PR_Github #25372 [ run ] triggered by Bot. Commit:
…coding_mtp (NVIDIA#8832) Signed-off-by: qgai <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
…A#8666) Signed-off-by: Balaram Buddharaju <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
…ist (NVIDIA#8908) Signed-off-by: Yan Chunwei <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
…VIDIA#8876) Signed-off-by: Junyi Xu <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
…ade to fallback to dynamo=False. (NVIDIA#8917) Signed-off-by: Simeng Liu <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
NVIDIA#8911) Signed-off-by: nv-guomingz <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
…DIA#8883) Signed-off-by: Jin Li <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
NVIDIA#8780) Signed-off-by: Jin Li <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Ivy Zhang <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
…md. (NVIDIA#8997) Signed-off-by: nv-guomingz <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
…ory not sufficient error (NVIDIA#8900) Signed-off-by: Wangshanshan <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
…and fix… (NVIDIA#9033) Signed-off-by: nv-guomingz <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
…ory (NVIDIA#9044) Signed-off-by: Vincent Zhang <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
…IDIA#9054) Signed-off-by: peaceh <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
…VIDIA#8903) Signed-off-by: Chuang Zhu <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
… for single stop token IDs only (NVIDIA#9014) Signed-off-by: Michal Guzek <[email protected]> Signed-off-by: Michal Guzek <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: leslie-fang25 <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
…st (NVIDIA#9158) Signed-off-by: Balaram Buddharaju <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Shunkang <[email protected]> Co-authored-by: Shunkang <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
…us_summary (NVIDIA#9201) Signed-off-by: qgai <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Wanli Jiang <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
…_speculative_decoding_mtp (NVIDIA#9092) Signed-off-by: qgai <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
…VIDIA#9223) Signed-off-by: junq <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
… large model weight loading out time (NVIDIA#9254) Signed-off-by: Wangshanshan <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
NVIDIA#9324) Signed-off-by: Junyi Xu <[email protected]> Signed-off-by: Mike Iovine <[email protected]> Signed-off-by: Mike Iovine <[email protected]>
8778b4f to 75a47a1 Compare

/bot run --disable-fail-fast

PR_Github #25581 [ run ] triggered by Bot. Commit:

PR_Github #25570 [ run ] completed with state
Description
PRs explicitly excluded in this round:
Test Coverage
N/A
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

kill : Kill all running builds associated with pull request.
skip
skip --comment COMMENT : Skip testing for latest commit on pull request.

--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline : Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests