
Conversation

@mikeiovine
Collaborator

@mikeiovine mikeiovine commented Nov 20, 2025

Description

PRs explicitly excluded in this round:

Test Coverage

N/A

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
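
For example, an invocation that runs just one stage with fail-fast disabled (using the illustrative stage name from the help text above) would be:

/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast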

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

Summary by CodeRabbit

  • New Features

    • Added persistent KV cache connector for cross-instance cache reuse
    • Expanded multimodal model support (Phi-4, Mistral-Small-3.1, Qwen2.5 VL)
    • Added FP8 quantized model deployment guidance
  • Bug Fixes

    • Improved memory management with configurable GPU memory limits for KV cache
    • Optimized stop token evaluation with early-exit for single-token stops
    • Enhanced CUDA memory handling during graph capture
  • Documentation

    • Updated hyperlinks and performance documentation references
    • Expanded multimodal model feature support matrix
    • Added quick-start examples for FP8-quantized models
  • Tests

    • Extended test timeouts for complex multi-GPU scenarios
    • Added new test coverage for Phi-4 multimodal fused vision configurations


@mikeiovine mikeiovine requested review from a team as code owners November 20, 2025 21:25
@mikeiovine mikeiovine changed the title from "Mass integrate 1.1" to "[None][chore] Weekly mass integration of release/1.1" on Nov 20, 2025
@coderabbitai
Contributor

coderabbitai bot commented Nov 20, 2025

📝 Walkthrough

This PR encompasses kernel optimization tuning (FMHA v2, Cutlass heuristics), KV cache memory management improvements with new persistent connector, PyTorch 2.9+ Dynamo compatibility fixes, memory profiling infrastructure, sampler enhancements, documentation updates, model support matrix expansions, and broad test coverage adjustments.

Changes

Kernel Optimizations
    Files: cpp/kernels/fmha_v2/setup.py, cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp, cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp
    Summary: FMHA v2 adds Gemma3 VL head_size 72 support; Cutlass heuristic reorders FP8 GROUPED_GEMM tile configs for SM89/120+; FMHA dispatcher excludes head sizes 72 and 80 from the TRTLLM-GEN path.

Memory Management & Profiling
    Files: cpp/tensorrt_llm/common/opUtils.cpp
    Summary: Introduces per-thread observer map lifecycle management with a destructor, adds memory profiling utilities (MemoryInfo, getMemoryInfo, logMemoryUsage), augments handle creation with memory logging and error context, and replaces raw new with smart pointers.

KV Cache Resource Management
    Files: tensorrt_llm/_torch/pyexecutor/resource_manager.py, tensorrt_llm/_torch/pyexecutor/py_executor_creator.py, tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
    Summary: Adds an enforce_memory_limit parameter to KVCacheManager and calculate_max_num_blocks, integrates garbage collection, and threads memory enforcement through the block calculation logic.

Persistent KV Cache Connector
    Files: examples/llm-api/llm_kv_cache_connector.py
    Summary: Implements PersistentKvCacheConnectorWorker and PersistentKvCacheConnectorLeader with a metadata holder for cross-instance KV cache reuse via disk persistence; replaces the placeholder with a functional connector demo.

Sampler Improvements
    Files: tensorrt_llm/_torch/pyexecutor/sampler.py
    Summary: Adds a new_token parameter to stop-token criteria, optimizes the single-token stop-word path with an early exit, and refactors multi-token handling.

IPC & MPI Infrastructure
    Files: tensorrt_llm/llmapi/mpi_session.py, tensorrt_llm/commands/serve.py
    Summary: Adds find_free_ipc_addr() and split_mpi_env() functions; replaces TCP-based port discovery with an IPC address for the disaggregated leader launcher.

PyTorch 2.9+ Compatibility
    Files: examples/models/contrib/dit/vae_decoder_trt.py, examples/models/core/qwenvl/vit_onnx_trt.py, tensorrt_llm/tools/multimodal_builder.py
    Summary: Adds dynamo=False to torch.onnx.export calls with comments explaining the PyTorch >= 2.9.0 Dynamo opset_version=17 incompatibility.

Documentation Updates
    Files: README.md, docs/source/blogs/*, docs/source/features/disagg-serving.md, docs/source/overview.md, examples/models/core/multimodal/README.md, examples/sample_weight_stripping/README.md
    Summary: Updates hyperlink references: TensorRT-LLM overview paths, performance docs versioning, Dynamo backends URL, NeVA toolkit links, and developer guide inclusion in the index.

Model Support & Feature Matrix
    Files: docs/source/models/supported-models.md, docs/source/legacy/reference/multimodal-feature-support-matrix.md
    Summary: Updates feature flags (KV Cache Reuse, Chunked Prefill) for LavaNext, Nemotron, Phi-4-multimodal, and Qwen2 VL variants; renames and consolidates multimodal entries.

Configuration & Scripts
    Files: examples/llm-api/extra-llm-api-config.yml, examples/llm-api/llm_mgmn_llm_distributed.sh
    Summary: Adds a YAML config with cuda_graph_config and moe_config; adds --max_batch_size 256 to the llm-api-launch invocation.

Quick-Start & API Docs
    Files: docs/source/quick-start-guide.md
    Summary: Adds FP8 model deployment guidance and an example trtllm-serve command for FP8-quantized models.

Disaggregated Test Updates
    Files: tests/integration/defs/disaggregated/test_disaggregated.py, tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py
    Summary: Removes the skip_warmup parameter from run_disaggregated_benchmark, adds free_gpu_memory_fraction=0.25 to KvCacheConfig in single-GPU tests.

Accuracy & Model Tests
    Files: tests/integration/defs/accuracy/test_llm_api_pytorch.py, tests/integration/defs/accuracy/references/mmmu.yaml, tests/integration/defs/accuracy/test_disaggregated_serving.py
    Summary: Adds free_gpu_memory_fraction to FP8 tests, introduces test_nvfp4_multi_gpus_sm120, adds a Phi-4-multimodal fused vision LoRA test class, increases the Qwen3 timeout to 3600s, and adds a Phi-4-multimodal MMMU reference accuracy.

End-to-End Tests
    Files: tests/integration/defs/test_e2e.py
    Summary: Reduces parameterization (removes match_ratio/modality), bypasses keyword validation (0.0 match_ratio), adds --kv_cache_fraction flags, adds early-exit paths for flaky models, and reorganizes multimodal variants.

Test Infrastructure & Lists
    Files: tests/integration/test_lists/*, tests/integration/test_lists/qa/*, tests/integration/test_lists/test-db/*, tests/integration/test_lists/waives.txt
    Summary: Updates timeouts, adds/removes test entries, removes SKIP markers, adjusts parameterization (removes the "-0.6-" suffix), and adds Phi-4 fused vision LoRA and SM120 tests.

Unit Tests
    Files: tests/unittest/_torch/modules/test_fused_moe.py, tests/unittest/_torch/sampler/test_trtllm_sampler.py, tests/unittest/llmapi/apps/openai_server.py
    Summary: Increases HIDDEN_SIZE to 4096 and refactors device binding in the MOE test; adds sampler factory functions with TRTLLMSampler/TorchSampler wrappers and stop-token tests; increases the RemoteOpenAIServer timeout from 600s to 7200s.

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant Leader as PersistentKvCacheConnectorLeader
    participant Worker as PersistentKvCacheConnectorWorker
    participant Disk as Disk Storage
    participant GPU as GPU Memory

    App->>Leader: Request KV cache connector
    Leader->>Leader: Compute block hashes
    
    rect rgb(220, 240, 255)
    Note over Leader: Generation 1
    App->>Worker: Register KV cache tensor
    Worker->>GPU: Hold tensor reference
    App->>Leader: New blocks to load
    Leader->>Disk: Query cached blocks
    Disk-->>Leader: Block data
    Leader->>Worker: Load blocks command
    Worker->>GPU: Load from disk → GPU
    App->>Leader: Blocks to save
    Leader->>Worker: Save blocks command
    Worker->>Disk: Write blocks to disk
    end
    
    rect rgb(240, 255, 220)
    Note over Leader: Generation 2 (cross-instance)
    App->>Worker: Register KV cache tensor
    App->>Leader: Load same prompt blocks
    Leader->>Disk: Query cached blocks
    Disk-->>Leader: Block data
    Leader->>Worker: Load blocks command
    Worker->>GPU: Load from disk → GPU
    Worker-->>App: Cache hit - fast reuse
    end
sequenceDiagram
    participant User as PyTorch Code
    participant Export as torch.onnx.export
    participant Dynamo as PyTorch Dynamo (≥2.9.0)
    participant ONNX as ONNX Exporter

    User->>Export: Call with dynamo=False, opset_version=17
    Export->>Dynamo: Dynamo disabled (default skip)
    Export->>ONNX: Use standard exporter
    ONNX-->>User: ✓ Successful export

    rect rgb(255, 240, 220)
    Note over User,ONNX: Previous behavior (issue)
    User->>Export: Call with opset_version=17 (no dynamo arg)
    Export->>Dynamo: Dynamo enabled (default in 2.9+)
    Dynamo-->>Export: ✗ Opset 17 incompatibility
    end
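The export call pattern shown in the diagram corresponds roughly to the following minimal sketch with a toy module; the model, shapes, and output path are illustrative assumptions, not code from the PR.

    import torch
    import torch.nn as nn

    model = nn.Linear(16, 8).eval()
    dummy_input = torch.randn(1, 16)

    # On PyTorch >= 2.9 the Dynamo-based exporter is the default; passing
    # dynamo=False selects the legacy TorchScript-based exporter, which still
    # accepts opset_version=17 as described in the PR comments.
    torch.onnx.export(
        model,
        (dummy_input,),
        "toy_model.onnx",
        opset_version=17,
        dynamo=False,
    )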

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Areas requiring extra attention:

  • Memory management in opUtils.cpp: Pointer-based observer map lifecycle, destructor cleanup, and error propagation with memory context require careful review of initialization, access patterns, and teardown correctness.
  • KV cache parameter threading: The enforce_memory_limit parameter propagation across multiple layers (KVCacheManager → calculate_max_num_blocks → block allocation logic) needs verification for consistency and correct memory-limit enforcement semantics.
  • Sampler optimization in sampler.py: The new fast-path for single-token stop words requires validation that the early-exit logic correctly handles edge cases and doesn't skip multi-token stop-word setup.
  • KV cache connector implementation: New persistent connector classes (PersistentKvCacheConnectorWorker/Leader) introduce cache serialization and cross-instance reuse logic that requires validation of correctness, file management, and block hashing.
  • Test parameterization reduction in test_e2e.py: Substantial simplification of multimodal test coverage and introduction of 0.0 match_ratio bypass warrants verification that smoke-test behavior is intentional and doesn't hide regressions.
  • Kernel tuning decisions: Head-size exclusions in FMHA dispatcher and tile reordering in Cutlass heuristic require domain knowledge to validate correctness and performance intent.

Possibly related PRs

Suggested reviewers

  • hchings
  • byshiue
  • niukuo
  • liji-nv
  • symphonylyh
  • yuxianq
  • govind-ramnarayan

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 24.14%, which is insufficient; the required threshold is 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

  • Description check (⚠️ Warning): The PR description is incomplete. It lists excluded PRs and references a mass integration, but does not explain the actual changes being integrated or their purpose. Resolution: add a clear description of what changes are being integrated in this mass PR, why they are being integrated, and what the overall objective is beyond just listing excluded PRs.

✅ Passed checks (1 passed)

  • Title check (✅ Passed): The title '[None][chore] Weekly mass integration of release/1.1' clearly summarizes the main change: a mass integration of changes from the 1.1 release branch into the main branch. It is specific, concise, and directly reflects the pull request's purpose.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tensorrt_llm/llmapi/mpi_session.py (1)

562-603: Remove duplicate MPI variable prefixes.

The list of MPI variable prefixes contains duplicates: 'OMPI_', 'PMIX_', 'PMI_', and 'MPI_' each appears twice (lines 580-589 and again in 584-588). This is inefficient and likely a copy-paste error.

Apply this diff to remove the duplicates:

     mpi_vars = set(
         itertools.chain([
             var for var in current_env if var.startswith((
-                'MPI_',
                 'OMPI_',
                 'PMIX_',
                 'PMI_',
-                'OMPI_',
-                'PMIX_',
-                'PMI_',
                 'SLURM_',
                 'MPI_',
                 'UCX_',
                 'I_MPI_',
                 'HYDRA_',
                 'KMP_',
                 'MPICH_',
                 'MV2_',
                 'CRAY_',
             ))
         ], mpi_env_keys or []))
tests/unittest/_torch/sampler/test_trtllm_sampler.py (1)

15-32: Fix duplicate sampler_type keyword and align helper docstring.

LLM(...) currently passes sampler_type twice:

        cuda_graph_config=CudaGraphConfig(),
        sampler_type=sampler_type,
        kv_cache_config=trt_kv_cache_config,
        sampler_type="TRTLLMSampler",

This is invalid Python syntax and prevents the test module from even importing. It also hard‑codes the TRTLLM sampler, defeating the new configurability.

You only need the dynamic sampler_type argument. While here, it’s clearer if create_llm’s docstring reflects that it’s about the sampler, not overlap scheduling.

 def _create_llm_base(model_dir, enable_trtllm_sampler):
     """Base LLM creation with configurable sampler."""
     sampler_type = "TRTLLMSampler" if enable_trtllm_sampler else "TorchSampler"
@@
     return LLM(
         model=str(model_dir),
         tensor_parallel_size=1,
         trust_remote_code=True,
         enable_chunked_prefill=True,
         cuda_graph_config=CudaGraphConfig(),
-        sampler_type=sampler_type,
-        kv_cache_config=trt_kv_cache_config,
-        sampler_type="TRTLLMSampler",
+        sampler_type=sampler_type,
+        kv_cache_config=trt_kv_cache_config,
         max_num_tokens=
         128  # Only one request longer than max_num_tokens is required to test chunked prefill
     )
@@
-def create_llm(model_dir):
-    """Create LLM with specific overlap scheduler setting"""
-    return _create_llm_base(model_dir, enable_trtllm_sampler=True)
+def create_llm(model_dir):
+    """Create LLM with TRTLLM sampler enabled."""
+    return _create_llm_base(model_dir, enable_trtllm_sampler=True)

Also applies to: 35-42

cpp/tensorrt_llm/common/opUtils.cpp (1)

213-236: Fix race in observer cleanup before dereferencing.

mObservers is nulled inside PerCudaCtxPerThreadSingletonCreator::~PerCudaCtxPerThreadSingletonCreator() while holding mMutex. In the deleter we check mObservers before locking, but if the destructor wins the race after that check yet before we acquire the lock, mObservers becomes nullptr and the subsequent mObservers->find(key) dereferences a null pointer, crashing during process teardown. Re-check inside the critical section (and bail early) before touching the map.

-                    std::lock_guard<std::mutex> lk{mMutex};
-                    // Must check observer again because another thread may created new instance for this ctx and this
-                    // thread just before we lock mMutex. We can't infer that the observer is stale from the fact that
-                    // obj is destroyed, because shared_ptr ref-count checking and observer removing are not in one
-                    // atomic operation, and the observer may be changed to observe another instance.
-                    auto it = mObservers->find(key);
+                    std::lock_guard<std::mutex> lk{mMutex};
+                    if (mObservers == nullptr)
+                    {
+                        return;
+                    }
+                    // Must check observer again because another thread may created new instance for this ctx and this
+                    // thread just before we lock mMutex. We can't infer that the observer is stale from the fact that
+                    // obj is destroyed, because shared_ptr ref-count checking and observer removing are not in one
+                    // atomic operation, and the observer may be changed to observe another instance.
+                    auto it = mObservers->find(key);
🧹 Nitpick comments (10)
tests/integration/test_lists/test-db/l0_sanity_check.yml (1)

28-28: Explicit timeout annotation added for speculative decoding test.

The change adds a 90-minute timeout annotation to test_llmapi_speculative_decoding_mtp. This is appropriate for tests that perform heavy computational work (e.g., speculative decoding with multi-token prediction).

Observation: The related speculative decoding tests on lines 29–30 (eagle3, ngram) do not have explicit timeout annotations. If these tests are similarly compute-intensive, consider adding timeouts to them as well for consistency.

tensorrt_llm/llmapi/mpi_session.py (1)

544-548: Remove redundant imports and consider platform compatibility.

The function has the following issues:

  1. os is already imported at line 3, so the import inside the function is redundant. The tempfile and uuid imports should be moved to the top of the file for consistency.
  2. Despite the name suggesting it finds a "free" address, the function only generates a UUID-based path without verifying availability or checking platform compatibility (ZMQ IPC is Unix-specific and may not work on Windows).

Apply this diff to remove redundant imports:

+import tempfile
+import uuid
+
 import zmq

 ...

 def find_free_ipc_addr() -> str:
-    import os
-    import tempfile
-    import uuid
     return f'ipc://{os.path.join(tempfile.gettempdir(), "rpc_" + str(uuid.uuid4()))}'

Consider adding a docstring that clarifies this function generates a unique IPC address path but doesn't verify platform compatibility or actual availability.
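
As a hedged illustration of how such an address is typically consumed, a ZMQ socket can bind directly to the generated ipc:// path; this pyzmq usage sketch is not code from the PR.

    import os
    import tempfile
    import uuid

    import zmq

    # Generate a uuid-based IPC address, mirroring what find_free_ipc_addr() returns.
    addr = f'ipc://{os.path.join(tempfile.gettempdir(), "rpc_" + str(uuid.uuid4()))}'

    # Bind a socket to it; the ZMQ IPC transport is only available on Unix-like systems.
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.REP)
    sock.bind(addr)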

tensorrt_llm/commands/serve.py (1)

702-706: Update or remove outdated TODO comment.

The TODO comment on line 702 mentions "Make the port allocation atomic," but the code has migrated from TCP ports to IPC addresses. UUID-based IPC address generation is already effectively atomic (collision probability is negligible), making this TODO comment obsolete.

Apply this diff to remove the outdated comment:

-    # This mimics the behavior of trtllm-llmapi-launch
-    # TODO: Make the port allocation atomic
+    # This mimics the behavior of trtllm-llmapi-launch with IPC-based communication
     free_ipc_addr = find_free_ipc_addr()
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)

953-978: Consider documenting the output_dtype override behavior.

The forward_fake method accepts an output_dtype parameter but always overrides it with torch.bfloat16 when calling the superclass. This mirrors the pattern in TRTLLMGenFusedMoE (see the related code at lines 834-864), suggesting it's a backend constraint, but it could be confusing for callers who pass a different dtype expecting it to be respected.

Consider either:

  1. Documenting this override behavior in a docstring or comment
  2. Removing the parameter from the signature if it's never used (though this would break API compatibility)
  3. Adding an assertion or warning when output_dtype is provided and differs from torch.bfloat16

Based on learnings

If this override is intentional due to backend limitations, you can add a brief comment:

 def forward_fake(
         self,
         x: Union[torch.Tensor, Fp4QuantizedTensor],
         router_logits: torch.Tensor,
         *,
         do_finalize: bool = True,
         output_dtype: Optional[torch.dtype] = None,
         all_rank_num_tokens: Optional[List[int]] = None,
         use_dp_padding: Optional[bool] = None,
         **kwargs,
     ) -> Union[torch.Tensor, List[torch.Tensor]]:
+        # WideEPMoE only supports bfloat16 output in forward_fake
         moe_output = super().forward_fake(
             x,
             router_logits,
             do_finalize=do_finalize,
             output_dtype=torch.bfloat16,
             all_rank_num_tokens=all_rank_num_tokens,
             use_dp_padding=use_dp_padding,
             **kwargs)
tests/unittest/_torch/modules/test_fused_moe.py (1)

501-504: Consider adding strict=True to zip for defensive programming.

The zip() call on line 503 unpacks tuples to map arguments to the executor. While the list comprehension ensures all tuples have the same length, adding strict=True (Python 3.10+) makes the intent explicit and provides early detection if the pattern changes in the future.

Apply this diff if the codebase targets Python 3.10+:

         results = executor.map(
             per_rank_test_fused_moe_alltoall,
-            *zip(*[(i, weights_world[i], x_list_world[i])
-                   for i in range(world_size)]))
+            *zip(*[(i, weights_world[i], x_list_world[i])
+                   for i in range(world_size)], strict=True))
tensorrt_llm/_torch/pyexecutor/sampler.py (1)

714-716: Consider using a set for faster membership check.

The fast path uses new_token in stop_words_list, which performs an O(n) linear search. Converting stop_words_list to a set before the check would provide O(1) lookup, especially beneficial when there are many single-token stop words.

         # Fast path: all stop words are single tokens
         if max_stop_word_length == 1:
-            return new_token in stop_words_list
+            return new_token in set(stop_words_list)
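
Note that building the set inside the check still costs O(n) per call; if this path is hot, the set can instead be precomputed once where the stop criteria are constructed. A minimal sketch with hypothetical names, assuming a flat list of single token ids rather than the sampler's actual data layout:

    # Hypothetical illustration; names are assumptions, not the sampler's real API.
    stop_words_list = [13, 198, 50256]  # example single-token stop ids
    single_token_stops = set(stop_words_list)  # built once, outside the per-token path

    def hits_single_token_stop(new_token: int) -> bool:
        return new_token in single_token_stops  # O(1) membership test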
tests/unittest/_torch/sampler/test_trtllm_sampler.py (2)

84-118: Clarify which sampler this stop‑token test is exercising.

test_trtllm_sampler_with_stop_token_ids calls create_llm_with_torch_sampler, so it actually runs with TorchSampler, despite the test name and the emphasis on a “fast path optimization” (which sounds like TRTLLM‑specific behavior).

If the intent is to validate the TRTLLM sampler’s stop‑token fast path, consider switching to create_llm(model_path). If instead you meant this to be a TorchSampler regression test, renaming the test (and optionally its docstring) would avoid confusion and better document coverage.

-    llm = create_llm_with_torch_sampler(model_path)
+    llm = create_llm(model_path)

Alternatively, keep the implementation but rename to something like test_torch_sampler_with_stop_token_ids if TorchSampler is the intended target.


120-149: Multi‑token stop‑word test looks good; consider stronger assertion if needed.

The TorchSampler multi‑token stop‑word test is well‑structured: it explicitly verifies that stop_string tokenizes to multiple tokens and asserts that the returned text is non‑empty and does not contain the stop string.

One limitation is that this doesn’t guarantee the stop condition was actually triggered (the model might simply never emit "\n\n"). If you need stronger coverage, you could additionally inspect token sequences or log/debug internal stop‑word hits, but that’s optional given current scope.

tests/integration/test_lists/qa/llm_function_nim.txt (1)

401-404: Updated multimodal quickstart node ids look sane—verify against test_e2e.py

The four test_ptp_quickstart_multimodal_* entries now use the ...-image variants without the older numeric suffixes, which aligns with the surrounding naming patterns.

Please double‑check that test_e2e.py defines these exact parametrized ids so the scheduler won’t point at stale node names.

tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

719-730: Logic correctly enforces memory limits during estimation.

The condition at line 721 now applies the min constraint when either free_gpu_memory_fraction is set OR enforce_memory_limit is True. This ensures that during KV cache estimation, computed memory limits are respected even if free_gpu_memory_fraction is None.

The warning message at lines 723-725 might be slightly misleading when enforce_memory_limit=True but free_gpu_memory_fraction=None, though in practice free_gpu_memory_fraction is typically set during estimation. Consider updating the warning message to reflect the new enforcement condition.
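
For reference, the enforcement condition described above can be summarized as the following standalone sketch; the names mirror the review text and are hypothetical, not the actual resource_manager.py code.

    from typing import Optional

    def cap_max_num_blocks(max_num_blocks: int,
                           memory_limited_blocks: int,
                           free_gpu_memory_fraction: Optional[float],
                           enforce_memory_limit: bool) -> int:
        # Hypothetical sketch: apply the memory-based cap when a fraction is
        # configured OR enforcement is explicitly requested.
        if free_gpu_memory_fraction is not None or enforce_memory_limit:
            return min(max_num_blocks, memory_limited_blocks)
        return max_num_blocks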

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f8dd526 and bc921f9.

📒 Files selected for processing (49)
  • README.md (1 hunks)
  • cpp/kernels/fmha_v2/setup.py (1 hunks)
  • cpp/tensorrt_llm/common/opUtils.cpp (5 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp (1 hunks)
  • docs/source/blogs/H100vsA100.md (1 hunks)
  • docs/source/blogs/H200launch.md (1 hunks)
  • docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md (1 hunks)
  • docs/source/features/disagg-serving.md (1 hunks)
  • docs/source/index.rst (1 hunks)
  • docs/source/legacy/reference/multimodal-feature-support-matrix.md (1 hunks)
  • docs/source/models/supported-models.md (1 hunks)
  • docs/source/overview.md (1 hunks)
  • docs/source/quick-start-guide.md (1 hunks)
  • examples/llm-api/extra-llm-api-config.yml (1 hunks)
  • examples/llm-api/llm_kv_cache_connector.py (5 hunks)
  • examples/llm-api/llm_mgmn_llm_distributed.sh (1 hunks)
  • examples/models/contrib/dit/vae_decoder_trt.py (1 hunks)
  • examples/models/core/multimodal/README.md (1 hunks)
  • examples/models/core/qwenvl/vit_onnx_trt.py (1 hunks)
  • examples/sample_weight_stripping/README.md (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py (4 hunks)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (2 hunks)
  • tensorrt_llm/commands/serve.py (2 hunks)
  • tensorrt_llm/llmapi/mpi_session.py (1 hunks)
  • tensorrt_llm/tools/multimodal_builder.py (1 hunks)
  • tests/integration/defs/accuracy/references/mmmu.yaml (1 hunks)
  • tests/integration/defs/accuracy/test_disaggregated_serving.py (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (4 hunks)
  • tests/integration/defs/disaggregated/test_disaggregated.py (4 hunks)
  • tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (1 hunks)
  • tests/integration/defs/test_e2e.py (10 hunks)
  • tests/integration/test_lists/qa/llm_function_core.txt (4 hunks)
  • tests/integration/test_lists/qa/llm_function_l20.txt (1 hunks)
  • tests/integration/test_lists/qa/llm_function_multinode.txt (1 hunks)
  • tests/integration/test_lists/qa/llm_function_nim.txt (2 hunks)
  • tests/integration/test_lists/qa/llm_function_rtx6k.txt (1 hunks)
  • tests/integration/test_lists/test-db/l0_a10.yml (1 hunks)
  • tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml (0 hunks)
  • tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
  • tests/integration/test_lists/test-db/l0_sanity_check.yml (1 hunks)
  • tests/integration/test_lists/waives.txt (0 hunks)
  • tests/unittest/_torch/modules/test_fused_moe.py (5 hunks)
  • tests/unittest/_torch/sampler/test_trtllm_sampler.py (3 hunks)
  • tests/unittest/llmapi/apps/openai_server.py (1 hunks)
💤 Files with no reviewable changes (2)
  • tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml
  • tests/integration/test_lists/waives.txt
🧰 Additional context used
🧠 Learnings (49)
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
Repo: NVIDIA/TensorRT-LLM PR: 7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • docs/source/features/disagg-serving.md
  • docs/source/blogs/H100vsA100.md
  • docs/source/quick-start-guide.md
  • docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md
  • README.md
  • docs/source/blogs/H200launch.md
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • docs/source/features/disagg-serving.md
  • docs/source/overview.md
  • README.md
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • docs/source/blogs/H100vsA100.md
  • docs/source/overview.md
  • README.md
  • tests/integration/test_lists/test-db/l0_a10.yml
  • tests/integration/test_lists/qa/llm_function_l20.txt
  • tests/integration/test_lists/test-db/l0_sanity_check.yml
  • tests/integration/test_lists/qa/llm_function_core.txt
  • tests/integration/test_lists/qa/llm_function_rtx6k.txt
  • tests/integration/test_lists/test-db/l0_h100.yml
  • tests/integration/test_lists/qa/llm_function_multinode.txt
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tests/unittest/_torch/sampler/test_trtllm_sampler.py
  • tests/integration/test_lists/qa/llm_function_nim.txt
  • tests/integration/defs/test_e2e.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/blogs/H100vsA100.md
  • docs/source/overview.md
  • docs/source/quick-start-guide.md
  • README.md
  • tests/integration/test_lists/test-db/l0_a10.yml
  • tests/integration/test_lists/qa/llm_function_l20.txt
  • tests/integration/test_lists/test-db/l0_sanity_check.yml
  • tests/integration/test_lists/qa/llm_function_core.txt
  • tests/integration/test_lists/qa/llm_function_rtx6k.txt
  • tests/integration/test_lists/test-db/l0_h100.yml
  • tests/integration/test_lists/qa/llm_function_multinode.txt
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tests/unittest/_torch/sampler/test_trtllm_sampler.py
  • tests/integration/test_lists/qa/llm_function_nim.txt
  • tests/integration/defs/test_e2e.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • docs/source/blogs/H100vsA100.md
  • docs/source/overview.md
  • README.md
  • tests/integration/test_lists/test-db/l0_a10.yml
  • docs/source/blogs/H200launch.md
  • tests/integration/test_lists/qa/llm_function_rtx6k.txt
  • tests/integration/test_lists/qa/llm_function_multinode.txt
  • tests/unittest/_torch/sampler/test_trtllm_sampler.py
  • tests/integration/test_lists/qa/llm_function_nim.txt
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
Repo: NVIDIA/TensorRT-LLM PR: 6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.

Applied to files:

  • docs/source/blogs/H100vsA100.md
  • docs/source/overview.md
  • README.md
  • docs/source/blogs/H200launch.md
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.

Applied to files:

  • docs/source/blogs/H100vsA100.md
  • cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
Repo: NVIDIA/TensorRT-LLM PR: 7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.

Applied to files:

  • docs/source/blogs/H100vsA100.md
📚 Learning: 2025-08-27T14:23:55.566Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.566Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.

Applied to files:

  • docs/source/overview.md
  • README.md
📚 Learning: 2025-09-18T05:41:45.847Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7120
File: tensorrt_llm/llmapi/llm.py:690-697
Timestamp: 2025-09-18T05:41:45.847Z
Learning: Kimi model support is currently focused on the PyTorch backend path, with TRT path support potentially coming later.

Applied to files:

  • docs/source/overview.md
📚 Learning: 2025-08-21T21:48:35.135Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.135Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.

Applied to files:

  • docs/source/overview.md
  • README.md
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • docs/source/overview.md
  • README.md
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.

Applied to files:

  • docs/source/overview.md
  • README.md
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md
📚 Learning: 2025-08-27T17:50:13.264Z
Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.

Applied to files:

  • README.md
📚 Learning: 2025-08-14T15:43:23.107Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.

Applied to files:

  • README.md
📚 Learning: 2025-09-16T09:30:09.716Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7763
File: cpp/tensorrt_llm/CMakeLists.txt:297-301
Timestamp: 2025-09-16T09:30:09.716Z
Learning: In the TensorRT-LLM project, NCCL libraries are loaded earlier by PyTorch libraries or the bindings library, so the main shared library doesn't need NCCL paths in its RPATH - the libraries will already be available in the process address space when needed.

Applied to files:

  • README.md
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • examples/llm-api/llm_mgmn_llm_distributed.sh
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp
📚 Learning: 2025-09-17T02:48:52.732Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7781
File: tests/integration/test_lists/waives.txt:313-313
Timestamp: 2025-09-17T02:48:52.732Z
Learning: In TensorRT-LLM, `tests/integration/test_lists/waives.txt` is specifically for waiving/skipping tests, while other test list files like those in `test-db/` and `qa/` directories are for different test execution contexts (pre-merge, post-merge, QA tests). The same test appearing in both waives.txt and execution list files is intentional - the test is part of test suites but will be skipped due to the waiver.

Applied to files:

  • tests/integration/test_lists/test-db/l0_a10.yml
  • tests/integration/test_lists/test-db/l0_sanity_check.yml
  • tests/integration/test_lists/qa/llm_function_core.txt
  • tests/integration/test_lists/qa/llm_function_rtx6k.txt
  • tests/integration/test_lists/test-db/l0_h100.yml
  • tests/integration/test_lists/qa/llm_function_multinode.txt
  • tests/integration/test_lists/qa/llm_function_nim.txt
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").

Applied to files:

  • tests/integration/test_lists/test-db/l0_a10.yml
  • tests/integration/test_lists/qa/llm_function_core.txt
  • tests/integration/test_lists/qa/llm_function_rtx6k.txt
  • tests/integration/test_lists/test-db/l0_h100.yml
  • tests/integration/test_lists/qa/llm_function_multinode.txt
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tests/unittest/_torch/sampler/test_trtllm_sampler.py
  • tests/integration/defs/test_e2e.py
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
Repo: NVIDIA/TensorRT-LLM PR: 7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.

Applied to files:

  • tests/integration/test_lists/test-db/l0_a10.yml
  • tests/integration/defs/test_e2e.py
📚 Learning: 2025-08-09T02:04:49.623Z
Learnt from: Fridah-nv
Repo: NVIDIA/TensorRT-LLM PR: 6760
File: tensorrt_llm/_torch/auto_deploy/models/quant_config_reader.py:81-98
Timestamp: 2025-08-09T02:04:49.623Z
Learning: In TensorRT-LLM's auto_deploy module, torch.dtype values in configuration dictionaries must be stored as string representations (e.g., "float16" instead of torch.float16) because OmegaConf.merge does not support torch.dtype types. These string representations are converted to actual torch.dtype objects in downstream code.

Applied to files:

  • tests/integration/test_lists/qa/llm_function_l20.txt
  • tests/integration/test_lists/qa/llm_function_core.txt
📚 Learning: 2025-10-20T17:09:21.560Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py:180-182
Timestamp: 2025-10-20T17:09:21.560Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py, the _gated_rmsnorm_replacement function does not need to cast the output of torch.ops.auto_deploy.torch_rmsnorm_gated back to the input dtype, even though the custom op returns fp32. The dtype handling is managed elsewhere or the fp32 output is acceptable for downstream consumers.

Applied to files:

  • tests/integration/test_lists/qa/llm_function_l20.txt
  • tests/integration/test_lists/qa/llm_function_core.txt
📚 Learning: 2025-08-13T11:07:11.772Z
Learnt from: Funatiq
Repo: NVIDIA/TensorRT-LLM PR: 6754
File: tests/integration/test_lists/test-db/l0_a30.yml:41-47
Timestamp: 2025-08-13T11:07:11.772Z
Learning: In TensorRT-LLM test configuration files like tests/integration/test_lists/test-db/l0_a30.yml, TIMEOUT values are specified in minutes, not seconds.

Applied to files:

  • tests/integration/test_lists/test-db/l0_sanity_check.yml
  • tests/integration/test_lists/qa/llm_function_core.txt
  • tests/integration/defs/accuracy/test_disaggregated_serving.py
  • tests/integration/test_lists/qa/llm_function_multinode.txt
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-08-06T03:47:16.802Z
Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid and distinct model family from Mistral AI, separate from their regular Mistral models. Ministral 8B is specifically designed for edge computing and on-device applications, released in October 2024. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".

Applied to files:

  • docs/source/legacy/reference/multimodal-feature-support-matrix.md
📚 Learning: 2025-10-22T06:53:47.017Z
Learnt from: xinhe-nv
Repo: NVIDIA/TensorRT-LLM PR: 8534
File: scripts/format_test_list.py:1-6
Timestamp: 2025-10-22T06:53:47.017Z
Learning: The file `scripts/format_test_list.py` in the TensorRT-LLM repository does not require the NVIDIA Apache-2.0 copyright header.

Applied to files:

  • tests/integration/test_lists/qa/llm_function_rtx6k.txt
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

  • tests/integration/test_lists/qa/llm_function_rtx6k.txt
  • cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-09-17T06:01:01.836Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7785
File: tests/integration/defs/perf/utils.py:321-333
Timestamp: 2025-09-17T06:01:01.836Z
Learning: In test infrastructure code for disaggregated serving tests, prefer logging errors and continuing execution rather than raising exceptions on timeout, to avoid disrupting test cleanup and causing cascading failures.

Applied to files:

  • tests/integration/defs/accuracy/test_disaggregated_serving.py
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.

Applied to files:

  • cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp
  • examples/llm-api/extra-llm-api-config.yml
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.

Applied to files:

  • cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.

Applied to files:

  • cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.

Applied to files:

  • cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tests/unittest/_torch/modules/test_fused_moe.py
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tests/unittest/_torch/modules/test_fused_moe.py
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
Repo: NVIDIA/TensorRT-LLM PR: 6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.

Applied to files:

  • cpp/tensorrt_llm/common/opUtils.cpp
📚 Learning: 2025-08-25T00:03:39.294Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1185-1189
Timestamp: 2025-08-25T00:03:39.294Z
Learning: TLLM_CHECK_WITH_INFO is a host-side utility function and cannot be called from CUDA device functions (those marked with __device__ or __global__). In device code, assert() is the primary mechanism for handling "should never happen" conditions, and like standard C++ assert, CUDA's assert only works in debug builds and is compiled out in release builds.

Applied to files:

  • cpp/tensorrt_llm/common/opUtils.cpp
📚 Learning: 2025-10-13T19:45:03.518Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: tests/unittest/_torch/multi_gpu/test_nccl_device.py:138-149
Timestamp: 2025-10-13T19:45:03.518Z
Learning: In test_nccl_device.py, the NCCL device AllReduce implementation compares the entire residual tensor on each rank, unlike the UB implementation which compares per-rank chunks. The residual chunking calculations in the test are intentionally overridden to reflect this design difference.

Applied to files:

  • tests/unittest/_torch/modules/test_fused_moe.py
📚 Learning: 2025-08-18T08:42:02.640Z
Learnt from: samuellees
Repo: NVIDIA/TensorRT-LLM PR: 6974
File: tensorrt_llm/serve/scripts/benchmark_dataset.py:558-566
Timestamp: 2025-08-18T08:42:02.640Z
Learning: In TensorRT-LLM's RandomDataset (tensorrt_llm/serve/scripts/benchmark_dataset.py), when using --random-token-ids option, sequence length accuracy is prioritized over semantic correctness for benchmarking purposes. The encode/decode operations should use skip_special_tokens=True and add_special_tokens=False to ensure exact target token lengths.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
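A short sketch of that encode/decode pattern using the Hugging Face tokenizer API; the tokenizer choice and prompt are illustrative, not the actual benchmark_dataset.py code:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative tokenizer

prompt_ids = tokenizer.encode("some random benchmark prompt", add_special_tokens=False)

# Decode without special tokens and re-encode without adding any, so the round
# trip does not inject BOS/EOS tokens that would shift the sequence length.
text = tokenizer.decode(prompt_ids, skip_special_tokens=True)
roundtrip_ids = tokenizer.encode(text, add_special_tokens=False)

# For benchmarking, hitting the exact target length matters more than semantics.
assert len(roundtrip_ids) == len(prompt_ids)
```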
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • examples/llm-api/extra-llm-api-config.yml
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • examples/llm-api/extra-llm-api-config.yml
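A hedged sketch of the fallback these two learnings call safe; `get_pytorch_perf_config` and the config values below are stand-ins for the real bench configuration code:

```python
from typing import Any, Dict

def get_pytorch_perf_config() -> Dict[str, Any]:
    # Stand-in: the real method returns self.pytorch_config, which can already
    # carry a default cuda_graph_config entry.
    return {"cuda_graph_config": {"enable_padding": True}}

llm_args: Dict[str, Any] = get_pytorch_perf_config()
# ... extra-options processing may or may not override the entry ...
cuda_graph_config = llm_args["cuda_graph_config"]  # safe: a default is already present
```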
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.

Applied to files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
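The arithmetic being described can be sketched with illustrative numbers; head counts in model_config are already per-TP-rank, so multiplying by tp_size recovers the full fused dimension:

```python
tp_size = 4
head_dim = 128
num_heads_per_rank = 8       # model_config.num_heads, already divided by tp_size
num_kv_heads_per_rank = 2    # model_config.num_kv_heads, already divided by tp_size

q_size = num_heads_per_rank * head_dim * tp_size      # 4096: full Q output dim
kv_size = num_kv_heads_per_rank * head_dim * tp_size  # 1024: full K (or V) output dim

part_sizes = [q_size, kv_size, kv_size]               # fused QKV split sizes
assert sum(part_sizes) == (num_heads_per_rank + 2 * num_kv_heads_per_rank) * head_dim * tp_size
```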
📚 Learning: 2025-08-27T15:03:57.149Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:368-392
Timestamp: 2025-08-27T15:03:57.149Z
Learning: In TensorRT-LLM's sampler.py, int32 usage for softmax_indices and related tensor indexing is intentional and should not be changed to int64. The torch.IntTensor type hint is correct for the sample() function's softmax_indices parameter.

Applied to files:

  • tests/unittest/_torch/sampler/test_trtllm_sampler.py
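A minimal sketch of that convention: `torch.IntTensor` is the 32-bit integer tensor type, and `index_select` accepts int32 indices, so no int64 upcast is needed (the function below is a stand-in, not the real sample()):

```python
import torch

def sample_stub(logits: torch.Tensor, softmax_indices: torch.IntTensor) -> torch.Tensor:
    # Stand-in for sample(); only illustrates the int32 indexing convention.
    return logits.index_select(0, softmax_indices)

logits = torch.randn(8, 32000)
softmax_indices = torch.tensor([0, 3, 5], dtype=torch.int32)  # int32 on purpose
selected = sample_stub(logits, softmax_indices)
```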
📚 Learning: 2025-08-22T01:54:35.850Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h:999-1000
Timestamp: 2025-08-22T01:54:35.850Z
Learning: The `internal_cutlass_kernels` directory in TensorRT-LLM is a mirror of an internal NVIDIA repository and maintains its own implementation and API that may diverge from the public `cutlass_kernels` version. API inconsistencies between these two directories are intentional and by design, not bugs to be fixed.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
📚 Learning: 2025-08-08T04:10:19.038Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:19.038Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.

Applied to files:

  • examples/llm-api/extra-llm-api-config.yml
🧬 Code graph analysis (9)
cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp (2)
tests/unittest/utils/util.py (1)
  • isSM100Family (98-100)
cpp/include/tensorrt_llm/common/cudaUtils.h (1)
  • isSM100Family (321-325)
tensorrt_llm/commands/serve.py (2)
tensorrt_llm/llmapi/mpi_session.py (1)
  • find_free_ipc_addr (544-548)
tensorrt_llm/executor/utils.py (1)
  • LlmLauncherEnvs (22-29)
examples/llm-api/llm_kv_cache_connector.py (1)
tensorrt_llm/llmapi/llm.py (2)
  • LLM (1101-1117)
  • generate (259-341)
tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (2)
tensorrt_llm/llmapi/llm_args.py (1)
  • KvCacheConfig (1426-1570)
cpp/tensorrt_llm/executor/kvCacheConfig.cpp (1)
  • KvCacheConfig (24-73)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (6)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
  • forward_fake (749-768)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)
  • forward_fake (835-865)
tensorrt_llm/_torch/modules/fused_moe/interface.py (2)
  • forward_fake (503-520)
  • AlltoallMethodType (26-34)
tensorrt_llm/_torch/utils.py (2)
  • Fp4QuantizedTensor (125-132)
  • shape (131-132)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (2)
  • do_finalize (304-305)
  • top_k (275-275)
tensorrt_llm/_torch/models/modeling_qwen3_moe.py (1)
  • routing_method (67-77)
tests/unittest/_torch/modules/test_fused_moe.py (4)
tensorrt_llm/_torch/modules/fused_moe/routing.py (1)
  • DefaultMoeRoutingMethod (184-214)
tensorrt_llm/mapping.py (1)
  • Mapping (351-510)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
  • forward_fake (953-978)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  • forward_fake (503-520)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
  • LlmRequest (437-662)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)
tensorrt_llm/llmapi/llm_args.py (4)
  • KvCacheConfig (1426-1570)
  • CudaGraphConfig (102-159)
  • MoeConfig (373-407)
  • MTPDecodingConfig (974-1025)
tensorrt_llm/quantization/mode.py (1)
  • QuantAlgo (23-47)
tests/integration/defs/accuracy/accuracy_core.py (3)
  • evaluate (184-247)
  • evaluate (868-878)
  • MMMU (386-403)
tests/unittest/_torch/sampler/test_trtllm_sampler.py (3)
tests/unittest/_torch/executor/test_overlap_scheduler.py (1)
  • create_llm (24-41)
tensorrt_llm/sampling_params.py (1)
  • SamplingParams (113-540)
tensorrt_llm/_torch/auto_deploy/shim/demollm.py (2)
  • shutdown (331-333)
  • stop (48-51)
🪛 LanguageTool
examples/sample_weight_stripping/README.md

[style] ~244-~244: Try using a synonym here to elevate your writing.
Context: ...hitecture/checkpoint.html). Since these make up the vast majority of weights, the prune...

(CONSTITUTE_COMPRISE)

🪛 Ruff (0.14.5)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py

86-86: Unused method argument: kv_cache_config

(ARG002)


87-87: Unused method argument: head_dim

(ARG002)


88-88: Unused method argument: tokens_per_block

(ARG002)


89-89: Unused method argument: mapping

(ARG002)


90-90: Unused method argument: dtype

(ARG002)


91-91: Unused method argument: kv_factor

(ARG002)


92-92: Unused method argument: enforce_memory_limit

(ARG002)

tensorrt_llm/_torch/pyexecutor/resource_manager.py

705-705: Unused method argument: kv_factor

(ARG002)

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

959-959: Unused method argument: output_dtype

(ARG002)

tests/unittest/_torch/modules/test_fused_moe.py

503-504: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)
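A hedged sketch of the fix Ruff (B905) is suggesting: pass strict= explicitly so length mismatches raise instead of silently truncating (the variable names are illustrative, not the test's actual code):

```python
routing_weights = [0.7, 0.2, 0.1]
expert_ids = [3, 17, 42]

# strict=True (Python 3.10+) raises ValueError if the iterables ever diverge
# in length, instead of silently dropping the tail of the longer one.
for weight, expert in zip(routing_weights, expert_ids, strict=True):
    print(expert, weight)
```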

cpp/kernels/fmha_v2/setup.py

6408-6408: Avoid equality comparisons to False; use not kspec.cross_mha: for false checks

Replace with not kspec.cross_mha

(E712)


6409-6409: Avoid equality comparisons to True; use kspec.flash_attention: for truth checks

Replace with kspec.flash_attention

(E712)
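And a sketch of the E712 fixes: test truthiness directly rather than comparing against True/False (kspec here is a stand-in object, not the real kernel spec):

```python
class KernelSpec:
    cross_mha = False
    flash_attention = True

kspec = KernelSpec()

# Flagged:   if kspec.cross_mha == False and kspec.flash_attention == True: ...
# Preferred, equivalent for plain booleans:
if not kspec.cross_mha and kspec.flash_attention:
    print("emit flash-attention kernel without cross-MHA")
```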

tests/integration/defs/accuracy/test_llm_api_pytorch.py

4410-4410: Undefined name MMMU

(F821)


4420-4420: Undefined name MMMU

(F821)

tests/unittest/_torch/sampler/test_trtllm_sampler.py

29-29: Duplicate keyword argument "sampler_type"

(invalid-syntax)

@mikeiovine
Copy link
Collaborator Author

/bot run --disable-fail-fast

@mikeiovine mikeiovine force-pushed the mass-integrate-1.1 branch 2 times, most recently from 195125d to 4e908f2 on November 20, 2025 at 21:50
@mikeiovine
Copy link
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #25249 [ run ] triggered by Bot. Commit: 4e908f2

@tensorrt-cicd
Copy link
Collaborator

PR_Github #25249 [ run ] completed with state ABORTED. Commit: 4e908f2
LLM/main/L0_MergeRequest_PR #19098 (Blue Ocean) completed with status: ABORTED

Copy link
Collaborator

@thorjohnsen thorjohnsen left a comment

LGTM

@JunyiXu-nv
Copy link
Collaborator

Hi Michael, please also exclude this one: #9324, since there is a standalone cherry-pick PR (#9346) created to try to resolve the CI issue.

Thanks!

@mikeiovine
Copy link
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #25372 [ run ] triggered by Bot. Commit: 0160972

sunnyqgg and others added 26 commits November 24, 2025 11:05
…coding_mtp (NVIDIA#8832)

Signed-off-by: qgai <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
…A#8666)

Signed-off-by: Balaram Buddharaju <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
…ist (NVIDIA#8908)

Signed-off-by: Yan Chunwei <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
…ade to fallback to dynamo=False. (NVIDIA#8917)

Signed-off-by: Simeng Liu <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Ivy Zhang <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
…md. (NVIDIA#8997)

Signed-off-by: nv-guomingz <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
…ory not sufficient error (NVIDIA#8900)

Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
…and fix… (NVIDIA#9033)

Signed-off-by: nv-guomingz <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
…ory (NVIDIA#9044)

Signed-off-by: Vincent Zhang <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
… for single stop token IDs only (NVIDIA#9014)

Signed-off-by: Michal Guzek <[email protected]>
Signed-off-by: Michal Guzek <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: leslie-fang25 <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
…st (NVIDIA#9158)

Signed-off-by: Balaram Buddharaju <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Shunkang <[email protected]>
Co-authored-by: Shunkang <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
…us_summary (NVIDIA#9201)

Signed-off-by: qgai <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Wanli Jiang <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
…_speculative_decoding_mtp (NVIDIA#9092)

Signed-off-by: qgai <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
… large model weight loading out time (NVIDIA#9254)

Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
@mikeiovine
Copy link
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #25581 [ run ] triggered by Bot. Commit: 75a47a1

@tensorrt-cicd
Copy link
Collaborator

PR_Github #25570 [ run ] completed with state ABORTED. Commit: 8778b4f
LLM/main/L0_MergeRequest_PR #19367 (Blue Ocean) completed with status: ABORTED
