
Conversation

@bobboli
Collaborator

@bobboli bobboli commented Nov 20, 2025

Rename: mnnvlthroughput -> nvlink_one_sided; mnnvllatency -> nvlink_two_sided.
(C++ namespace) mnnvl_throughput -> moe_comm.

Summary by CodeRabbit

  • Refactor
    • Reorganized internal namespace structure for MoE (Mixture of Experts) communication kernels.
    • Updated backend identification system with clearer naming conventions for improved consistency across the framework.


Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
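
For example, an illustrative invocation built from the options documented above (it reuses the example stage, GPU, and backend names shown in this help text; whether a given combination of flags is accepted depends on the pipeline configuration):

/bot run --disable-fail-fast --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp"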

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

@bobboli bobboli requested a review from a team as a code owner November 20, 2025 06:25
@bobboli bobboli requested a review from mikeiovine November 20, 2025 06:25
@bobboli bobboli changed the title [TRTLLM-9389][chore] Rename: mnnvlthroughput -> nvlink_one_sided; mnnvllatency -> nvlink_t… [TRTLLM-9389][chore] Rename AlltoAll backend names Nov 20, 2025
@bobboli
Collaborator Author

bobboli commented Nov 20, 2025

/bot run

@coderabbitai
Contributor

coderabbitai bot commented Nov 20, 2025

📝 Walkthrough

Walkthrough

The namespace tensorrt_llm::kernels::mnnvl_throughput is renamed to tensorrt_llm::kernels::moe_comm across the C++ kernel definitions and headers, and the MoE all-to-all backend identifiers in the Python modules are updated from the lowercase mnnvlthroughput/mnnvllatency to the uppercase NVLINK_ONE_SIDED/NVLINK_TWO_SIDED.

Changes

Cohort / File(s) Summary
C++ Namespace Renaming
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu, cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
Namespace updated from tensorrt_llm::kernels::mnnvl_throughput to tensorrt_llm::kernels::moe_comm in kernel implementation and header declarations.
C++ PyTorch Bindings
cpp/tensorrt_llm/nanobind/thop/bindings.cpp, cpp/tensorrt_llm/pybind/thop/bindings.cpp
Updated MoE A2A metadata export iteration to source from torch_ext::moe_comm namespace instead of torch_ext::mnnvl_throughput.
C++ MoE A2A Operations
cpp/tensorrt_llm/thop/moeAlltoAllMeta.h, cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
Replaced all namespace references from mnnvl_throughput to moe_comm in type declarations (PayloadDescriptor, MoeA2ADispatchParams), kernel launchers, constants (kMaxTopK, kMaxPayloads), and PyTorch module bindings.
Python FusedMoE Backend Names
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py, tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py, tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
Replaced backend string literals from lowercase mnnvlthroughput/mnnvllatency to uppercase NVLINK_ONE_SIDED/NVLINK_TWO_SIDED; updated environment variable parsing to use .upper() instead of .lower(); adjusted default backend selection and conditional logic.
Python Documentation
tensorrt_llm/_torch/modules/fused_moe/interface.py
Updated docstring reference from MnnvlLatency to NVLINK_TWO_SIDED for ignore_allreduce backend support.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Areas requiring extra attention:
    • Verify all namespace references have been consistently updated across C++ and PyTorch bindings—ensure no stray references to mnnvl_throughput remain
    • Confirm the backend string identifier replacements in Python are consistent across all conditional branches and initialization paths (check for the potential typo NVLINK_ONE_SIDEDz mentioned in the summary)
    • Validate that getMoeA2AMetaInfoIndexPairs() function exists in the new moe_comm namespace and is properly exported
    • Review the environment variable parsing change from .lower() to .upper() for backward-compatibility implications; see the sketch below
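
As a concrete illustration of that last point, here is a minimal Python sketch of the normalization pattern. The environment variable name and the standalone helper are hypothetical (the real attribute is the moe_alltoall_backend property on the FusedMoE modules, and the actual env var is not named in this walkthrough); the backend identifiers and the .strip().upper() normalization come from the changes above, and the default shown here is the one used by fused_moe_wide_ep.py.

import os

# Hypothetical variable name; the actual env var is not named in this walkthrough.
_ENV_VAR = "EXAMPLE_MOE_ALLTOALL_BACKEND"
_VALID_BACKENDS = ("NVLINK_ONE_SIDED", "NVLINK_TWO_SIDED")

def moe_alltoall_backend(default: str = "NVLINK_TWO_SIDED") -> str:
    # .strip().upper() makes the new identifiers case-insensitive
    # (e.g. "nvlink_two_sided" still normalizes), but the old lowercase
    # mnnvlthroughput / mnnvllatency literals are no longer recognized.
    backend = os.environ.get(_ENV_VAR, default).strip().upper()
    if backend not in _VALID_BACKENDS:
        raise ValueError(f"Unknown MoE all-to-all backend: {backend}")
    return backend

Under this scheme, any configuration still exporting one of the old lowercase names would need to switch to the new identifiers, which is exactly the backward-compatibility question raised above.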

Possibly related PRs

Suggested reviewers

  • yilin-void
  • dongxuy04
  • liji-nv

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Description check (⚠️ Warning): The PR description is incomplete. While it mentions the renaming objectives, required sections such as 'Description' and 'Test Coverage' contain only template placeholders with no actual content. Resolution: fill in the Description section explaining the rationale for the rename and list the relevant tests in the Test Coverage section.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 29.17%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
  • Title check (✅ Passed): The title clearly identifies the main change as a comprehensive renaming of AlltoAll backend names, which matches the changeset scope.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 79a6c97 and 074e97b.

📒 Files selected for processing (10)
  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (2 hunks)
  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h (2 hunks)
  • cpp/tensorrt_llm/nanobind/thop/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/thop/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/thop/moeAlltoAllMeta.h (2 hunks)
  • cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (7 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (8 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (8 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/interface.py (1 hunks)
🧰 Additional context used
🧠 Learnings (29)
📓 Common learnings
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Learnt from: ChristinaZ
Repo: NVIDIA/TensorRT-LLM PR: 7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.

Applied to files:

  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
  • cpp/tensorrt_llm/thop/moeAlltoAllMeta.h
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
  • cpp/tensorrt_llm/thop/moeAlltoAllMeta.h
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
  • tensorrt_llm/_torch/modules/fused_moe/interface.py
  • cpp/tensorrt_llm/pybind/thop/bindings.cpp
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-09-02T13:42:44.885Z
Learnt from: pcastonguay
Repo: NVIDIA/TensorRT-LLM PR: 7455
File: tensorrt_llm/_torch/pyexecutor/py_executor.py:1852-1860
Timestamp: 2025-09-02T13:42:44.885Z
Learning: In MPI communication within TensorRT-LLM pipeline parallelism, different communication types (tokens, logits, termination sync) must use disjoint tag namespaces to avoid message routing collisions when using the same source/destination patterns.

Applied to files:

  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
  • cpp/tensorrt_llm/thop/moeAlltoAllMeta.h
  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
  • cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
Repo: NVIDIA/TensorRT-LLM PR: 7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.

Applied to files:

  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/config.cu), std::ostringstream is used but <sstream> doesn't need to be explicitly included because it's provided transitively through other headers like tensorrt_llm/common/cudaUtils.h or config.h. Local compilation testing confirms this works without the explicit include.

Applied to files:

  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
  • cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.

Applied to files:

  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.

Applied to files:

  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.

Applied to files:

  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
  • cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.

Applied to files:

  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-08-21T02:41:10.565Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h:141-145
Timestamp: 2025-08-21T02:41:10.565Z
Learning: In TensorRT-LLM MOE GEMM kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h), the stride_act and stride_weight pointers in TmaWarpSpecializedGroupedGemmInput are intentionally declared as void* rather than typed pointers because the actual stride type is determined at runtime based on factors like the swap_ab flag and layout decisions. This runtime type determination makes compile-time type safety impossible, so void* is the correct approach.

Applied to files:

  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
  • cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/thop/moeAlltoAllMeta.h
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
  • tensorrt_llm/_torch/modules/fused_moe/interface.py
  • cpp/tensorrt_llm/pybind/thop/bindings.cpp
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • cpp/tensorrt_llm/pybind/thop/bindings.cpp
  • cpp/tensorrt_llm/nanobind/thop/bindings.cpp
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-08-21T21:48:35.135Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.135Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-08-22T01:54:35.850Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h:999-1000
Timestamp: 2025-08-22T01:54:35.850Z
Learning: The `internal_cutlass_kernels` directory in TensorRT-LLM is a mirror of an internal NVIDIA repository and maintains its own implementation and API that may diverge from the public `cutlass_kernels` version. API inconsistencies between these two directories are intentional and by design, not bugs to be fixed.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/modules/fused_moe/interface.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
  • tensorrt_llm/_torch/modules/fused_moe/interface.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/interface.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
  • cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
Repo: NVIDIA/TensorRT-LLM PR: 7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/interface.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-08-17T15:07:01.420Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 6968
File: cpp/tensorrt_llm/thop/loraOp.cpp:133-141
Timestamp: 2025-08-17T15:07:01.420Z
Learning: In TensorRT-LLM's LoRA implementation, the LoraImpl::run() method handles setStream() internally in _runGemm(), along with setWorkspace(). Both stream and workspace are passed as arguments to run(), so there's no need to call setStream() explicitly in loraOp.cpp - this avoids redundancy and follows the intended architectural separation.

Applied to files:

  • cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-08-17T15:07:01.420Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 6968
File: cpp/tensorrt_llm/thop/loraOp.cpp:133-141
Timestamp: 2025-08-17T15:07:01.420Z
Learning: In TensorRT-LLM's LoRA implementation, the LoraImpl::run() method handles setStream() internally in _runGemm() (line 51 in lora.cpp), along with setWorkspace(). The stream parameter flows from loraOp.cpp through LoraImpl::run() to _runGemm() where setStream() is called appropriately. Adding setStream() in loraOp.cpp would be redundant and goes against the intended architectural design.

Applied to files:

  • cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-09-16T09:30:09.716Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7763
File: cpp/tensorrt_llm/CMakeLists.txt:297-301
Timestamp: 2025-09-16T09:30:09.716Z
Learning: In the TensorRT-LLM project, NCCL libraries are loaded earlier by PyTorch libraries or the bindings library, so the main shared library doesn't need NCCL paths in its RPATH - the libraries will already be available in the process address space when needed.

Applied to files:

  • cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
🧬 Code graph analysis (5)
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h (2)
cpp/tensorrt_llm/common/envUtils.h (1)
  • tensorrt_llm (25-156)
cpp/tensorrt_llm/thop/moeAlltoAllMeta.h (1)
  • moe_comm (26-62)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (4)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (2)
  • moe_alltoall_backend (200-203)
  • enable_alltoall (194-197)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
  • moe_alltoall_backend (250-253)
  • enable_alltoall (244-247)
tensorrt_llm/_mnnvl_utils.py (5)
  • MnnvlMemory (53-338)
  • initialize (91-100)
  • MnnvlMoe (352-624)
  • get_moe_workspaces (360-376)
  • get_moe_prepare_workspace (379-390)
tensorrt_llm/_torch/modules/fused_moe/interface.py (2)
  • enable_alltoall (619-622)
  • AlltoallMethodType (26-34)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (3)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2)
  • moe_alltoall_backend (255-258)
  • enable_alltoall (249-252)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
  • moe_alltoall_backend (250-253)
  • enable_alltoall (244-247)
tensorrt_llm/_torch/modules/fused_moe/interface.py (2)
  • enable_alltoall (619-622)
  • AlltoallMethodType (26-34)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2)
  • enable_alltoall (249-252)
  • moe_alltoall_backend (255-258)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (2)
  • enable_alltoall (194-197)
  • moe_alltoall_backend (200-203)
tensorrt_llm/_torch/modules/fused_moe/interface.py (2)
  • enable_alltoall (619-622)
  • AlltoallMethodType (26-34)
cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (1)
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (6)
  • moe_a2a_dispatch_launch (506-571)
  • moe_a2a_dispatch_launch (506-506)
  • moe_a2a_combine_launch (880-933)
  • moe_a2a_combine_launch (880-880)
  • moe_a2a_sanitize_expert_ids_launch (957-965)
  • moe_a2a_sanitize_expert_ids_launch (957-958)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (14)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)

349-350: Docstring correctly updated to new backend name

The ignore_allreduce docstring now references NVLINK_TWO_SIDED, matching the new backend identifier and surrounding code. No further changes needed.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)

249-253: Backend rename and load‑balancer condition look consistent

  • moe_alltoall_backend now normalizes to uppercase and defaults to "NVLINK_TWO_SIDED", matching the comment.
  • The ignore_allreduce guard now checks backend == "NVLINK_TWO_SIDED", which aligns with the renamed low‑latency path.

These changes preserve the prior semantics under the new backend naming.

Also applies to: 439-442

tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)

122-143: Backend renames and NVLINK_ONE_SIDED/TWO_SIDED handling look correct

  • MNNVL init now cleanly splits NVLINK_TWO_SIDED (C++ moe_comm path via MnnvlMoe) vs NVLINK_ONE_SIDED (Python MoeAlltoAll path) using uppercase strings consistent with moe_alltoall_backend.
  • moe_alltoall_backend normalizes env input with .strip().upper() and defaults to "NVLINK_ONE_SIDED", matching the comment and call sites.
  • ignore_allreduce for the load balancer is gated only for the NVLINK_TWO_SIDED MNNVL path, consistent with WideEPMoE/Cutlass semantics.
  • The NVLINK_ONE_SIDED path correctly:
    • Uses MoeAlltoAll.dispatch/combine,
    • Optionally allocates a workspace‑backed moe_output for the w4a8_mxfp4_mxfp8 case, and
    • Branches on use_workspace_output to decide how to pass payloads into combine.

These changes align with the PR’s renaming goals without introducing new behavior issues in this file.

Also applies to: 199-203, 365-371, 397-487, 510-517, 776-811

cpp/tensorrt_llm/thop/moeAlltoAllMeta.h (1)

24-27: Namespace rename to moe_comm is consistent with the new moe A2A APIs

The enclosing namespace has been updated from the old MNNVL‑specific name to moe_comm, and the closing comment matches. All internal declarations remain intact, so this is a straightforward, backward‑compatible namespace refactor from this header’s perspective.

Also applies to: 64-65

cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (1)

26-27: Namespace rename in kernels source is consistent

mnnvl_throughput -> moe_comm here matches the header and thop-side changes; no behavioral impact.

Also applies to: 967-967

cpp/tensorrt_llm/pybind/thop/bindings.cpp (1)

33-36: PyBind MoE A2A constants now sourced from moe_comm

Switching to torch_ext::moe_comm::getMoeA2AMetaInfoIndexPairs() aligns this binding with the new namespace, with unchanged export behavior.

cpp/tensorrt_llm/nanobind/thop/bindings.cpp (1)

33-36: NanoBind MoE A2A constants aligned with moe_comm

The binding now reads index pairs from torch_ext::moe_comm, matching the refactored meta header; behavior is preserved.

cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h (1)

22-23: Kernel header namespace matches implementation

Placing all MoE A2A types and launches under tensorrt_llm::kernels::moe_comm keeps the API coherent with the .cu file and thop callers.

Also applies to: 180-180

cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (6)

31-39: thop-side namespace now clearly scoped as torch_ext::moe_comm

Renaming the inner namespace to moe_comm and updating the closing brace keeps all thop MoE A2A ops grouped under torch_ext::moe_comm, matching the new kernel namespace usage.

Also applies to: 511-513


79-88: Offset calculation now uses moe_comm::kMaxTopK

Referencing tensorrt_llm::kernels::moe_comm::kMaxTopK in calculateOffsets keeps the metainfo layout in sync with the kernel’s compile-time top‑k limit; arithmetic and types remain unchanged.


347-350: Combine op aliases moved to moe_comm kernels

Using MoeA2ACombineParams, moe_a2a_combine_launch, and kMaxTopK from tensorrt_llm::kernels::moe_comm keeps the combine path aligned with the refactored kernel namespace.


477-479: Sanitize expert IDs launch wired to moe_comm kernel function

moe_a2a_sanitize_expert_ids_launch is now called from tensorrt_llm::kernels::moe_comm, matching the new kernel namespace without altering runtime behavior.


543-547: Torch CUDA impl registrations updated to torch_ext::moe_comm

All MoE A2A Torch ops (dispatch, combine, initialize, sanitize_expert_ids, get_combine_payload_tensor) are now registered against torch_ext::moe_comm::*, consistent with the namespace change above.


168-173: Verified: MOE All-to-All dispatch namespaces correctly updated

The verification confirms that the using declarations at lines 168-173 in cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp are correctly pulling from tensorrt_llm::kernels::moe_comm. No stray mnnvl* identifiers exist in the file, and all related function calls use the updated namespace consistently. The rename is complete.




@tensorrt-cicd
Collaborator

PR_Github #25160 [ run ] triggered by Bot. Commit: 074e97b

@bobboli
Collaborator Author

bobboli commented Nov 20, 2025

/bot kill

@tensorrt-cicd
Collaborator

PR_Github #25163 [ kill ] triggered by Bot. Commit: 074e97b

@tensorrt-cicd
Collaborator

PR_Github #25160 [ run ] completed with state ABORTED. Commit: 074e97b
LLM/main/L0_MergeRequest_PR #19022 (Blue Ocean) completed with status: ABORTED

@tensorrt-cicd
Collaborator

PR_Github #25163 [ kill ] completed with state SUCCESS. Commit: 074e97b
Successfully killed previous jobs for commit 074e97b

@bobboli
Collaborator Author

bobboli commented Nov 20, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #25170 [ run ] triggered by Bot. Commit: f5b86ad

@bobboli bobboli enabled auto-merge (squash) November 20, 2025 08:02
@tensorrt-cicd
Collaborator

PR_Github #25170 [ run ] completed with state SUCCESS. Commit: f5b86ad
/LLM/main/L0_MergeRequest_PR pipeline #19031 completed with status: 'FAILURE'

@bobboli
Collaborator Author

bobboli commented Nov 20, 2025

/bot run --reuse-test

@tensorrt-cicd
Collaborator

PR_Github #25193 [ run ] triggered by Bot. Commit: f5b86ad

@tensorrt-cicd
Collaborator

PR_Github #25193 [ run ] completed with state SUCCESS. Commit: f5b86ad
/LLM/main/L0_MergeRequest_PR pipeline #19049 completed with status: 'FAILURE'

@bobboli
Collaborator Author

bobboli commented Nov 20, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #25219 [ run ] triggered by Bot. Commit: f5b86ad

@tensorrt-cicd
Collaborator

PR_Github #25219 [ run ] completed with state FAILURE. Commit: f5b86ad
/LLM/main/L0_MergeRequest_PR pipeline #19074 completed with status: 'ABORTED'

@bobboli
Collaborator Author

bobboli commented Nov 21, 2025

/bot run --reuse-test

@tensorrt-cicd
Collaborator

PR_Github #25275 [ run ] triggered by Bot. Commit: f5b86ad

@tensorrt-cicd
Collaborator

PR_Github #25275 [ run ] completed with state SUCCESS. Commit: f5b86ad
/LLM/main/L0_MergeRequest_PR pipeline #19120 completed with status: 'FAILURE'

@bobboli
Collaborator Author

bobboli commented Nov 21, 2025

/bot run --reuse-test

@tensorrt-cicd
Collaborator

PR_Github #25292 [ run ] triggered by Bot. Commit: f5b86ad

@tensorrt-cicd
Collaborator

PR_Github #25292 [ run ] completed with state SUCCESS. Commit: f5b86ad
/LLM/main/L0_MergeRequest_PR pipeline #19134 completed with status: 'FAILURE'

@bobboli
Collaborator Author

bobboli commented Nov 21, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #25308 [ run ] triggered by Bot. Commit: 68375ed

@tensorrt-cicd
Collaborator

PR_Github #25308 [ run ] completed with state FAILURE. Commit: 68375ed
/LLM/main/L0_MergeRequest_PR pipeline #19146 completed with status: 'FAILURE'

@bobboli
Collaborator Author

bobboli commented Nov 21, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #25339 [ run ] triggered by Bot. Commit: 68375ed

@tensorrt-cicd
Collaborator

PR_Github #25339 [ run ] completed with state SUCCESS. Commit: 68375ed
/LLM/main/L0_MergeRequest_PR pipeline #19166 completed with status: 'FAILURE'

@bobboli
Collaborator Author

bobboli commented Nov 21, 2025

/bot run --reuse-test

@tensorrt-cicd
Collaborator

PR_Github #25370 [ run ] triggered by Bot. Commit: 68375ed

@tensorrt-cicd
Collaborator

PR_Github #25370 [ run ] completed with state SUCCESS. Commit: 68375ed
/LLM/main/L0_MergeRequest_PR pipeline #19189 completed with status: 'FAILURE'

@bobboli
Collaborator Author

bobboli commented Nov 22, 2025

/bot run --reuse-test

1 similar comment
@bobboli
Collaborator Author

bobboli commented Nov 22, 2025

/bot run --reuse-test

@tensorrt-cicd
Collaborator

PR_Github #25428 [ run ] triggered by Bot. Commit: 68375ed

@tensorrt-cicd
Collaborator

PR_Github #25428 [ run ] completed with state SUCCESS. Commit: 68375ed
/LLM/main/L0_MergeRequest_PR pipeline #19242 completed with status: 'FAILURE'

@bobboli
Collaborator Author

bobboli commented Nov 23, 2025

/bot run --reuse-test

@tensorrt-cicd
Collaborator

PR_Github #25437 [ run ] triggered by Bot. Commit: 68375ed

@tensorrt-cicd
Collaborator

PR_Github #25437 [ run ] completed with state SUCCESS. Commit: 68375ed
/LLM/main/L0_MergeRequest_PR pipeline #19251 completed with status: 'FAILURE'

…wo_sided; (namespace) mnnvl_throughput -> moe_comm.

Signed-off-by: Bo Li <[email protected]>
Signed-off-by: Bo Li <[email protected]>
Signed-off-by: Bo Li <[email protected]>
@bobboli
Collaborator Author

bobboli commented Nov 23, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #25456 [ run ] triggered by Bot. Commit: c5645fd

@tensorrt-cicd
Collaborator

PR_Github #25456 [ run ] completed with state SUCCESS. Commit: c5645fd
/LLM/main/L0_MergeRequest_PR pipeline #19270 completed with status: 'SUCCESS'
Pipeline passed with automatically retried tests. Check the rerun report for details.

@bobboli bobboli merged commit fcfec93 into NVIDIA:main Nov 23, 2025
4 of 5 checks passed