[None][feat] AutoDeploy: Remove redundant copies in mamba layers #9461
Conversation
Signed-off-by: Chenghao Zhang <[email protected]>
Signed-off-by: Chenghao Zhang <[email protected]>
📝 Walkthrough

The changes refactor a CUDA cached causal convolution operation from tensor-returning to in-place modification semantics. A wrapper function is introduced to maintain backward compatibility, while fusion logic is updated to use the new wrapper instead of the raw operator. Output tensor assembly logic in the Triton backend is also adjusted to conditionally construct results from prefill and decode paths.
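A hedged sketch of the new calling contract, mirroring the wrapper this PR introduces (the op's full argument list is elided as `*args, **kwargs`):

```python
import torch


def cuda_cached_causal_conv1d_wrapper(input, *args, **kwargs):
    # The underlying custom op now writes its result into `input` and returns None.
    torch.ops.auto_deploy.cuda_cached_causal_conv1d(input, *args, **kwargs)
    # Returning the mutated input preserves a value-producing interface for the
    # fusion pass without allocating or copying a fresh output tensor.
    return input
```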
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (2 warnings), ✅ Passed checks (1 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py (1)
207-216: The decode path has a critical bug: the return value of `causal_conv1d_update` is not captured.

The function `causal_conv1d_update` returns the convolution output (shape matching `x_decode`), but the code at lines 207-216 discards this return value. The function does not modify `x_decode` in place; instead, it copies `x_decode` into `conv_state` and returns the computed output. Without capturing the return value, the decode output is lost entirely. The call should be:

```python
x_decode = causal_conv1d_update(
    x_decode,  # [batch, dim]
    conv_state_cache,
    w2d,
    bias,
    activation=activation,
    cache_seqlens=None,
    conv_state_indices=slot_idx[num_prefill:].to(torch.int32),
    pad_slot_id=PAD_SLOT_ID,
)
```
🧹 Nitpick comments (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py (1)
245-247: Consider adding a docstring to the wrapper function.

The wrapper function correctly calls the in-place op and returns the input, maintaining backward compatibility. However, adding a docstring would improve clarity:

```diff
 def cuda_cached_causal_conv1d_wrapper(input, *args, **kwargs):
+    """Wrapper for cuda_cached_causal_conv1d that returns the modified input.
+
+    The underlying op modifies input in-place; this wrapper provides
+    a functional interface for backward compatibility.
+    """
     torch.ops.auto_deploy.cuda_cached_causal_conv1d(input, *args, **kwargs)
     return input
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py (7 hunks)
- tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py (4 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use `from package.subpackage import foo` and then `foo.SomeClass()` instead of `from package.subpackage.foo import SomeClass`)
Python filenames should use snake_case (e.g., `some_file.py`)
Python class names should use PascalCase (e.g., `class SomeClass`)
Python function and method names should use snake_case (e.g., `def my_awesome_function():`)
Python local variable names should use snake_case, with prefix `k` for variable names that start with a number (e.g., `k_99th_percentile = ...`)
Python global variables should use upper snake_case with prefix `G` (e.g., `G_MY_GLOBAL = ...`)
Python constants should use upper snake_case (e.g., `MY_CONSTANT = ...`)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., `self.x = 5` followed by `"""<type>: Description of 'x'"""`)
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic
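As an illustration of the namespace-import and narrow try/except guidelines above (module and function names here are chosen for the example, not taken from the repository):

```python
from os import path  # namespace kept: call path.join(), not a bare join()


def to_int(value):
    """Convert value to int, accepting duck-typed inputs that expose .item()."""
    try:
        candidate = value.item()  # keep the try body as small as possible
    except AttributeError:        # catch only the specific expected error
        return int(value)
    else:
        return int(candidate)     # main logic lives in the else block


print(path.join("logs", "run0"), to_int(3.7))
```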
Files:
- tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py
- tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py
- tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top
Files:
- tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py
- tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py
- tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.
Applied to files:
- tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py
- tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py
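A minimal sketch of the naming guidance in this learning; the capability set and the two GEMM helpers are assumptions for illustration only:

```python
import torch

# Stand-ins for the two kernel paths; purely illustrative.
cuda_core_gemm = torch.matmul
cublas_gemm = torch.matmul


def gemm_dispatch(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    major, minor = torch.cuda.get_device_capability()
    # Descriptive name: it states *which* architectures are targeted,
    # rather than a generic toggle such as `enable_cuda_core`.
    is_blackwell_geforce_or_ada = (major, minor) in {(12, 0), (8, 9)}  # assumed set
    return cuda_core_gemm(a, b) if is_blackwell_geforce_or_ada else cublas_gemm(a, b)
```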
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.
Applied to files:
- tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py
- tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py
- tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py (2)
- tests/unittest/_torch/thop/parallel/test_custom_ops.py (1): custom_ops (37-42)
- tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py (1): cuda_cached_causal_conv1d_wrapper (245-247)

tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py (1)
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1): args (288-290)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (9)
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py (2)
1-15: LGTM! Copyright header properly added.

The NVIDIA copyright header with Apache 2.0 license has been correctly added to this file, as required by the coding guidelines.
104-119: LGTM! Wrapper integration is clean.

The switch from using the raw operator to `cuda_cached_causal_conv1d_wrapper` is implemented correctly. The wrapper is consistently used in both pattern matching (line 109) and fusion (line 119), which aligns with the in-place modification semantics introduced in the CUDA backend.

tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py (3)
1-15: LGTM! Copyright header properly added.

The NVIDIA copyright header with Apache 2.0 license has been correctly added to this file, as required by the coding guidelines.
207-208: LGTM! Clean initialization pattern.

Initializing `y_prefill` and `y_decode` to `None` enables the dispatch-based return logic and helps eliminate redundant tensor allocations, aligning with the PR objectives.
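A minimal sketch of the dispatch-based assembly this enables (tensor names follow the review comment; shapes and the concatenation dimension are assumptions):

```python
import torch


def assemble_output(y_prefill, y_decode):
    # Only the paths that actually ran contribute to the final output,
    # avoiding a pre-allocated buffer plus copies when one path is empty.
    if y_prefill is not None and y_decode is not None:
        return torch.cat([y_prefill, y_decode], dim=0)
    return y_prefill if y_prefill is not None else y_decode
```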
282-294: I'll help you verify the shape compatibility in the dispatch return logic. Let me start by examining the mamba triton backend file and locating the relevant function definitions and shape flow:

```bash
# Find the file and check its size
wc -l tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py

# Find mamba_chunk_scan_combined and selective_state_update signatures
rg -n "def mamba_chunk_scan_combined|def selective_state_update" tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py

# Read the critical section around lines 282-294 and some context before/after
sed -n '250,310p' tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py

# Search for where y_prefill and y_decode are created
rg -n "y_prefill|y_decode" tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py | head -30

# Get broader context on y_prefill and y_decode
sed -n '200,295p' tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py

# Search across the entire codebase for the function definitions
rg -n "def mamba_chunk_scan_combined" --type py
rg -n "def selective_state_update" --type py

# Look for the function calls in the triton_backend_mamba file
rg -B5 -A5 "mamba_chunk_scan_combined|selective_state_update" tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py

# Read the entire file to understand context better
wc -l tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py && echo "---" && cat -n tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py
```

tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py (4)
1-15: LGTM! Copyright header properly added.

The NVIDIA copyright header with Apache 2.0 license has been correctly added to this file, as required by the coding guidelines.
112-140: LGTM! In-place semantics properly declared.

The changes correctly implement in-place mutation semantics:
- `mutates_args={"input"}` properly declares the mutation to PyTorch
- The return type changed to `None`, consistent with in-place operations
- Documentation clearly states the in-place behavior

This refactoring aligns with the PR objective of removing redundant copies.
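For reference, a hedged sketch of declaring an in-place custom op this way with PyTorch's custom-op API (the op name, signature, and depthwise-conv body below are illustrative, not the actual registration in this file):

```python
import torch
import torch.nn.functional as F


@torch.library.custom_op("demo::cached_causal_conv1d_", mutates_args={"inp"})
def cached_causal_conv1d_(inp: torch.Tensor, weight: torch.Tensor) -> None:
    # inp: [batch, dim, seqlen]; a causal depthwise conv written back into inp.
    dim, k = weight.shape
    out = F.conv1d(F.pad(inp, (k - 1, 0)), weight.unsqueeze(1), groups=dim)
    inp.copy_(out)


@cached_causal_conv1d_.register_fake
def _(inp: torch.Tensor, weight: torch.Tensor) -> None:
    return None  # fake returns None, consistent with the in-place semantics
```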
198-199: LGTM! Prefill path correctly implements in-place modification.

The scatter operation `inp_flat[:total_prefill_tokens] = y_varlen.transpose(0, 1)` correctly writes the results back to the input buffer. Since `inp_flat` is a view of `input` (line 156), the modifications properly propagate.
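A tiny self-contained illustration of that view-based write-back (tensor shapes are arbitrary for the example):

```python
import torch

inp = torch.zeros(2, 4, 8)
inp_flat = inp.flatten(0, 1)     # a view over inp's storage
inp_flat[:3] = 1.0               # scatter into the view ...
assert inp[0, :3].eq(1.0).all()  # ... propagates to the original tensor
```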
242-242: LGTM! Fake registration and wrapper exposure are correct.

- Line 242: The fake function correctly returns `None`, consistent with the in-place operation
- Line 273: The wrapper is correctly returned by `get_cached_attention_op`, providing the public API while encapsulating the in-place semantics

Also applies to: 273-273
/bot run

PR_Github #25781 [ run ] triggered by Bot. Commit:

PR_Github #25781 [ run ] completed with state

/bot run

PR_Github #25823 [ run ] triggered by Bot. Commit:

PR_Github #25823 [ run ] completed with state

/bot run

PR_Github #25867 [ run ] triggered by Bot. Commit:

PR_Github #25867 [ run ] completed with state
Signed-off-by: Chenghao Zhang <[email protected]>
/bot run

PR_Github #25883 [ run ] triggered by Bot. Commit:

PR_Github #25883 [ run ] completed with state
Part 2 for #9344
This PR removes the redundant copies after the causal conv and after the SSM.
Added the NVIDIA copyright header to the files that I touched.
Summary by CodeRabbit
- Performance Improvements
- Refactor