@brb-nv brb-nv commented Nov 20, 2025

Description

This MR integrates helix parallelism, an experimental feature, into TRT-LLM.

Background:

  • Helix parallelism is a decode-only context parallelism method. It is therefore used in a disaggregated setting, where only the gen servers run helix.
  • It shards each request's seqlen across multiple CP (context parallel) ranks.
  • For a given query token in the decode phase, each CP rank computes “local attention” w.r.t. the previous tokens it holds.
  • Subsequent communication among the CP ranks “corrects” the local attention results so that the final attention computation is exact.
  • Since KV parallelism applies only to the attn layer, the CP GPUs are "repurposed" as TP GPUs for the FFN layer.
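
The "correction" step can be illustrated with the standard log-sum-exp merge of partial softmax results. This is a minimal sketch of the math only, under the assumption that the correction is this kind of rescaling; it is not the kernel code in this MR. `local_attention` and `merge_partials` are hypothetical names, and plain Python lists stand in for tensors.

```python
import math


def local_attention(q, keys, values):
    """Attend one query over a local KV shard; return (output, log-sum-exp)."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom
           for d in range(dim)]
    return out, peak + math.log(denom)


def merge_partials(partials):
    """Rescale each rank's output by its share of softmax mass and combine."""
    peak = max(lse for _, lse in partials)
    scales = [math.exp(lse - peak) for _, lse in partials]
    total = sum(scales)
    dim = len(partials[0][0])
    return [sum(s * out[d] for s, (out, _) in zip(scales, partials)) / total
            for d in range(dim)]


# Hypothetical data: 7 cached tokens split 4/3 across two CP ranks.
q = [0.5, -1.0]
keys = [[0.1 * i, 0.2 - 0.05 * i] for i in range(7)]
values = [[float(i), float(i * i)] for i in range(7)]

full, _ = local_attention(q, keys, values)
merged = merge_partials([
    local_attention(q, keys[:4], values[:4]),
    local_attention(q, keys[4:], values[4:]),
])
```

Here `merged` agrees with `full` up to floating-point rounding, which is the sense in which the sharded computation stays exact.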

Changes in this MR:

  • At a broader level, we enable helix parallelism with DeepseekV3 and add a disagg integration test (a smoke test for now).
  • Example to explain the core changes:
    • Suppose we are handling the first decode step for a request with ISL 7, and the gen server has two-way context parallelism, i.e. cpSize=2.
    • Say the first 4 tokens reside on cpRank0 and the next 3 tokens reside on cpRank1.
    • We have an incoming query token, q7 (corresponding to the first generated token). While local attn for q7 is computed on both cpRanks, its KV cache is written to only one cpRank (rank1 in the example), and kv7 is included in local attn only on that rank. We call this rank the "active helix rank".
  • Known limitation: Currently only the last CP rank is treated as the active rank. This will be lifted in a follow-up MR.
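
The bookkeeping in this example can be sketched as follows. The function names are hypothetical, and the even contiguous split is an assumption chosen to reproduce the 4/3 layout above; the real split may be block-based.

```python
def shard_lengths(isl, cp_size):
    """Split `isl` prompt tokens into contiguous per-rank KV lengths."""
    base, extra = divmod(isl, cp_size)
    return [base + (1 if rank < extra else 0) for rank in range(cp_size)]


def decode_step(kv_lens):
    """Append the new token's KV only on the active (last) CP rank."""
    active_rank = len(kv_lens) - 1
    return [n + (1 if rank == active_rank else 0)
            for rank, n in enumerate(kv_lens)]


before = shard_lengths(isl=7, cp_size=2)  # per-rank KV lengths after prefill
after = decode_step(before)               # lengths after the first decode step
```

With ISL 7 and cpSize=2, only the entry for rank1 grows after the decode step, which is the behavior the rest of the MR enforces.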

Most changes in this MR enforce this:

  • KV cache for the query token is added only on the active rank in resource_manager.py.
  • The actual KV cache write happens in the MLA RoPE kernels; the kernel changes skip the write on inactive ranks.
  • The number of tokens considered in local attn computation is determined by seq_len_kv in trtllm.py, which is adjusted accordingly.

"Repurposing" attn CP ranks as FFN TP ranks can get quite messy. To keep this readable:

  • We pass a mapping with CP only to the attention layers in modeling_deepseekv3.py and a mapping without CP to the rest.
  • We use a similar trick in communicator.py to obtain the right TP groups.
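
A minimal sketch of the two-mapping idea, assuming only that the CP dimension is folded into TP (tp_size = tp * cp). `ToyMapping` and `without_cp` are hypothetical stand-ins that track sizes only; rank assignment is deliberately left out.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyMapping:
    """Hypothetical stand-in for the real Mapping class; sizes only."""
    tp_size: int
    cp_size: int


def without_cp(mapping):
    """Fold the CP dimension into TP for the non-attention layers."""
    return replace(mapping,
                   tp_size=mapping.tp_size * mapping.cp_size,
                   cp_size=1)


mapping_with_cp = ToyMapping(tp_size=1, cp_size=2)  # for attention layers
mapping_for_ffn = without_cp(mapping_with_cp)       # for FFN and the rest
```

Using an immutable copy rather than mutating the shared mapping keeps the attention layers' view unchanged.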

Test Coverage

$ pytest tests/unittest/_torch/modules/test_mla_helix.py -s -v
$ TRTLLM_USE_UCX_KVCACHE=1 TLLM_LOG_LEVEL=INFO pytest tests/integration/defs/disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_bf16_tllm_gen_helix -s -v

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added context parallelism support with Helix-based distributed inference capabilities
    • DeepSeekV3 model now supports context parallelism for enhanced performance on multi-GPU setups
    • New --cp_size command-line argument for configuring context parallel size (default: 1)
    • Enhanced disaggregated serving configuration for context-tensor parallel distribution
  • Tests

    • Added new test configuration for disaggregated DeepSeekV3 inference with context parallelism


@brb-nv brb-nv changed the title from "User/brb/integrate helix on main redo mr" to "[None][feat] Integrate helix parallelism" Nov 20, 2025
@brb-nv brb-nv force-pushed the user/brb/integrate-helix-on-main-redo-mr branch 2 times, most recently from 812dfb9 to 50436a1 November 21, 2025 17:43
@brb-nv brb-nv marked this pull request as ready for review November 21, 2025 17:51
@brb-nv brb-nv requested review from a team as code owners November 21, 2025 17:51
coderabbitai bot commented Nov 21, 2025

📝 Walkthrough

This pull request implements context parallelism support with Helix configuration across the TensorRT-LLM inference stack. It adds per-rank inactivity tracking (helix_is_inactive_rank) to CUDA kernels and Python layers, introduces CP size configuration parameters, implements mapping repurposing logic for CP/TP distribution, and extends model initialization and executor logic to handle inactive Helix ranks during generation.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| CUDA Kernel Signatures | cpp/tensorrt_llm/kernels/mlaKernels.cu, cpp/tensorrt_llm/kernels/mlaKernels.h | Added helix_is_inactive_rank boolean pointer parameter to MLA rope generation kernel signatures; threaded through kernel invocations to gate token processing and K/V updates based on rank inactivity status. |
| Tensor Operations & Rope Generation | cpp/tensorrt_llm/thop/attentionOp.cpp, cpp/tensorrt_llm/thop/dsv3RopeOp.cpp | Extended MLA tensor parameter handling to expect and forward two tensors (helix_position_offsets, helix_is_inactive_rank); added new field to MlaRopeGenArgs struct and propagated inactive rank mask through rope generation pipelines. |
| Torch Attention Backend | tensorrt_llm/_torch/attention_backend/trtllm.py | Added helix_position_offsets and helix_is_inactive_rank to plan/forward/mla_rope_generation APIs; extended TrtllmAttentionMetadata with inactive rank tracking; adjusted KV length planning to exclude inactive rank contributions. |
| Distributed Communication | tensorrt_llm/_torch/distributed/communicator.py | Implemented early CP communicator creation and mapping repurposing logic: when cp_size > 1, creates a copy with Helix mapping, scales TP by CP size, and restores original mapping after TP/PP communicator initialization. |
| Model Architecture | tensorrt_llm/_torch/models/modeling_deepseekv3.py | Extended DeepseekV3 layer constructors with optional mapping_with_cp parameter; added CP rank/size extraction and weight-split logic for KV projection; implemented mapping repurposing during model initialization for cp_size > 1. |
| Attention Modules | tensorrt_llm/_torch/modules/attention.py | Added mapping_with_cp parameter to MLA and Attention constructors; enforced num_heads equality and Helix CP type validation; updated forward paths to propagate helix parameters and support position_ids threading. |
| Executor & Resource Management | tensorrt_llm/_torch/pyexecutor/executor_request_queue.py, tensorrt_llm/_torch/pyexecutor/llm_request.py, tensorrt_llm/_torch/pyexecutor/model_engine.py, tensorrt_llm/_torch/pyexecutor/resource_manager.py | Added py_helix_is_inactive_rank flag to LlmRequest; implemented helix inactive rank tracking in model engine with conditional position/token calculations; gated KV cache allocation for inactive ranks in resource manager; extended AttentionMetadata with inactive rank exposure. |
| CLI & Configuration | examples/llm-api/quickstart_advanced.py, tensorrt_llm/commands/serve.py | Added --cp_size and cp_config command-line arguments; propagated context_parallel_size through LLM initialization; implemented cp_type string-to-enum conversion with validation. |
| Infrastructure & Mapping | tensorrt_llm/llmapi/disagg_utils.py, tensorrt_llm/mapping.py | Updated instance rank calculation to include context_parallel_size; added hardcoded Helix CP type fallback when cp_size > 1 to override externally provided cp_config. |
| Test Infrastructure | tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2_gentp1cp2_deepseek_v3_lite_bf16_tllm_gen.yaml, tests/integration/defs/disaggregated/test_disaggregated.py | Added new disaggregated test configuration file for context TP and generation Helix setup; introduced test_disaggregated_deepseek_v3_lite_bf16_tllm_gen_helix test case with model symlink setup. |

Sequence Diagram(s)

sequenceDiagram
    participant Request
    participant ResourceMgr as Resource<br/>Manager
    participant ModelEngine
    participant AttentionBE as Attention<br/>Backend
    participant MLAKernel as MLA<br/>Kernel

    Request->>ResourceMgr: prepare_resources()
    activate ResourceMgr
    alt cp_size > 1 and not last rank
        ResourceMgr->>ResourceMgr: mark py_helix_is_inactive_rank=true
        ResourceMgr->>ResourceMgr: skip KV cache allocation
    else active rank
        ResourceMgr->>ResourceMgr: allocate KV cache normally
    end
    deactivate ResourceMgr

    Request->>ModelEngine: forward pass (generation)
    activate ModelEngine
    alt helix_is_inactive_rank[batch]==true
        ModelEngine->>ModelEngine: fix past_seen_token_num<br/>(no increment)
        ModelEngine->>ModelEngine: skip token processing
    else active
        ModelEngine->>ModelEngine: increment past_seen_token_num
        ModelEngine->>AttentionBE: plan() with helix params
    end
    deactivate ModelEngine

    AttentionBE->>AttentionBE: adjust kv_lens planning<br/>(exclude inactive ranks)
    AttentionBE->>MLAKernel: mla_rope_generation<br/>(helix_is_inactive_rank)
    activate MLAKernel
    alt helix_is_inactive_rank[batch]==true
        MLAKernel->>MLAKernel: skip token processing
        MLAKernel->>MLAKernel: skip K/V updates
    else active
        MLAKernel->>MLAKernel: apply rope & assign QKV
        MLAKernel->>MLAKernel: update K/V cache
    end
    deactivate MLAKernel

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring extra attention:

  • Mapping repurposing logic (communicator.py, modeling_deepseekv3.py, mapping.py): Core logic for switching between CP and TP distributions; mutations and restorations must be correctly sequenced and scoped to avoid state leaks.
  • KV length planning adjustments (trtllm.py, model_engine.py): Changes to how KV cache lengths are calculated when inactive ranks are present; verify accounting is correct for all rank states.
  • Warmup control flow (model_engine.py): Conditional position_id and past_seen_token_num calculations based on warmup state and inactivity; ensure all branches are consistent.
  • Cross-layer parameter threading (executor_request_queue.py, model_engine.py, resource_manager.py): helix_is_inactive_rank flows through multiple abstraction layers; verify end-to-end propagation and type conversions (bool → tensor → pointer).
  • Model initialization side effects (modeling_deepseekv3.py): Temporary mapping mutations during model construction; verify original mapping is reliably restored even on error paths.
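
For the last point, one way to guarantee restoration on error paths is a context manager with try/finally. This is a sketch under assumptions: `temporary_mapping` is a hypothetical helper, and a `SimpleNamespace` with string values stands in for the real model config and mappings.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_mapping(config, new_mapping):
    """Swap config.mapping for the duration of the block, then restore it."""
    original = config.mapping
    config.mapping = new_mapping
    try:
        yield config
    finally:
        # Runs on normal exit and when the body raises.
        config.mapping = original


config = SimpleNamespace(mapping="cp-aware")  # hypothetical config object
seen = []
try:
    with temporary_mapping(config, "repurposed"):
        seen.append(config.mapping)
        raise RuntimeError("simulated failure during model construction")
except RuntimeError:
    pass
```

After the simulated failure, `config.mapping` is back to its original value, so no repurposed state leaks to later callers.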

Suggested reviewers

  • schetlur-nv
  • nvchenghaoz
  • Shixiaowei02
  • Superjomn
  • Tabrizian
  • Funatiq
  • QiJune

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 34.04% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title '[None][feat] Integrate helix parallelism' directly and accurately summarizes the main change, integrating helix parallelism support into TensorRT-LLM, which is clearly the primary focus of this PR. |
| Description check | ✅ Passed | The PR description provides a comprehensive explanation of helix parallelism, background context, specific implementation details with an example, test coverage commands, and confirmation of the PR checklist. |
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
tensorrt_llm/mapping.py (1)

455-491: Don’t silently override cp_config to HELIX for any cp_size > 1

This block in Mapping.__init__:

#################################################################
# TODO: Remove this hardcoding and obtain cp_config from llm_args.
if cp_size > 1:
    cp_config = {"cp_type": CpType.HELIX}
#################################################################

has broad side effects:

  • Any caller that provides a non-Helix cp_config (e.g. STAR or ULYSSES) with cp_size > 1 now gets that configuration silently discarded and treated as HELIX.
  • Code that branches on cp_config["cp_type"] (e.g. _merge_requests in executor_request_queue.py, STAR attention paths, etc.) will never see CpType.STAR/ULYSSES once cp_size > 1, effectively breaking those CP modes.
  • Additional cp_config fields (like STAR’s block_size / cp_anchor_size, or future Helix parameters) are lost.

If the intent is “for now we only support Helix when cp_size > 1”, it’s safer to:

  • Only inject a default when cp_config is missing; and
  • Fail fast on conflicting configs instead of overriding them:
# Temporary default until cp_config is fully plumbed from llm_args.
if cp_size > 1:
    if cp_config is None:
        cp_config = {"cp_type": CpType.HELIX}
    elif cp_config.get("cp_type") != CpType.HELIX:
        raise ValueError(
            f"Only CpType.HELIX is currently supported when cp_size > 1; got {cp_config.get('cp_type')!r}"
        )

That keeps Helix as the only supported multi-CP mode in this PR, but avoids surprising behavior for existing STAR/ULYSSES configs and makes future extension to other CP types straightforward.

tensorrt_llm/_torch/pyexecutor/model_engine.py (2)

1568-1623: Tighten helix_is_inactive_rank initialization guard; verify warmup dummy request semantics for Helix

The new Helix logic is mostly sound, but there is one definite initialization bug and one edge case to confirm:

  1. helix_is_inactive_rank initialization guard is incorrect

    The current initialization at line 1568:

    helix_is_inactive_rank = [] if self.mapping.cp_size > 1 else None

    initializes to an empty list for all CP configurations with cp_size > 1, but has_cp_helix() returns True only when both cp_size > 1 and cp_type == CpType.HELIX. For non-Helix CP types (e.g., regular CP or other variants), this creates an empty list that never gets populated, diverging from the None state that downstream consumers expect when Helix is disabled.

    Fix: Change line 1568 to:

    helix_is_inactive_rank = [] if self.mapping.has_cp_helix() else None
  2. Warmup + Helix: past_seen_token_num override semantics need verification

    During warmup, you correctly skip the position_id computation (line 1605), but past_seen_token_num is unconditionally overridden based on request.orig_prompt_len (lines 1608–1612) whenever Helix is active. This is fed into num_cached_tokens_per_seq, which becomes part of KVCacheParams. For dummy warmup requests, ensure:

    • orig_prompt_len is consistently initialized for all dummy request types created during warmup, and
    • the resulting KV cache index values remain within valid bounds on inactive Helix ranks.
  3. Per-request inactivity flag wiring looks correct

    The per-beam append pattern (lines 1572–1619) produces a helix_is_inactive_rank list with length equal to the total batch size (sum of beam widths), which matches the attention backend's [batch_size] expectation.
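
    The guard and the per-beam append pattern from points 1 and 3 can be sketched together. This is illustrative only: `build_inactive_flags` is a hypothetical helper, and requests are modeled as plain dicts rather than LlmRequest objects.

```python
def build_inactive_flags(requests, has_cp_helix):
    """Return one flag per beam, or None when Helix is disabled."""
    if not has_cp_helix:
        return None  # non-Helix CP types keep the "disabled" sentinel
    flags = []
    for req in requests:
        # One entry per beam so the length matches the flattened batch size.
        flags.extend([req["inactive"]] * req["beam_width"])
    return flags


requests = [
    {"inactive": True, "beam_width": 2},
    {"inactive": False, "beam_width": 1},
]
helix_flags = build_inactive_flags(requests, has_cp_helix=True)
other_flags = build_inactive_flags(requests, has_cp_helix=False)
```

    Keying the guard on the Helix check rather than on cp_size alone means other CP types see None instead of an empty list.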


2526-2537: Behavioral inconsistency confirmed: ULYSSES passes warmup checks but fails at runtime

The review concern is valid. I found that the change introduces a systemic breaking behavior across three PyExecutor methods:

  1. model_engine.py._prepare_inputs (line 2536): Raises for non-STAR/HELIX
  2. executor_request_queue.py._merge_requests (line 725): Raises for non-STAR/HELIX
  3. py_executor.py._update_request_states (line 2072): Raises for non-STAR/HELIX

The critical inconsistency:

  • Warmup check (model_engine.py line 564) accepts ULYSSES and returns early
  • Runtime execution (line 2536) raises NotImplementedError if ULYSSES reaches _prepare_inputs
  • This means if someone configures PyExecutor with cp_type=ULYSSES, it will pass initialization but crash during inference

ULYSSES is defined in the CpType enum and explicitly referenced at line 564, indicating it was intended to be handled. However, no fallback path exists in the three runtime dispatch methods, and no test coverage was found for ULYSSES with PyExecutor. The previous behavior would have silently fallen through to the default _prepare_tp_inputs path.

While there's no evidence that existing code uses ULYSSES with PyExecutor, the enum inclusion and warmup-time acceptance create an expectation of support that the runtime contradicts.
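
One way to remove the mismatch is to route both warmup and runtime through a single gate. The sketch below is hypothetical: `CpType` here is a local stand-in for the real enum (whose members and values may differ), and `check_cp_type` is not an existing function.

```python
from enum import Enum, auto


class CpType(Enum):
    """Local stand-in; the real enum lives in tensorrt_llm and may differ."""
    ULYSSES = auto()
    STAR = auto()
    HELIX = auto()


SUPPORTED_CP_TYPES = frozenset({CpType.STAR, CpType.HELIX})


def check_cp_type(cp_type):
    """Single gate meant to be shared by the warmup and runtime paths."""
    if cp_type not in SUPPORTED_CP_TYPES:
        raise NotImplementedError(f"Unsupported cp_type: {cp_type.name}")
    return cp_type
```

Calling the same check during initialization would surface an unsupported ULYSSES configuration before inference starts rather than in _prepare_inputs.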

tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)

1446-1451: Fix TP sharding after restoring the original mapping

During DeepseekV3ForCausalLM.__init__ we repurpose CP ranks into TP by installing a temporary Mapping (tp_size = tp * cp). All decoder/MTP modules capture that object via self.mapping. Later we restore model_config.mapping back to the original CP-aware mapping. Here in DeepseekV3MTP.forward, the chunking uses self.model_config.mapping.tp_size/tp_rank, which now point to the restored mapping and no longer match the row-parallel tensors created with the repurposed mapping. On Helix runs (cp_size > 1) this leaves each rank feeding the wrong slice (or no slice) into eh_proj, breaking generation.

Use the same mapping object that the layer captured during init. A minimal fix:

-        tp_size = self.model_config.mapping.tp_size
-        tp_rank = self.model_config.mapping.tp_rank
+        tp_size = self.mapping.tp_size
+        tp_rank = self.mapping.tp_rank

That keeps the MTP sharding consistent with the repurposed TP groups.
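
A toy illustration of why the two mappings disagree, assuming a simple contiguous split. `tp_chunk` is a hypothetical helper, and the sizes and ranks are made-up values for a cp_size=2 run.

```python
def tp_chunk(hidden, tp_size, tp_rank):
    """Return this rank's contiguous slice of a hidden vector."""
    assert len(hidden) % tp_size == 0
    width = len(hidden) // tp_size
    return hidden[tp_rank * width:(tp_rank + 1) * width]


hidden = list(range(8))
# Mapping captured at init (CP folded into TP): tp_size=2.
expected = tp_chunk(hidden, tp_size=2, tp_rank=1)
# Restored CP-aware mapping: tp_size=1, so the slice no longer matches.
actual = tp_chunk(hidden, tp_size=1, tp_rank=0)
```

The row-parallel weights were built for the `expected` slice width, so feeding `actual` would mismatch in shape or content.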

♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (1)

645-705: Helix merge: avoid hardcoded tokens_per_block and ensure total_input_len_cp is available on children

Two points in the Helix path:

  1. Hardcoded tokens_per_block=32
elif cp_type == CpType.HELIX:
    return self._merge_helix_requests(
        new_requests,
        tokens_per_block=32)
        # tokens_per_block=cp_config['tokens_per_block'])
  • This ignores any configured Helix block size (e.g. via cp_config['tokens_per_block'] or KV cache config) and makes the behavior fragile if someone changes the configured block size away from 32.
  • It also repeats a TODO you already noted to remove this hardcoding.

Suggestion:

  • Prefer pulling from config with a safe default + assert, e.g.:
tokens_per_block = cp_config.get('tokens_per_block', 32)
assert tokens_per_block > 0
return self._merge_helix_requests(new_requests, tokens_per_block=tokens_per_block)

or at minimum assert that a configured value, if present, matches 32 so misconfigurations fail loudly instead of silently diverging.

  2. total_input_len_cp not propagated to child requests
req = executor_request_to_llm_request(...)
req.total_input_len_cp = input_len
req_with_children.append(req)
if req.child_requests:
    req_with_children.extend(req.child_requests)
  • executor_request_to_llm_request creates child requests via LlmRequest.create_child_request, which only copies attributes whose names start with py_.
  • As a result, total_input_len_cp exists only on the parent; any downstream code that expects this attribute on every LlmRequest (including children when num_return_sequences > 1) will not find it.

Possible fix:

  • Either rename to follow the py_ convention so it’s auto-copied:
req.py_total_input_len_cp = input_len
for child in req.child_requests:
    child.py_total_input_len_cp = input_len
  • Or, if you deliberately want a non-py_ attribute, explicitly set it on children in this loop.

This will keep Helix metadata consistent across parent and child requests and future-proof the code against differing tokens_per_block configs.

#!/bin/bash
# Check how Helix-related fields are used so they stay consistent.
rg -n "total_input_len_cp" -C3
rg -n "tokens_per_block" tensorrt_llm/_torch/pyexecutor -C3

Also applies to: 710-723
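
The prefix-based copy behavior described in point 2 can be reproduced with a toy class. `ToyRequest` and `create_child` are hypothetical stand-ins that mimic the described convention; they are not the LlmRequest implementation.

```python
class ToyRequest:
    """Hypothetical request; only attributes prefixed `py_` reach children."""

    def __init__(self):
        self.child_requests = []

    def create_child(self):
        child = ToyRequest()
        for name, value in vars(self).items():
            if name.startswith("py_"):
                setattr(child, name, value)
        self.child_requests.append(child)
        return child


parent = ToyRequest()
parent.total_input_len_cp = 7     # no py_ prefix: stays on the parent
parent.py_total_input_len_cp = 7  # py_ prefix: copied to the child
child = parent.create_child()
```

Note that in the merge code the attribute is set after the children already exist, so the explicit loop over child_requests is still needed even with the py_ name.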

🧹 Nitpick comments (5)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

440-449: Decode-time KV allocation correctly gated on active Helix rank

Marking req.py_helix_is_inactive_rank on non-last CP ranks when has_cp_helix() and skipping add_token there ensures only the active Helix rank allocates decode-time KV cache, which matches the design.

You might consider using mapping.is_last_cp_rank() (and/or setting this flag once at request construction) for slightly clearer intent, but the current logic is functionally sound.

examples/llm-api/quickstart_advanced.py (1)

71-76: cp_size flag and context_parallel_size wiring are consistent

The new --cp_size argument and its use as context_parallel_size=args.cp_size in the LLM constructor align with the new CP plumbing. The change is self-contained and doesn’t affect existing callers.

Optionally, you might extend the help string for --cp_size to mention that multi-CP currently implies Helix in this flow, so users know what they’re opting into.

Also applies to: 261-264
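
A sketch of what the extended help string could look like. The wording is a suggestion, `build_parser` is a hypothetical helper, and the default of 1 follows the release notes above.

```python
import argparse


def build_parser():
    """Build a parser with only the --cp_size flag, for illustration."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--cp_size",
        type=int,
        default=1,
        help="Context parallel size. Values > 1 currently imply Helix.",
    )
    return parser


args = build_parser().parse_args(["--cp_size", "2"])
defaults = build_parser().parse_args([])
```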

tests/integration/defs/disaggregated/test_disaggregated.py (1)

154-274: New DeepSeek V3 Lite bf16 Helix disaggregated test wiring looks consistent

The new config entry and test_disaggregated_deepseek_v3_lite_bf16_tllm_gen_helix follow the same symlink + run_disaggregated_test pattern as the existing DeepSeek tests, and the test_desc string matches the key added to config_map, so the wiring looks correct.

If you want to silence Ruff’s ARG001 warning, you could rename disaggregated_test_root to _disaggregated_test_root in the new test (or add a # noqa: ARG001), but that’s cosmetic and consistent with the rest of this file.

Also applies to: 1915-1933

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

12-13: Remove duplicate LlmRequest import

LlmRequest is imported twice in this file (here and again at line 62). You can safely drop the earlier import and keep the one that also brings in get_draft_token_length:

-from .llm_request import LlmRequest
-
 import torch

This keeps imports minimal without changing behavior.

tensorrt_llm/commands/serve.py (1)

5-5: Drop the duplicate gc import
Line 2 already imports gc, so this second import triggers Ruff F811 (Redefinition of unused gc). Please drop the extra line to keep lint happy.

-import gc
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9b2abb8 and 50436a1.

📒 Files selected for processing (18)
  • cpp/tensorrt_llm/kernels/mlaKernels.cu (4 hunks)
  • cpp/tensorrt_llm/kernels/mlaKernels.h (1 hunks)
  • cpp/tensorrt_llm/thop/attentionOp.cpp (2 hunks)
  • cpp/tensorrt_llm/thop/dsv3RopeOp.cpp (6 hunks)
  • examples/llm-api/quickstart_advanced.py (2 hunks)
  • tensorrt_llm/_torch/attention_backend/trtllm.py (9 hunks)
  • tensorrt_llm/_torch/distributed/communicator.py (2 hunks)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (12 hunks)
  • tensorrt_llm/_torch/modules/attention.py (10 hunks)
  • tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (5 hunks)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py (1 hunks)
  • tensorrt_llm/commands/serve.py (7 hunks)
  • tensorrt_llm/llmapi/disagg_utils.py (1 hunks)
  • tensorrt_llm/mapping.py (1 hunks)
  • tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2_gentp1cp2_deepseek_v3_lite_bf16_tllm_gen.yaml (1 hunks)
  • tests/integration/defs/disaggregated/test_disaggregated.py (2 hunks)
🧰 Additional context used
🧠 Learnings (27)
📓 Common learnings
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.
📚 Learning: 2025-08-14T15:43:23.107Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.

Applied to files:

  • cpp/tensorrt_llm/thop/attentionOp.cpp
  • tensorrt_llm/_torch/attention_backend/trtllm.py
  • tensorrt_llm/_torch/modules/attention.py
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Applied to files:

  • cpp/tensorrt_llm/thop/attentionOp.cpp
  • tensorrt_llm/_torch/attention_backend/trtllm.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.

Applied to files:

  • cpp/tensorrt_llm/thop/attentionOp.cpp
  • tensorrt_llm/llmapi/disagg_utils.py
  • cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
  • tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.

Applied to files:

  • cpp/tensorrt_llm/thop/attentionOp.cpp
  • tensorrt_llm/llmapi/disagg_utils.py
  • cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
  • tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tensorrt_llm/thop/attentionOp.cpp
  • cpp/tensorrt_llm/kernels/mlaKernels.cu
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/_torch/attention_backend/trtllm.py
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • cpp/tensorrt_llm/thop/attentionOp.cpp
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/_torch/attention_backend/trtllm.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/thop/attentionOp.cpp
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • cpp/tensorrt_llm/thop/attentionOp.cpp
  • tensorrt_llm/llmapi/disagg_utils.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.

Applied to files:

  • cpp/tensorrt_llm/thop/attentionOp.cpp
  • tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.

Applied to files:

  • tensorrt_llm/llmapi/disagg_utils.py
  • cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
  • tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/attention_backend/trtllm.py
📚 Learning: 2025-09-02T13:42:44.885Z
Learnt from: pcastonguay
Repo: NVIDIA/TensorRT-LLM PR: 7455
File: tensorrt_llm/_torch/pyexecutor/py_executor.py:1852-1860
Timestamp: 2025-09-02T13:42:44.885Z
Learning: In MPI communication within TensorRT-LLM pipeline parallelism, different communication types (tokens, logits, termination sync) must use disjoint tag namespaces to avoid message routing collisions when using the same source/destination patterns.

Applied to files:

  • tensorrt_llm/_torch/distributed/communicator.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/defs/disaggregated/test_disaggregated.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tests/integration/defs/disaggregated/test_disaggregated.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tests/integration/defs/disaggregated/test_disaggregated.py
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/mlaKernels.cu
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.

Applied to files:

  • cpp/tensorrt_llm/kernels/mlaKernels.cu
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

  • cpp/tensorrt_llm/kernels/mlaKernels.cu
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tensorrt_llm/kernels/mlaKernels.cu
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.

Applied to files:

  • tensorrt_llm/mapping.py
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

  • tensorrt_llm/commands/serve.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

  • tensorrt_llm/commands/serve.py
📚 Learning: 2025-08-27T14:23:55.566Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.566Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.

Applied to files:

  • tensorrt_llm/_torch/modules/attention.py
🧬 Code graph analysis (12)
tensorrt_llm/_torch/distributed/communicator.py (2)
tensorrt_llm/mapping.py (3)
  • Mapping (351-515)
  • rank (187-188)
  • rank (191-198)
tensorrt_llm/llmapi/llm_args.py (2)
  • world_size (459-460)
  • world_size (469-473)
tests/integration/defs/disaggregated/test_disaggregated.py (2)
tests/integration/defs/conftest.py (4)
  • disaggregated_test_root (2618-2623)
  • disaggregated_example_root (285-290)
  • llm_venv (702-719)
  • deepseek_v3_model_root (616-631)
tests/integration/defs/local_venv.py (1)
  • get_working_directory (43-49)
cpp/tensorrt_llm/kernels/mlaKernels.cu (1)
cpp/tensorrt_llm/kernels/mlaKernels.h (2)
  • helix_position_offsets (109-134)
  • helix_is_inactive_rank (112-113)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (2)
tensorrt_llm/_torch/distributed/communicator.py (5)
  • tp_size (64-65)
  • has_pp (52-53)
  • cp_size (56-57)
  • rank (40-41)
  • rank (457-458)
tensorrt_llm/mapping.py (3)
  • has_pp (258-259)
  • rank (187-188)
  • rank (191-198)
tensorrt_llm/mapping.py (1)
tensorrt_llm/_torch/distributed/communicator.py (2)
  • cp_size (56-57)
  • cp_config (108-109)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (4)
tensorrt_llm/runtime/model_runner.py (1)
  • mapping (824-825)
tensorrt_llm/_torch/distributed/communicator.py (3)
  • has_cp_helix (104-105)
  • cp_rank (68-69)
  • cp_size (56-57)
tensorrt_llm/mapping.py (2)
  • has_cp_helix (233-235)
  • cp_rank (534-535)
tensorrt_llm/_torch/device_mesh.py (1)
  • cp_rank (84-86)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (3)
tensorrt_llm/_torch/distributed/communicator.py (8)
  • cp_size (56-57)
  • cp_rank (68-69)
  • tp_size (64-65)
  • world_size (44-45)
  • rank (40-41)
  • rank (457-458)
  • cp_config (108-109)
  • pp_size (60-61)
tensorrt_llm/mapping.py (4)
  • cp_rank (534-535)
  • Mapping (351-515)
  • rank (187-188)
  • rank (191-198)
tensorrt_llm/_torch/model_config.py (1)
  • ModelConfig (75-616)
tensorrt_llm/_torch/pyexecutor/model_engine.py (4)
tensorrt_llm/_torch/pyexecutor/llm_request.py (5)
  • LlmRequest (437-663)
  • append (101-127)
  • append (195-212)
  • cached_tokens (569-570)
  • cached_tokens (573-576)
tensorrt_llm/mapping.py (3)
  • CpType (24-32)
  • has_cp_helix (233-235)
  • cp_rank (534-535)
tensorrt_llm/_torch/distributed/communicator.py (3)
  • cp_size (56-57)
  • has_cp_helix (104-105)
  • cp_rank (68-69)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
  • is_warmup (344-345)
  • is_warmup (348-353)
tensorrt_llm/commands/serve.py (3)
tensorrt_llm/runtime/model_runner.py (1)
  • mapping (824-825)
tensorrt_llm/mapping.py (1)
  • CpType (24-32)
tensorrt_llm/_torch/distributed/communicator.py (4)
  • cp_config (108-109)
  • tp_size (64-65)
  • pp_size (60-61)
  • cp_size (56-57)
examples/llm-api/quickstart_advanced.py (1)
tensorrt_llm/_torch/distributed/communicator.py (1)
  • cp_size (56-57)
tensorrt_llm/_torch/attention_backend/trtllm.py (3)
cpp/tensorrt_llm/kernels/mlaKernels.h (2)
  • helix_position_offsets (109-134)
  • helix_is_inactive_rank (112-113)
tensorrt_llm/_torch/attention_backend/flashinfer.py (1)
  • cached_token_lens (116-118)
tensorrt_llm/_torch/attention_backend/interface.py (2)
  • seq_lens_kv (226-227)
  • seq_lens_kv (230-237)
tensorrt_llm/_torch/modules/attention.py (3)
cpp/tensorrt_llm/kernels/helixKernels.h (1)
  • tensorrt_llm (26-46)
tensorrt_llm/mapping.py (1)
  • CpType (24-32)
cpp/tensorrt_llm/kernels/mlaKernels.h (1)
  • helix_position_offsets (109-134)
🪛 Ruff (0.14.5)
tensorrt_llm/_torch/distributed/communicator.py

349-349: f-string without any placeholders

Remove extraneous f prefix

(F541)


367-367: f-string without any placeholders

Remove extraneous f prefix

(F541)

tests/integration/defs/disaggregated/test_disaggregated.py

1919-1919: Unused function argument: disaggregated_test_root

(ARG001)

tensorrt_llm/_torch/models/modeling_deepseekv3.py

1561-1561: Loop control variable idx not used within loop body

Rename unused idx to _idx

(B007)


1588-1588: f-string without any placeholders

Remove extraneous f prefix

(F541)


1660-1660: f-string without any placeholders

Remove extraneous f prefix

(F541)

tensorrt_llm/commands/serve.py

5-5: Redefinition of unused gc from line 2

Remove definition: gc

(F811)


130-131: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


130-131: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tensorrt_llm/llmapi/disagg_utils.py (1)

199-201: Including PP and CP in instance_num_ranks is consistent with mapping/world-size logic

Using tensor_parallel_size * pipeline_parallel_size * context_parallel_size makes the per-instance rank count match the full mapping used later in split_world_comm, while preserving backward compatibility when PP/CP are omitted (default 1). Looks good.
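The rank-count arithmetic described here can be sketched as follows. This is an illustration only: `instance_num_ranks` is a hypothetical helper operating on a plain dict, not the actual function in disagg_utils.py, and the defaults of 1 are taken from the comment above.

```python
def instance_num_ranks(cfg: dict) -> int:
    """Ranks needed by one server instance (hypothetical helper).

    Missing PP/CP entries default to 1, so TP-only configs keep
    their previous rank count.
    """
    tp = cfg.get("tensor_parallel_size", 1)
    pp = cfg.get("pipeline_parallel_size", 1)
    cp = cfg.get("context_parallel_size", 1)
    return tp * pp * cp


# TP-only context server vs. a Helix generation server (TP1, CP2).
ctx_ranks = instance_num_ranks({"tensor_parallel_size": 2})
gen_ranks = instance_num_ranks(
    {"tensor_parallel_size": 1, "context_parallel_size": 2})
print(ctx_ranks, gen_ranks)
```

Both instances in this example need two ranks, which matches the ctxtp2/gentp1cp2 layout named in the test config.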

tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2_gentp1cp2_deepseek_v3_lite_bf16_tllm_gen.yaml (1)

1-32: Helix disaggregated gen config is internally consistent

Context and generation sections use TP/CP sizes in a way that matches the updated disaggregation logic (context_parallel_size only on the generation side for Helix decode-only). No issues spotted.

tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

441-513: py_helix_is_inactive_rank flag wiring is consistent

Initializing self.py_helix_is_inactive_rank = False alongside other py_* fields and relying on create_child_request’s py_ copying is exactly what the resource manager needs for per-request Helix inactivity tracking. No changes requested.

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (1)

316-320: CP-aware attachment of Python-only request metadata looks good

Extending the condition to (tp_size > 1 or has_pp or cp_size > 1) ensures CP-only topologies also receive Python-side attributes on non-root ranks, without affecting single-rank setups. This is the right direction.

cpp/tensorrt_llm/thop/attentionOp.cpp (1)

181-238: No changes needed; concerns are not supported by the codebase

This review comment identifies issues that do not exist in the actual code:

  • Backward compatibility: All current Python call sites already pass exactly 2 tensors in mla_tensor_params (lines 481 and 1771 in tensorrt_llm/_torch/attention_backend/trtllm.py). The size() == 2 check therefore enforces the existing pattern and is not a breaking change.

  • Uninitialized helix fields: The MlaParams struct explicitly initializes both helix_position_offsets{nullptr} and helix_is_inactive_rank{nullptr} as member initializers, ensuring they are always zero-initialized regardless of constructor form. The suggested MlaParams<T> mla_params{} is redundant.

  • API design: The required parameter without defaults aligns with TensorRT-LLM's intentional design for consistency in tensor parameter collections (documented in learnings from PR 6904).

Likely an incorrect or invalid review comment.

cpp/tensorrt_llm/kernels/mlaKernels.h (1)

108-112: Helix inactivity flag plumbed cleanly into MLA params

The new helix_is_inactive_rank pointer mirrors the existing helix_position_offsets pattern and keeps the struct’s nullable‑by‑default contract; no issues spotted here.

cpp/tensorrt_llm/thop/dsv3RopeOp.cpp (1)

45-70: Helix inactivity flag is correctly threaded through MLA rope generation

The new helix_is_inactive_rank pointer is plumbed cleanly from mla_tensor_params into MlaRopeGenArgs and then into MlaParams, with a strict but reasonable contract (mla_tensor_params.size() == 2). Optional handling via has_value() ? data_ptr : nullptr matches the existing pattern for helix_position_offsets. No functional issues spotted.

Also applies to: 88-111, 139-168, 277-283

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

545-566: Warmup gating by cp_type now correctly excludes only ULYSSES/STAR

The updated warmup logic:

cp_type = self.mapping.cp_config.get('cp_type', None)
if cp_type is not None:
    if cp_type in [CpType.ULYSSES, CpType.STAR]:
        return

means HELIX (and other non‑ULYSSES/STAR cp types) still run warmup, which is what you want for Helix CUDA graph / torch.compile specialization. This looks consistent with the new Helix integration and doesn’t affect non‑CP runs.
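A self-contained sketch of this gating, assuming a stand-in `CpType` enum with only the three members named in this comment (the real enum lives in tensorrt_llm/mapping.py and may have more). `should_run_warmup` is a hypothetical name used for illustration.

```python
from enum import Enum, auto


class CpType(Enum):
    # Hypothetical stand-in for the enum in tensorrt_llm/mapping.py.
    ULYSSES = auto()
    STAR = auto()
    HELIX = auto()


def should_run_warmup(cp_config: dict) -> bool:
    """Return False only for the CP types that skip warmup."""
    cp_type = cp_config.get("cp_type", None)
    if cp_type is not None and cp_type in (CpType.ULYSSES, CpType.STAR):
        return False
    return True


for cfg in ({}, {"cp_type": CpType.HELIX}, {"cp_type": CpType.STAR}):
    print(cfg, should_run_warmup(cfg))
```

An empty `cp_config` (no context parallelism) and HELIX both proceed to warmup; only ULYSSES and STAR return early.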

Comment on lines 855 to 873
if self.helix_is_inactive_rank is not None and len(
        self.helix_is_inactive_rank):
    # If helix is inactive, attend to the previously cached tokens only.
    # This gets further complicated with multiple requests as each request might
    # have a different active helix rank.
    assert cached_token_lens is not None, "cached_token_lens should be set for helix"
    kv_lens = cached_token_lens
    helix_is_inactive_rank_cpu = torch.tensor(
        self.helix_is_inactive_rank,
        dtype=torch.bool,
        device='cpu',
    )
    active_rank = ~helix_is_inactive_rank_cpu
    kv_lens[active_rank] += self.seq_lens_kv[active_rank]
else:
    kv_lens = cached_token_lens + self.seq_lens_kv if cached_token_lens is not None else self.seq_lens_kv

⚠️ Potential issue | 🟠 Major

Preserve cached token counts when gating inactive Helix ranks
kv_lens = cached_token_lens keeps both names pointing to the same tensor, so the in-place kv_lens[active_rank] += … also mutates cached_token_lens. Downstream code (e.g., prepare_context_mla_with_cached_kv) expects cached_token_lens to still represent only the pre-existing cache, so the extra in-place additions inflate every consumer (num_ctx_cached_tokens, max_ctx_cached_token_len, etc.). On Helix-enabled runs this corrupts scheduler state and can mis-size allocations or loop bounds.
Grab a detached copy before modifying:

-        kv_lens = cached_token_lens
+        kv_lens = cached_token_lens.clone()
         kv_lens[active_rank] += self.seq_lens_kv[active_rank]
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/attention_backend/trtllm.py around lines 855-870, kv_lens
is set to reference cached_token_lens and then mutated in-place, which corrupts
cached_token_lens for downstream consumers; fix by making kv_lens a copy (e.g.,
kv_lens = cached_token_lens.clone() or cached_token_lens.detach().clone()) so
subsequent kv_lens[active_rank] += ... mutates only the local copy and leaves
cached_token_lens unchanged, preserving expected cached counts.
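The aliasing problem raised in this comment can be reproduced without PyTorch. The sketch below uses plain Python lists in place of tensors, and the function names are hypothetical. The same reasoning applies to in-place tensor updates, where `.clone()` plays the role of `list(...)`.

```python
def kv_lens_aliased(cached_token_lens, seq_lens_kv, inactive):
    # Bug: kv_lens is just another name for cached_token_lens.
    kv_lens = cached_token_lens
    for i, is_inactive in enumerate(inactive):
        if not is_inactive:
            kv_lens[i] += seq_lens_kv[i]  # also mutates cached_token_lens
    return kv_lens


def kv_lens_copied(cached_token_lens, seq_lens_kv, inactive):
    # Fix: work on a copy so the cached counts stay untouched.
    kv_lens = list(cached_token_lens)
    for i, is_inactive in enumerate(inactive):
        if not is_inactive:
            kv_lens[i] += seq_lens_kv[i]
    return kv_lens


# Two requests; this rank is inactive for the first, active for the second.
inactive = [True, False]
seq_lens_kv = [1, 1]

cached_a = [4, 3]
out_a = kv_lens_aliased(cached_a, seq_lens_kv, inactive)

cached_b = [4, 3]
out_b = kv_lens_copied(cached_b, seq_lens_kv, inactive)

print(out_a, cached_a)  # cached_a was modified along with the result
print(out_b, cached_b)  # cached_b still holds the original counts
```

Both variants return the same `kv_lens`, but only the copying variant leaves the caller's cached counts intact for later consumers.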

Comment on lines 345 to 362
self.create_cp_comm()
# Repurpose CP ranks to TP for Helix so that the right comms are created.
mapping_with_helix = None
if self.mapping.cp_size > 1:
    print(f"[MPIDist::__init__] Repurposing CP ranks to TP for Helix.")
    mapping_with_helix = copy.deepcopy(self.mapping)
    mapping_without_helix = Mapping(
        world_size=self.mapping.world_size,
        rank=self.mapping.rank,
        gpus_per_node=self.mapping.gpus_per_node,
        cp_size=1,
        cp_config={},
        tp_size=self.mapping.tp_size * self.mapping.cp_size,
        pp_size=self.mapping.pp_size,
        moe_ep_size=self.mapping.moe_ep_size,
        enable_attention_dp=self.mapping.enable_attention_dp)
    self.mapping = mapping_without_helix

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Avoid deep‑copying Mapping and consider reusing more fields in the Helix repurposed mapping

The temporary repurposing of CP into TP is a good approach for Helix, but two details are worth tightening:

  1. Deep copy of mapping is unnecessary and may desync shared state

    You currently do:

    mapping_with_helix = copy.deepcopy(self.mapping)
    ...
    self.mapping = mapping_with_helix

    This makes MPIDist.mapping refer to a new Mapping instance, different from the original object passed into the engine, which could be surprising if other components rely on identity or mutate the original mapping later.

    Since you don’t mutate the original mapping object here, a plain alias is enough:

    -    mapping_with_helix = None
    -    if self.mapping.cp_size > 1:
    -        ...
    -        mapping_with_helix = copy.deepcopy(self.mapping)
    +    mapping_with_helix = None
    +    if self.mapping.cp_size > 1:
    +        ...
    +        mapping_with_helix = self.mapping

  2. mapping_without_helix only forwards a subset of Mapping kwargs

    Today you only pass world_size, rank, gpus_per_node, {cp,tp,pp}_size, moe_ep_size, and enable_attention_dp. If future code in Mapping starts deriving TP/PP groups from other fields (e.g., pp_partition, moe_cluster_size, moe_tp_size, attn_tp_size, attn_cp_size), mapping_without_helix may no longer faithfully represent the original layout.

    Consider either:

    • adding the missing fields when constructing Mapping(...), or
    • adding a helper on Mapping (e.g., mapping.repurpose_for_helix_tp()) that centralizes this logic.

    That keeps communicator initialization robust to future Mapping changes.
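One way the suggested helper could look is sketched below. Everything here is an assumption for illustration: `MappingSketch` is a small stand-in dataclass, not the real `Mapping` class (which is not known to be a dataclass and has more fields), and `repurpose_for_helix_tp` is the hypothetical method name proposed above. The point is that `dataclasses.replace` carries every field over and overrides only the ones that change.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class MappingSketch:
    # Hypothetical subset of fields; the real Mapping has more.
    world_size: int = 1
    rank: int = 0
    tp_size: int = 1
    pp_size: int = 1
    cp_size: int = 1
    cp_config: dict = field(default_factory=dict)
    moe_tp_size: int = 1
    moe_ep_size: int = 1

    def repurpose_for_helix_tp(self) -> "MappingSketch":
        """Fold CP ranks into TP; all other fields are forwarded as-is."""
        return replace(
            self,
            tp_size=self.tp_size * self.cp_size,
            cp_size=1,
            cp_config={},
        )


original = MappingSketch(world_size=4, rank=1, tp_size=2, cp_size=2,
                         cp_config={"cp_type": "HELIX"}, moe_ep_size=2)
repurposed = original.repurpose_for_helix_tp()
print(repurposed)
```

Because the helper returns a new object, the original mapping needs no deep copy and can be restored by simply keeping a reference to it.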


I'll verify the concerns raised in this review comment by investigating the codebase.


---

🏁 Script executed:

```shell
#!/bin/bash
# Find the Mapping class definition to understand all available fields
rg -n "class Mapping" --type=py
```

Length of output: 101


🏁 Script executed:

#!/bin/bash
# Look at the complete context around lines 345-361
sed -n '330,380p' tensorrt_llm/_torch/distributed/communicator.py

Length of output: 1847


🏁 Script executed:

#!/bin/bash
# Check how mapping_with_helix is used after creation
rg -n "mapping_with_helix" tensorrt_llm/_torch/distributed/communicator.py -A5 -B2

Length of output: 1008


🏁 Script executed:

#!/bin/bash
# Check how mapping_without_helix is used
rg -n "mapping_without_helix" tensorrt_llm/_torch/distributed/communicator.py -A3 -B2

Length of output: 620


🏁 Script executed:

#!/bin/bash
# Check if self.mapping is reassigned or mutated after line 361
sed -n '345,450p' tensorrt_llm/_torch/distributed/communicator.py

Length of output: 3687



---

🏁 Script executed:

```shell
#!/bin/bash
# Get the Mapping class definition and __init__ method to see all fields
sed -n '351,450p' tensorrt_llm/mapping.py
```

Length of output: 1446


🏁 Script executed:

#!/bin/bash
# Look at the complete Mapping class definition including constructor
ast-grep --pattern 'class Mapping {
  $$$
}'

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Use rg to find the __init__ method of Mapping class with full signature
rg -A 40 "class Mapping\(MappingBase\)" tensorrt_llm/mapping.py

Length of output: 572



---

🏁 Script executed:

```shell
#!/bin/bash
# Find the __init__ method in Mapping class
rg -n "def __init__" tensorrt_llm/mapping.py | head -5
```

Length of output: 132


🏁 Script executed:

#!/bin/bash
# Get the full __init__ method of Mapping, starting from line search
sed -n '450,600p' tensorrt_llm/mapping.py

Length of output: 5541


🏁 Script executed:

#!/bin/bash
# Search for Mapping's __init__ with full parameter list
rg -A 60 "class Mapping\(MappingBase\):" tensorrt_llm/mapping.py | grep -A 50 "def __init__"

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Find all __init__ in mapping.py and show which line each one is on
rg -n "def __init__" tensorrt_llm/mapping.py

Length of output: 132


Forward all Mapping constructor parameters when creating mapping_without_helix

The review identifies two valid concerns:

  1. Deep copy is unnecessary: Since the original mapping is never mutated, a simple alias (mapping_with_helix = self.mapping) is sufficient instead of copy.deepcopy(self.mapping).

  2. Missing constructor parameters: The Mapping class accepts 15 parameters, but mapping_without_helix only forwards 9 of them. Missing parameters are:

    • pp_partition
    • moe_cluster_size
    • moe_tp_size
    • attn_tp_size
    • attn_cp_size
    • enable_lm_head_tp_in_adp

Omitting these fields could cause incorrect communication group initialization or future incompatibility as the code evolves. Forward all parameters from the original mapping to ensure the temporary mapping_without_helix has the same configuration.

🧰 Tools
🪛 Ruff (0.14.5)

349-349: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/distributed/communicator.py around lines 345 to 361,
replace the unnecessary deep copy and ensure mapping_without_helix forwards all
Mapping constructor parameters from the original mapping: set mapping_with_helix
= self.mapping (no deepcopy) and construct mapping_without_helix by passing
every field from self.mapping but with cp_size=1 and
tp_size=self.mapping.tp_size * self.mapping.cp_size, explicitly including
pp_partition, moe_cluster_size, moe_tp_size, attn_tp_size, attn_cp_size,
enable_lm_head_tp_in_adp (and any other parameters the Mapping constructor
expects) so the temporary mapping preserves all original settings except the
repurposed CP/TP sizes.

Comment on lines 347 to 371
mapping_with_helix = None
if self.mapping.cp_size > 1:
    print(f"[MPIDist::__init__] Repurposing CP ranks to TP for Helix.")
    mapping_with_helix = copy.deepcopy(self.mapping)
    mapping_without_helix = Mapping(
        world_size=self.mapping.world_size,
        rank=self.mapping.rank,
        gpus_per_node=self.mapping.gpus_per_node,
        cp_size=1,
        cp_config={},
        tp_size=self.mapping.tp_size * self.mapping.cp_size,
        pp_size=self.mapping.pp_size,
        moe_ep_size=self.mapping.moe_ep_size,
        enable_attention_dp=self.mapping.enable_attention_dp)
    self.mapping = mapping_without_helix
self.create_tp_comm()
self.create_pp_comm()
self.create_cp_comm()

# Restore the original mapping.
if mapping_with_helix is not None:
    print(f"[MPIDist::__init__] Restoring original mapping.")
    self.mapping = mapping_with_helix

⚠️ Potential issue | 🟡 Minor

Replace bare print(f"...") with logger and drop unused f prefix

Using print(f"...") here both trips Ruff (F541) and is inconsistent with the rest of this module’s logging style. Suggest switching to the existing logger and removing the unnecessary f:

-        if self.mapping.cp_size > 1:
-            print(f"[MPIDist::__init__] Repurposing CP ranks to TP for Helix.")
+        if self.mapping.cp_size > 1:
+            logger.info("[MPIDist::__init__] Repurposing CP ranks to TP for Helix.")
@@
-        # Restore the original mapping.
-        if mapping_with_helix is not None:
-            print(f"[MPIDist::__init__] Restoring original mapping.")
-            self.mapping = mapping_with_helix
+        # Restore the original mapping.
+        if mapping_with_helix is not None:
+            logger.info("[MPIDist::__init__] Restoring original mapping.")
+            self.mapping = mapping_with_helix

This keeps logs consistent and satisfies the linter.
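
For illustration, the flagged pattern and the suggested replacement can be reproduced in isolation. This is a minimal sketch that uses Python's standard `logging` module as a stand-in for the module's own `logger` (an assumption; the project's logger object may behave differently), and the logger name `helix_demo` is hypothetical.

```python
import io
import logging

# Stand-in for the module-level logger (assumption: stdlib logging).
logger = logging.getLogger("helix_demo")
logger.setLevel(logging.INFO)
logger.propagate = False

# Capture output in memory so the emitted message can be inspected.
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))

# Flagged by Ruff F541: the f prefix has no placeholders to interpolate.
#   print(f"[MPIDist::__init__] Repurposing CP ranks to TP for Helix.")

# Suggested form: a plain string literal passed to the logger.
logger.info("[MPIDist::__init__] Repurposing CP ranks to TP for Helix.")

captured = stream.getvalue()
```

The message text is unchanged; only the output channel and the string prefix differ.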

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        mapping_with_helix = None
-        if self.mapping.cp_size > 1:
-            print(f"[MPIDist::__init__] Repurposing CP ranks to TP for Helix.")
-            mapping_with_helix = copy.deepcopy(self.mapping)
-            mapping_without_helix = Mapping(
-                world_size=self.mapping.world_size,
-                rank=self.mapping.rank,
-                gpus_per_node=self.mapping.gpus_per_node,
-                cp_size=1,
-                cp_config={},
-                tp_size=self.mapping.tp_size * self.mapping.cp_size,
-                pp_size=self.mapping.pp_size,
-                moe_ep_size=self.mapping.moe_ep_size,
-                enable_attention_dp=self.mapping.enable_attention_dp)
-            self.mapping = mapping_without_helix
-        self.create_tp_comm()
-        self.create_pp_comm()
-        self.create_cp_comm()
-        # Restore the original mapping.
-        if mapping_with_helix is not None:
-            print(f"[MPIDist::__init__] Restoring original mapping.")
-            self.mapping = mapping_with_helix
+        mapping_with_helix = None
+        if self.mapping.cp_size > 1:
+            logger.info("[MPIDist::__init__] Repurposing CP ranks to TP for Helix.")
+            mapping_with_helix = copy.deepcopy(self.mapping)
+            mapping_without_helix = Mapping(
+                world_size=self.mapping.world_size,
+                rank=self.mapping.rank,
+                gpus_per_node=self.mapping.gpus_per_node,
+                cp_size=1,
+                cp_config={},
+                tp_size=self.mapping.tp_size * self.mapping.cp_size,
+                pp_size=self.mapping.pp_size,
+                moe_ep_size=self.mapping.moe_ep_size,
+                enable_attention_dp=self.mapping.enable_attention_dp)
+            self.mapping = mapping_without_helix
+        self.create_tp_comm()
+        self.create_pp_comm()
+        # Restore the original mapping.
+        if mapping_with_helix is not None:
+            logger.info("[MPIDist::__init__] Restoring original mapping.")
+            self.mapping = mapping_with_helix
🧰 Tools
🪛 Ruff (0.14.5)

349-349: f-string without any placeholders

Remove extraneous f prefix

(F541)


367-367: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/distributed/communicator.py around lines 347 to 368,
replace the two bare print(f"...") calls with the module logger (e.g.,
logger.info or logger.debug) and remove the unused f-string prefix; keep the
same message text but pass it as a normal string to logger (e.g.,
logger.info("[MPIDist::__init__] Repurposing CP ranks to TP for Helix.") and
logger.info("[MPIDist::__init__] Restoring original mapping.")), ensuring the
module's logger is used consistently and no unused f-strings remain.

@brb-nv brb-nv force-pushed the user/brb/integrate-helix-on-main-redo-mr branch from 6f7ffc7 to ec20a04 Compare November 22, 2025 04:04
@brb-nv brb-nv requested a review from a team as a code owner November 23, 2025 01:17
@brb-nv brb-nv requested a review from chuangz0 November 23, 2025 01:17
@brb-nv brb-nv force-pushed the user/brb/integrate-helix-on-main-redo-mr branch 4 times, most recently from ec9faa5 to 7eabb38 Compare November 23, 2025 03:50
Comment on lines +1599 to +1608
model_config.mapping = Mapping(
    world_size=model_config.mapping.world_size,
    rank=model_config.mapping.rank,
    gpus_per_node=model_config.mapping.gpus_per_node,
    cp_size=1,
    cp_config={},
    tp_size=original_tp_size * original_cp_size,
    pp_size=model_config.mapping.pp_size,
    moe_ep_size=model_config.mapping.moe_ep_size,
    enable_attention_dp=model_config.mapping.enable_attention_dp)
Collaborator

This logic appears multiple times, maybe we can wrap it as a method of Mapping like repurpose_kvp_to_tp?
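
One way to read this suggestion, as a rough sketch only: `MappingSketch` below is a hypothetical stand-in dataclass, not the real `Mapping` class (whose constructor takes more parameters), and `repurpose_cp_to_tp` is a hypothetical method name. Assuming the mapping could be copied field by field, `dataclasses.replace` keeps every field and overrides only the CP/TP sizes, so a field added later is not silently dropped.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class MappingSketch:
    """Hypothetical stand-in; holds only the fields seen in the PR snippet."""
    world_size: int = 1
    rank: int = 0
    gpus_per_node: int = 8
    cp_size: int = 1
    cp_config: dict = field(default_factory=dict)
    tp_size: int = 1
    pp_size: int = 1
    moe_ep_size: int = 1
    enable_attention_dp: bool = False

    def repurpose_cp_to_tp(self) -> "MappingSketch":
        """Return a copy with CP ranks folded into TP; other fields are kept."""
        return replace(self,
                       cp_size=1,
                       cp_config={},
                       tp_size=self.tp_size * self.cp_size)


# Placeholder cp_config content; the real keys are not shown in this thread.
original = MappingSketch(world_size=4, rank=3, cp_size=2, tp_size=2,
                         cp_config={"placeholder": True})
repurposed = original.repurpose_cp_to_tp()
```

Because the helper returns a new object, the original mapping stays available for the attention layers while the repurposed copy can be handed to the modules that need the larger TP group.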

Comment on lines +302 to +308
@click.option("--cp_size",
              type=int,
              default=1,
              help='Context parallelism size.')
Collaborator

Please also add cp_size to trtllm-bench and trtllm-eval

if self.mapping.cp_size > 1:
    logger.info(
        f"[MPIDist::__init__] Repurposing CP ranks to TP for Helix.")
    mapping_with_helix = copy.deepcopy(self.mapping)
Collaborator

Looks like mapping_with_helix is the same as mapping_with_cp. Could we unify the naming?

Collaborator

agreed, +1 to update this to mapping_with_cp here as everything else in mapping refers to it as CP

print(
    f"[DeepseekV3ForCausalLM::__init__] Repurposing KVP ranks to TP while keeping other details the same."
)
self.mapping_with_cp = copy.deepcopy(model_config.mapping)
Collaborator

If I understand correctly, the only difference between mapping_with_cp and mapping is about the repurposed tp_size and cp_size. If so, is it possible to unify the two mapping objects to one (instead of the duplication)?

For example, we can use a subclass HelixMapping which has a flag indicating whether it's "repurposed", and this flag affects the values accessed via mapping.tp_size and mapping.cp_size (probably two properties).
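
A minimal sketch of what such a subclass might look like, under stated assumptions: `BaseMappingSketch` and `HelixMappingSketch` are hypothetical stand-ins with only two fields, and the flag name `repurposed` is illustrative. The real `Mapping` has many more fields and derived groups that would also need to follow the flag.

```python
class BaseMappingSketch:
    """Hypothetical minimal mapping; not the real Mapping class."""

    def __init__(self, tp_size: int, cp_size: int):
        self._tp_size = tp_size
        self._cp_size = cp_size

    @property
    def tp_size(self) -> int:
        return self._tp_size

    @property
    def cp_size(self) -> int:
        return self._cp_size


class HelixMappingSketch(BaseMappingSketch):
    """Same object, two views: the flag decides which sizes callers see."""

    def __init__(self, tp_size: int, cp_size: int):
        super().__init__(tp_size, cp_size)
        self.repurposed = False

    @property
    def tp_size(self) -> int:
        # When repurposed, CP ranks are counted as extra TP ranks.
        if self.repurposed:
            return self._tp_size * self._cp_size
        return self._tp_size

    @property
    def cp_size(self) -> int:
        return 1 if self.repurposed else self._cp_size


mapping = HelixMappingSketch(tp_size=2, cp_size=2)
attn_view = (mapping.tp_size, mapping.cp_size)  # sizes seen with the flag off
mapping.repurposed = True
ffn_view = (mapping.tp_size, mapping.cp_size)   # sizes seen with the flag on
```

The sketch does not address when the flag would be toggled while sub-modules read the mapping during their own `__init__`.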

Collaborator

IIRC, the issue was that the mapping is being passed around quite a bit, and then modules are set up depending on the values in the mapping. So using a sub-class + a repurposed flag may still be quite tricky to get right because it's hard to set the flag at the right time during __init__ of those sub-modules.
If we could easily set the flag, we could have also easily just updated the model_config.mapping or some other mapping object in place here, but unfortunately, it's not that easy.
If you have a suggestion that passes the integration tests, I think we'd be happy to use it!

Collaborator

@chuangz0 chuangz0 left a comment

looks good to me for disagg part

Collaborator

@MatthiasKohl MatthiasKohl left a comment

mainly minor things, overall LGTM

self.num_heads = num_attention_heads
self.num_key_value_heads = num_key_value_heads
self.num_key_value_groups = self.num_heads // self.num_key_value_heads
assert self.num_heads == self.num_key_value_heads, "num_heads must be equal to num_key_value_heads"
Collaborator

I believe we need to remove this again, because latest main has some cases where num_heads and num_key_value_heads are different for DSA, but I'm not 100% sure.

# all_scenarios[15],
# all_scenarios[21],
# all_scenarios[22],
all_scenarios[-1],
Collaborator

do we only want to test the small ctx_len ultimately or is this left over from debugging?

@laikhtewari
Collaborator

Where is usage documented? I don't see any docs in the changed files list

Signed-off-by: Balaram Buddharaju <[email protected]>

add ds-lite tllm-gen based disagg test

Signed-off-by: Matthias Jouanneaux <[email protected]>

initial support for helix parallelism

Signed-off-by: Matthias Jouanneaux <[email protected]>

fixed mapping tests, added working MLA module test, added disagg test for helix (WIP)

Signed-off-by: Matthias Jouanneaux <[email protected]>

Helix MLA module test: added more scenarios, removed unnecessary code

Signed-off-by: Matthias Jouanneaux <[email protected]>

MLA Helix test: restricting number of tests, better output

Signed-off-by: Matthias Jouanneaux <[email protected]>

test MLA helix: remove OOM test scenario

Signed-off-by: Matthias Jouanneaux <[email protected]>

test MLA helix: fix scenario max position embeddings

Signed-off-by: Matthias Jouanneaux <[email protected]>

test Helix MLA: try to fix NaNs

Signed-off-by: Matthias Jouanneaux <[email protected]>

added all-to-all impl

Signed-off-by: Matthias Jouanneaux <[email protected]>

fix thop lib

Signed-off-by: Matthias Jouanneaux <[email protected]>

fix alltoall

Signed-off-by: Matthias Jouanneaux <[email protected]>

attention MLA: remove kv heads (unused), improve heads naming, fix tests

Signed-off-by: Matthias Jouanneaux <[email protected]>

test Helix MLA: minor fixes

Signed-off-by: Matthias Jouanneaux <[email protected]>

test Helix MLA: disable numeric test

Signed-off-by: Matthias Jouanneaux <[email protected]>

test Helix MLA: add TODOs to MLA module

Signed-off-by: Matthias Jouanneaux <[email protected]>

test Helix MLA: fix MLA module

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging

Signed-off-by: Matthias Jouanneaux <[email protected]>

fully working MLA test

Signed-off-by: Matthias Jouanneaux <[email protected]>

attempt to make latent cache work

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging numerical issue

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging numerical issue

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging numerical issue

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging numerical issue

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging numerical issue

Signed-off-by: Matthias Jouanneaux <[email protected]>

adding additional test for further numerical debugging

Signed-off-by: Matthias Jouanneaux <[email protected]>

fixing tests & correction

Signed-off-by: Matthias Jouanneaux <[email protected]>

remove debug output from tests

Signed-off-by: Matthias Jouanneaux <[email protected]>

fix tests

Signed-off-by: Matthias Jouanneaux <[email protected]>

further debugging with multiple sequences

Signed-off-by: Matthias Jouanneaux <[email protected]>

further debugging with multiple sequences

Signed-off-by: Matthias Jouanneaux <[email protected]>

further debugging with multiple sequences

Signed-off-by: Matthias Jouanneaux <[email protected]>

fixed multiple sequences tests

Signed-off-by: Matthias Jouanneaux <[email protected]>

automated review comments

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging of latent cache

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging of latent cache

Signed-off-by: Matthias Jouanneaux <[email protected]>

further debugging of pe values

Signed-off-by: Matthias Jouanneaux <[email protected]>

further debugging of latent cache

Signed-off-by: Matthias Jouanneaux <[email protected]>

fixed latent cache, remove flaky test

Signed-off-by: Matthias Jouanneaux <[email protected]>

better reporting

Signed-off-by: Matthias Jouanneaux <[email protected]>

better reporting

Signed-off-by: Matthias Jouanneaux <[email protected]>

finalized test scenarios

Signed-off-by: Matthias Jouanneaux <[email protected]>

better perf measurements, added graph support

Signed-off-by: Matthias Jouanneaux <[email protected]>

added helix post process kernel

Signed-off-by: Matthias Jouanneaux <[email protected]>

added unit test, minor fix for helix kernel

Signed-off-by: Matthias Jouanneaux <[email protected]>

fixing helix kernels

Signed-off-by: Matthias Jouanneaux <[email protected]>

better tests, minor fixes

Signed-off-by: Matthias Jouanneaux <[email protected]>

better tests, minor fixes

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging helix test

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging helix test

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging helix test

Signed-off-by: Matthias Jouanneaux <[email protected]>

fixed helix post process kernel: main kernel had perf issue/flaw

Signed-off-by: Matthias Jouanneaux <[email protected]>

fixed helix post process test

Signed-off-by: Matthias Jouanneaux <[email protected]>

added helix full layer test

Signed-off-by: Matthias Jouanneaux <[email protected]>

fix full layer helix test/bench

Signed-off-by: Matthias Jouanneaux <[email protected]>

added correct mapping to ds helix

Signed-off-by: Matthias Jouanneaux <[email protected]>

further improvements for fp8 init

Signed-off-by: Matthias Jouanneaux <[email protected]>

debugging quantization config

Signed-off-by: Matthias Jouanneaux <[email protected]>

better debug output

Signed-off-by: Matthias Jouanneaux <[email protected]>

fixes for fp8

Signed-off-by: Matthias Jouanneaux <[email protected]>

fix fp8 runs

Signed-off-by: Matthias Jouanneaux <[email protected]>

attempt to fix fp8 context

Signed-off-by: Matthias Jouanneaux <[email protected]>

fix context phase: just randomly gen kv cache values. fix scenario sizes

Signed-off-by: Matthias Jouanneaux <[email protected]>

fix tp size config in helix layer test

Signed-off-by: Matthias Jouanneaux <[email protected]>

minor changes for test

get trtllm-serve working with BF16 for gen with cp - v_b_proj weight loading needs to be revisited

$ CUDA_VISIBLE_DEVICES=0,1 trtllm-serve /home/scratch.trt_llm_data/llm-models/DeepSeek-V3-Lite/bf16/ --host localhost --port 8002 --cp_size 2 --extra_llm_api_options ./gen_extra-llm-api-config.yaml

end-to-end test in disagg works

$ pytest tests/integration/defs/disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_bf16_tllm_gen_helix -s -v

Switch to contiguous block dist among CP rank

save changes to _merge_requests()

undo changes to prepare_inputs()

Raise exception for blocks fewer than num_cp_ranks

save intermediate changes

attempt to fix attention tests

Signed-off-by: Matthias Jouanneaux <[email protected]>

save changes for minimal test

save minor dev comments

added helix inactive rank option to MLA kernels

Signed-off-by: Matthias Jouanneaux <[email protected]>

pass the right seq_lens_kv - test with seqlen 64 works

$ pytest tests/unittest/_torch/modules/test_mla_helix_expt.py -s -v

is_inactive_helix at request level

cp_allgather for position_id

helix: make inactive rank a bool tensor

Signed-off-by: Matthias Jouanneaux <[email protected]>

undo mapping changes to modeling_deepseek

Failed attempt to replace model_config.mapping

fill in helix_is_inactive for each request

update position_id logic

better way to package mapping - repurpose comms creation too

save disagg gen-only benchmark test

prep for integration test

improvements to position_id, num_cached_tokens_per_seq and tokens_per_block

changes to save blocks at prefill

changes to save blocks at decode

add changes to read KV from disk

updates to save and read KV blocks for all layers

over-allocate at prefill to get cache transmission right

prune saved KV cache files

updates to avoid over-allocation on gen side in disagg

Revert "over-allocate at prefill to get cache transmission right"

This reverts commit af7d000.

save disagg configs for DSV3 - currently goes OOM

verifying tests on 8 GPUs

helix: added (working) DS R1 8-GPU integration test

Signed-off-by: Matthias Jouanneaux <[email protected]>

helix: added large prompt + ds lite config using large prompt

Signed-off-by: Matthias Jouanneaux <[email protected]>

save intermediate changes for fixes

fix debug printing

Signed-off-by: Matthias Jouanneaux <[email protected]>

Mention cache_transceiver_config.max_tokens_in_buffer for disagg servers

save initial changes to benchmarking script

added mjoux specific submit script, tighter timeouts, better defaults

Signed-off-by: Matthias Jouanneaux <[email protected]>

helix slurm: increase timeouts slightly, use deepgemm moe backend for smaller models

Signed-off-by: Matthias Jouanneaux <[email protected]>

helix slurm: add dataset caching path

Signed-off-by: Matthias Jouanneaux <[email protected]>

fix padding when input_len is divisible by tokens_per_block

save changes to test varying prompt len

fix_kvcache_split

Signed-off-by: Chuang Zhu <[email protected]>

avoid fabric memory and print send and recv sizes

auto-determine transceiver size

Signed-off-by: Matthias Jouanneaux <[email protected]>

remove verbose print output

Signed-off-by: Matthias Jouanneaux <[email protected]>

attempt to fix DS R1 run

Signed-off-by: Matthias Jouanneaux <[email protected]>

helix slurm: fix parameters for DS R1 up to 256K tokens

Signed-off-by: Matthias Jouanneaux <[email protected]>

minor updates to reduce memory footprint and bring back warmup

enable cudagraph and add some debug prints

ugly hack to get results with 512k

updates to benchmark 1M seqlen

updates to benchmark 2M seqlen

updates for passing down moe properly

minor changes to get nsys profiles

test helix layer: support for slurm call, support for fp4

Signed-off-by: Matthias Jouanneaux <[email protected]>

test helix layer: added sbatch script

Signed-off-by: Matthias Jouanneaux <[email protected]>

add minimal cache transmission test for 1M seqlen

minor bug fix

changes to benchmark 4M seqlen

skip launch/wait of context servers when TRTLLM_DISAGG_BENCHMARK_GEN_ONLY=1

remove hacks; skip profiling; gpu_mem_frac

test helix layer: fix nvfp4 config to fit high perf mode

Signed-off-by: Matthias Jouanneaux <[email protected]>

helix single layer: improved timing, added arg parsing, added output parsing

Signed-off-by: Matthias Jouanneaux <[email protected]>

helix single layer: add dense option

Signed-off-by: Matthias Jouanneaux <[email protected]>

helix slurm: fix gen_only config, support EP config, add submit script for multiple configs, remove build_wheel by default for array benchmarking

Signed-off-by: Matthias Jouanneaux <[email protected]>

helix slurm: added parse script for results

Signed-off-by: Matthias Jouanneaux <[email protected]>

helix single layer: fixed test, added config submit script, improved parsing

Signed-off-by: Matthias Jouanneaux <[email protected]>

helix single layer: fix segment for sbatch script

Signed-off-by: Matthias Jouanneaux <[email protected]>

helix: fixed TP-only runs (removed hack to make higher seq len work), improved sbatch scripts

Signed-off-by: Matthias Jouanneaux <[email protected]>

helix: fix high node count runs, move back to e2e mode, improve parse script

Signed-off-by: Matthias Jouanneaux <[email protected]>

longer prompt for DSV3 Lite & DSR1 FP4 integration test

disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_bf16_tllm_gen_helix
disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_fp8_tllm_gen_helix
disaggregated/test_disaggregated.py::test_disaggregated_deepseek_r1_fp4_tllm_gen_helix

helix: added initial README for testing/benchmarking

Signed-off-by: Matthias Jouanneaux <[email protected]>

helix slurm: remove references to internal clusters

Signed-off-by: Matthias Jouanneaux <[email protected]>

minor updates to README

minor updates

helix: improve transpose/split for alltoall

Signed-off-by: Matthias Jouanneaux <[email protected]>

Revert "helix: improve transpose/split for alltoall"

This reverts commit c8b24b9.

helix: improve alltoall perf

Signed-off-by: Matthias Jouanneaux <[email protected]>

[https://nvbugs/5495789][feat] Optionally disable server GC and worker GC (NVIDIA#7995)

Signed-off-by: Tailing Yuan <[email protected]>

save changes for custom logging

redo cherry-pick of attention.py

save more changes for build and pipe-cleaning

save more changes

clean up - 1

clean up - 2

reuse mla_tensor_params instead of using helix_tensor_params

undo all_tp_rank_num_tokens

update test_disaggregated.py

updates to dsv3RopeOp

more cleanup

save fp8 disagg test

[https://nvbugs/5637012][fix] Fix helix unit tests

Signed-off-by: Balaram Buddharaju <[email protected]>

minor updates to attention.py

updates to test - seqlen 64 works

get integration test working
@brb-nv brb-nv force-pushed the user/brb/integrate-helix-on-main-redo-mr branch from 7eabb38 to 3d11205 Compare November 25, 2025 20:43