[None][feat] Integrate helix parallelism #9342

brb-nv · 2025-11-20T20:41:18Z

Description

This MR integrates helix parallelism, an experimental feature, in TRTLLM.

Background:

Helix parallelism is a decode-only context parallelism (CP) method. It is therefore used in a disaggregated setting, where only the generation (gen) servers run helix.
It shards a request's sequence length (seqlen) across multiple CP ranks.
For a given query token in the decode phase, each CP rank computes "local attention" over the previous tokens it holds.
The CP ranks then communicate to "correct" the local attention so that the combined attention result is exact.
Because KV parallelism applies only to the attention layer, the CP GPUs are "repurposed" as TP (tensor parallel) GPUs for the FFN layer.
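
To illustrate the "local attention plus correction" idea, here is a minimal pure-Python sketch. It is not the kernel code from this MR: the helper names `local_attention` and `combine` are hypothetical, and it assumes the correction is the standard log-sum-exp rescaling of per-rank softmax outputs.

```python
import math


def local_attention(q, keys, values):
    """Softmax attention of one query over a local KV shard.

    Returns the locally normalized output and the log-sum-exp (LSE) of the
    scores; the LSE is the statistic needed to correct the output later.
    """
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom
           for d in range(dim)]
    return out, m + math.log(denom)


def combine(partials):
    """Merge per-rank (output, lse) pairs into the exact attention output."""
    lse_max = max(lse for _, lse in partials)
    scales = [math.exp(lse - lse_max) for _, lse in partials]
    total = sum(scales)
    dim = len(partials[0][0])
    return [sum(s * out[d] for s, (out, _) in zip(scales, partials)) / total
            for d in range(dim)]


# Toy data (made up for illustration): 7 cached tokens split 4 / 3 across two CP ranks.
q = [0.5, -1.0]
keys = [[0.1 * i, 0.2 * (i % 3)] for i in range(7)]
values = [[float(i), float(-i)] for i in range(7)]

full, _ = local_attention(q, keys, values)        # single-rank reference
rank0 = local_attention(q, keys[:4], values[:4])  # local attention on rank 0
rank1 = local_attention(q, keys[4:], values[4:])  # local attention on rank 1
merged = combine([rank0, rank1])                  # the "correction" step

# The two results differ only by floating-point rounding.
print(max(abs(a - b) for a, b in zip(full, merged)))
```

Each rank only has to exchange its output vector and one scalar per query, which is why the combined result can be exact without gathering the full KV cache on one rank.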

Changes in this MR:

At a high level, we enable helix parallelism with DeepseekV3 and add a disaggregated integration test (a smoke test for now).
An example to explain the core changes:
- Suppose we are handling the first decode step for a request with ISL 7, and the gen server has two-way context parallelism, i.e. cpSize=2.
- Say the first 4 tokens reside on cpRank0 and the next 3 tokens reside on cpRank1.
- We have an incoming query token, q7 (corresponding to the first generated token). Local attention for q7 is computed on both cpRanks, but its KV cache entry is written to only one cpRank (rank 1 in the example), and kv7 is included in local attention only on that rank. We call this rank the "active helix rank".
Known limitation: Currently only the last CP rank is treated as the active rank. This will be lifted in a follow-up MR.
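
The example can be traced with a small sketch. The helpers `shard_prompt` and `decode_step` are hypothetical and only mirror the description above: a contiguous 4/3 split is assumed to match the example, and the real request-merging code may partition tokens differently.

```python
def shard_prompt(isl, cp_size):
    """Split token indices 0..isl-1 into contiguous per-rank chunks.

    Earlier ranks receive the extra token when isl is not divisible by
    cp_size (an assumption chosen to reproduce the 4 / 3 split above).
    """
    base, extra = divmod(isl, cp_size)
    shards, start = [], 0
    for rank in range(cp_size):
        size = base + (1 if rank < extra else 0)
        shards.append(list(range(start, start + size)))
        start += size
    return shards


def decode_step(shards, new_token, cp_size):
    """Append the new token's KV entry on the active rank only."""
    active_rank = cp_size - 1  # known limitation: only the last CP rank is active
    kv_lens = []
    for rank, shard in enumerate(shards):
        if rank == active_rank:
            shard.append(new_token)  # KV cache write happens here only
        kv_lens.append(len(shard))   # tokens seen by local attention on this rank
    return active_rank, kv_lens


shards = shard_prompt(isl=7, cp_size=2)
active_rank, kv_lens = decode_step(shards, new_token=7, cp_size=2)
print(shards)       # per-rank token indices after the first decode step
print(active_rank)  # index of the active helix rank
print(kv_lens)      # per-rank KV lengths used for local attention
```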

Most changes in this MR enforce this:

- KV cache for the query token is added only on the active rank, in resource_manager.py.
- The actual KV cache write happens in the MLA RoPE kernels; the kernel changes skip the write on inactive ranks.
- The number of tokens considered in local attention is determined by seq_len_kv in trtllm.py, which is adjusted accordingly.

"Repurposing" attention CP ranks as FFN TP ranks can get messy. To keep this readable:

- We pass a mapping with CP only to the attention layers in modeling_deepseekv3.py and a mapping without CP to the rest.
- We use a similar trick in communicator.py to obtain the right TP groups.
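
A rough sketch of the "two mappings" idea follows. `SimpleMapping` and `repurpose_cp_to_tp` are hypothetical stand-ins, not the real `Mapping` class; the enlarged TP size (tp * cp) follows the description, while the rank layout is purely an assumption for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SimpleMapping:
    """Hypothetical stand-in for a parallelism mapping; not the real Mapping class."""
    tp_size: int
    cp_size: int
    tp_rank: int
    cp_rank: int


def repurpose_cp_to_tp(mapping):
    """Return a CP-free mapping in which the CP ranks act as extra TP ranks."""
    return replace(
        mapping,
        tp_size=mapping.tp_size * mapping.cp_size,
        cp_size=1,
        # Assumed rank layout: CP ranks are adjacent within each TP rank.
        tp_rank=mapping.tp_rank * mapping.cp_size + mapping.cp_rank,
        cp_rank=0,
    )


mapping_with_cp = SimpleMapping(tp_size=2, cp_size=2, tp_rank=1, cp_rank=1)
mapping_without_cp = repurpose_cp_to_tp(mapping_with_cp)

print(mapping_with_cp)     # attention layers would receive this one
print(mapping_without_cp)  # FFN and the remaining layers would receive this one
```

Building a new object instead of mutating the original keeps the CP-aware mapping intact, so there is nothing to restore afterwards.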

Test Coverage

$ pytest tests/unittest/_torch/modules/test_mla_helix.py -s -v
$ TRTLLM_USE_UCX_KVCACHE=1 TLLM_LOG_LEVEL=INFO pytest tests/integration/defs/disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_bf16_tllm_gen_helix -s -v

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

Summary by CodeRabbit

Release Notes

New Features
- Added context parallelism support with Helix-based distributed inference capabilities
- DeepSeekV3 model now supports context parallelism for enhanced performance on multi-GPU setups
- New --cp_size command-line argument for configuring context parallel size (default: 1)
- Enhanced disaggregated serving configuration for context-tensor parallel distribution
Tests
- Added new test configuration for disaggregated DeepSeekV3 inference with context parallelism

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py

coderabbitai · 2025-11-21T18:05:07Z

📝 Walkthrough

This pull request implements context parallelism support with Helix configuration across the TensorRT-LLM inference stack. It adds per-rank inactivity tracking (helix_is_inactive_rank) to CUDA kernels and Python layers, introduces CP size configuration parameters, implements mapping repurposing logic for CP/TP distribution, and extends model initialization and executor logic to handle inactive Helix ranks during generation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| CUDA Kernel Signatures `cpp/tensorrt_llm/kernels/mlaKernels.cu`, `cpp/tensorrt_llm/kernels/mlaKernels.h` | Added `helix_is_inactive_rank` boolean pointer parameter to MLA rope generation kernel signatures; threaded through kernel invocations to gate token processing and K/V updates based on rank inactivity status. |
| Tensor Operations & Rope Generation `cpp/tensorrt_llm/thop/attentionOp.cpp`, `cpp/tensorrt_llm/thop/dsv3RopeOp.cpp` | Extended MLA tensor parameter handling to expect and forward two tensors (helix_position_offsets, helix_is_inactive_rank); added new field to MlaRopeGenArgs struct and propagated inactive rank mask through rope generation pipelines. |
| Torch Attention Backend `tensorrt_llm/_torch/attention_backend/trtllm.py` | Added `helix_position_offsets` and `helix_is_inactive_rank` to plan/forward/mla_rope_generation APIs; extended TrtllmAttentionMetadata with inactive rank tracking; adjusted KV length planning to exclude inactive rank contributions. |
| Distributed Communication `tensorrt_llm/_torch/distributed/communicator.py` | Implemented early CP communicator creation and mapping repurposing logic: when cp_size > 1, creates a copy with Helix mapping, scales TP by CP size, and restores original mapping after TP/PP communicator initialization. |
| Model Architecture `tensorrt_llm/_torch/models/modeling_deepseekv3.py` | Extended DeepseekV3 layer constructors with optional `mapping_with_cp` parameter; added CP rank/size extraction and weight-split logic for KV projection; implemented mapping repurposing during model initialization for cp_size > 1. |
| Attention Modules `tensorrt_llm/_torch/modules/attention.py` | Added `mapping_with_cp` parameter to MLA and Attention constructors; enforced num_heads equality and Helix CP type validation; updated forward paths to propagate helix parameters and support position_ids threading. |
| Executor & Resource Management `tensorrt_llm/_torch/pyexecutor/executor_request_queue.py`, `tensorrt_llm/_torch/pyexecutor/llm_request.py`, `tensorrt_llm/_torch/pyexecutor/model_engine.py`, `tensorrt_llm/_torch/pyexecutor/resource_manager.py` | Added `py_helix_is_inactive_rank` flag to LlmRequest; implemented helix inactive rank tracking in model engine with conditional position/token calculations; gated KV cache allocation for inactive ranks in resource manager; extended AttentionMetadata with inactive rank exposure. |
| CLI & Configuration `examples/llm-api/quickstart_advanced.py`, `tensorrt_llm/commands/serve.py` | Added `--cp_size` and `cp_config` command-line arguments; propagated context_parallel_size through LLM initialization; implemented cp_type string-to-enum conversion with validation. |
| Infrastructure & Mapping `tensorrt_llm/llmapi/disagg_utils.py`, `tensorrt_llm/mapping.py` | Updated instance rank calculation to include context_parallel_size; added hardcoded Helix CP type fallback when cp_size > 1 to override externally provided cp_config. |
| Test Infrastructure `tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2_gentp1cp2_deepseek_v3_lite_bf16_tllm_gen.yaml`, `tests/integration/defs/disaggregated/test_disaggregated.py` | Added new disaggregated test configuration file for context TP and generation Helix setup; introduced test_disaggregated_deepseek_v3_lite_bf16_tllm_gen_helix test case with model symlink setup. |

Sequence Diagram(s)

sequenceDiagram
    participant Request
    participant ResourceMgr as Resource<br/>Manager
    participant ModelEngine
    participant AttentionBE as Attention<br/>Backend
    participant MLAKernel as MLA<br/>Kernel

    Request->>ResourceMgr: prepare_resources()
    activate ResourceMgr
    alt cp_size > 1 and not last rank
        ResourceMgr->>ResourceMgr: mark py_helix_is_inactive_rank=true
        ResourceMgr->>ResourceMgr: skip KV cache allocation
    else active rank
        ResourceMgr->>ResourceMgr: allocate KV cache normally
    end
    deactivate ResourceMgr

    Request->>ModelEngine: forward pass (generation)
    activate ModelEngine
    alt helix_is_inactive_rank[batch]==true
        ModelEngine->>ModelEngine: fix past_seen_token_num<br/>(no increment)
        ModelEngine->>ModelEngine: skip token processing
    else active
        ModelEngine->>ModelEngine: increment past_seen_token_num
        ModelEngine->>AttentionBE: plan() with helix params
    end
    deactivate ModelEngine

    AttentionBE->>AttentionBE: adjust kv_lens planning<br/>(exclude inactive ranks)
    AttentionBE->>MLAKernel: mla_rope_generation<br/>(helix_is_inactive_rank)
    activate MLAKernel
    alt helix_is_inactive_rank[batch]==true
        MLAKernel->>MLAKernel: skip token processing
        MLAKernel->>MLAKernel: skip K/V updates
    else active
        MLAKernel->>MLAKernel: apply rope & assign QKV
        MLAKernel->>MLAKernel: update K/V cache
    end
    deactivate MLAKernel

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring extra attention:

- Mapping repurposing logic (communicator.py, modeling_deepseekv3.py, mapping.py): Core logic for switching between CP and TP distributions; mutations and restorations must be correctly sequenced and scoped to avoid state leaks.
- KV length planning adjustments (trtllm.py, model_engine.py): Changes to how KV cache lengths are calculated when inactive ranks are present; verify accounting is correct for all rank states.
- Warmup control flow (model_engine.py): Conditional position_id and past_seen_token_num calculations based on warmup state and inactivity; ensure all branches are consistent.
- Cross-layer parameter threading (executor_request_queue.py, model_engine.py, resource_manager.py): helix_is_inactive_rank flows through multiple abstraction layers; verify end-to-end propagation and type conversions (bool → tensor → pointer).
- Model initialization side effects (modeling_deepseekv3.py): Temporary mapping mutations during model construction; verify original mapping is reliably restored even on error paths.

Suggested reviewers

schetlur-nv
nvchenghaoz
Shixiaowei02
Superjomn
Tabrizian
Funatiq
QiJune

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 34.04%, which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title '[None][feat] Integrate helix parallelism' directly and accurately summarizes the main change, integrating helix parallelism support into TensorRT-LLM, which is clearly the primary focus of this PR. |
| Description check | ✅ Passed | The PR description provides a comprehensive explanation of helix parallelism, background context, specific implementation details with an example, test coverage commands, and confirmation of the PR checklist. |

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)

tensorrt_llm/mapping.py (1)
455-491: Don’t silently override cp_config to HELIX for any cp_size > 1

This block in Mapping.__init__:
#################################################################
# TODO: Remove this hardcoding and obtain cp_config from llm_args.
if cp_size > 1:
    cp_config = {"cp_type": CpType.HELIX}
#################################################################
has broad side effects:

Any caller that provides a non-Helix cp_config (e.g. STAR or ULYSSES) with cp_size > 1 now gets that configuration silently discarded and treated as HELIX.

Code that branches on cp_config["cp_type"] (e.g. _merge_requests in executor_request_queue.py, STAR attention paths, etc.) will never see CpType.STAR/ULYSSES once cp_size > 1, effectively breaking those CP modes.

Additional cp_config fields (like STAR’s block_size / cp_anchor_size, or future Helix parameters) are lost.

If the intent is “for now we only support Helix when cp_size > 1”, it’s safer to:

Only inject a default when cp_config is missing; and

Fail fast on conflicting configs instead of overriding them:
# Temporary default until cp_config is fully plumbed from llm_args.
if cp_size > 1:
    if cp_config is None:
        cp_config = {"cp_type": CpType.HELIX}
    elif cp_config.get("cp_type") != CpType.HELIX:
        raise ValueError(
            f"Only CpType.HELIX is currently supported when cp_size > 1; got {cp_config.get('cp_type')!r}"
        )
That keeps Helix as the only supported multi-CP mode in this PR, but avoids surprising behavior for existing STAR/ULYSSES configs and makes future extension to other CP types straightforward.
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
1568-1623: Tighten helix_is_inactive_rank initialization guard; verify warmup dummy request semantics for Helix

The new Helix logic is mostly sound, but there is one definite initialization bug and one edge case to confirm:
helix_is_inactive_rank initialization guard is incorrect

The current initialization at line 1568:
helix_is_inactive_rank = [] if self.mapping.cp_size > 1 else None
initializes to an empty list for all CP configurations with cp_size > 1, but has_cp_helix() returns True only when both cp_size > 1 and cp_type == CpType.HELIX. For non-Helix CP types (e.g., regular CP or other variants), this creates an empty list that never gets populated, diverging from the None state that downstream consumers expect when Helix is disabled.

Fix: Change line 1568 to:
helix_is_inactive_rank = [] if self.mapping.has_cp_helix() else None
Warmup + Helix: past_seen_token_num override semantics need verification

During warmup, you correctly skip the position_id computation (line 1605), but past_seen_token_num is unconditionally overridden based on request.orig_prompt_len (lines 1608–1612) whenever Helix is active. This is fed into num_cached_tokens_per_seq, which becomes part of KVCacheParams. For dummy warmup requests, ensure:

orig_prompt_len is consistently initialized for all dummy request types created during warmup, and

the resulting KV cache index values remain within valid bounds on inactive Helix ranks.

Per-request inactivity flag wiring looks correct

The per-beam append pattern (lines 1572–1619) produces a helix_is_inactive_rank list with length equal to the total batch size (sum of beam widths), which matches the attention backend's [batch_size] expectation.
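
As a rough illustration of the two points above (a `None` sentinel when Helix is off, and one flag per beam), here is a hedged sketch. `build_inactive_flags` and the `(is_inactive, beam_width)` tuples are hypothetical and do not reproduce the actual `model_engine.py` code.

```python
def build_inactive_flags(requests, has_cp_helix):
    """Build the per-sequence inactivity list.

    requests: iterable of (is_inactive_rank, beam_width) pairs; this shape is
    a hypothetical simplification of the real request objects.
    """
    if not has_cp_helix:
        return None  # Helix disabled: downstream code sees None, not []
    flags = []
    for is_inactive, beam_width in requests:
        flags.extend([is_inactive] * beam_width)  # one entry per beam
    return flags


# On an inactive rank every request carries the flag; beam widths 1 and 3.
batch = [(True, 1), (True, 3)]
flags = build_inactive_flags(batch, has_cp_helix=True)
print(flags)                                            # one flag per sequence
print(build_inactive_flags(batch, has_cp_helix=False))  # Helix disabled
```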
2526-2537: Behavioral inconsistency confirmed: ULYSSES passes warmup checks but fails at runtime

The review concern is valid. The change introduces breaking behavior across three PyExecutor methods:

model_engine.py._prepare_inputs (line 2536): Raises for non-STAR/HELIX

executor_request_queue.py._merge_requests (line 725): Raises for non-STAR/HELIX

py_executor.py._update_request_states (line 2072): Raises for non-STAR/HELIX

The critical inconsistency:

Warmup check (model_engine.py line 564) accepts ULYSSES and returns early

Runtime execution (line 2536) raises NotImplementedError if ULYSSES reaches _prepare_inputs

This means that if someone configures PyExecutor with cp_type=ULYSSES, it will pass initialization but crash during inference.

ULYSSES is defined in the CpType enum and explicitly referenced at line 564, indicating it was intended to be handled. However, no fallback path exists in the three runtime dispatch methods, and no test coverage was found for ULYSSES with PyExecutor. The previous behavior would have silently fallen through to the default _prepare_tp_inputs path.

While there's no evidence that existing code uses ULYSSES with PyExecutor, the enum inclusion and warmup-time acceptance create an expectation of support that the runtime contradicts.
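
One way to avoid this kind of warmup/runtime mismatch is to route every check through a single helper. The sketch below is only an illustration: the `CpType` enum here is a local stand-in, and `SUPPORTED_CP_TYPES` and `validate_cp_type` are hypothetical names that do not exist in the codebase.

```python
from enum import Enum, auto


class CpType(Enum):
    """Local stand-in for the real CpType enum (members assumed for illustration)."""
    ULYSSES = auto()
    STAR = auto()
    HELIX = auto()


# Single source of truth for what the executor can actually run.
SUPPORTED_CP_TYPES = frozenset({CpType.STAR, CpType.HELIX})


def validate_cp_type(cp_type):
    """Raise early, with the same message, wherever a CP type is dispatched."""
    if cp_type not in SUPPORTED_CP_TYPES:
        raise NotImplementedError(
            f"CP type {cp_type.name} is not supported by this executor")
    return cp_type


validate_cp_type(CpType.HELIX)  # accepted
try:
    validate_cp_type(CpType.ULYSSES)
except NotImplementedError as err:
    print(err)  # the same error would surface at warmup and at runtime
```

Calling the same helper from the warmup path and from each runtime dispatch point would make an unsupported configuration fail at initialization instead of mid-inference.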
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
1446-1451: Fix TP sharding after restoring the original mapping

During DeepseekV3ForCausalLM.__init__ we repurpose CP ranks into TP by installing a temporary Mapping (tp_size = tp * cp). All decoder/MTP modules capture that object via self.mapping. Later we restore model_config.mapping back to the original CP-aware mapping. Here in DeepseekV3MTP.forward, the chunking uses self.model_config.mapping.tp_size/tp_rank, which now point to the restored mapping and no longer match the row-parallel tensors created with the repurposed mapping. On Helix runs (cp_size > 1) this leaves each rank feeding the wrong slice (or no slice) into eh_proj, breaking generation.

Use the same mapping object that the layer captured during init. A minimal fix:
-        tp_size = self.model_config.mapping.tp_size
-        tp_rank = self.model_config.mapping.tp_rank
+        tp_size = self.mapping.tp_size
+        tp_rank = self.mapping.tp_rank
That keeps the MTP sharding consistent with the repurposed TP groups.

♻️ Duplicate comments (1)

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (1)
645-705: Helix merge: avoid hardcoded tokens_per_block and ensure total_input_len_cp is available on children

Two points in the Helix path:

Hardcoded tokens_per_block=32
elif cp_type == CpType.HELIX:
    return self._merge_helix_requests(
        new_requests,
        tokens_per_block=32)
        # tokens_per_block=cp_config['tokens_per_block'])
This ignores any configured Helix block size (e.g. via cp_config['tokens_per_block'] or KV cache config) and makes the behavior fragile if someone changes the configured block size away from 32.

It also repeats a TODO you already noted to remove this hardcoding.

Suggestion:

Prefer pulling from config with a safe default + assert, e.g.:
tokens_per_block = cp_config.get('tokens_per_block', 32)
assert tokens_per_block > 0
return self._merge_helix_requests(new_requests, tokens_per_block=tokens_per_block)
or at minimum assert that a configured value, if present, matches 32 so misconfigurations fail loudly instead of silently diverging.

total_input_len_cp not propagated to child requests
req = executor_request_to_llm_request(...)
req.total_input_len_cp = input_len
req_with_children.append(req)
if req.child_requests:
    req_with_children.extend(req.child_requests)
executor_request_to_llm_request creates child requests via LlmRequest.create_child_request, which only copies attributes whose names start with py_.

As a result, total_input_len_cp exists only on the parent; any downstream code that expects this attribute on every LlmRequest (including children when num_return_sequences > 1) will not find it.

Possible fix:

Either rename to follow the py_ convention so it’s auto-copied:
req.py_total_input_len_cp = input_len
for child in req.child_requests:
    child.py_total_input_len_cp = input_len
Or, if you deliberately want a non-py_ attribute, explicitly set it on children in this loop.

This will keep Helix metadata consistent across parent and child requests and future-proof the code against differing tokens_per_block configs.
#!/bin/bash
# Check how Helix-related fields are used so they stay consistent.
rg -n "total_input_len_cp" -C3
rg -n "tokens_per_block" tensorrt_llm/_torch/pyexecutor -C3
Also applies to: 710-723
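
The attribute-copy behavior described above can be reproduced with a toy class. `FakeRequest` is a hypothetical stand-in that only mimics the "copy attributes starting with py_" rule; it is not the real `LlmRequest`.

```python
class FakeRequest:
    """Hypothetical stand-in that mimics the py_-prefix copy rule."""

    def __init__(self):
        self.child_requests = []

    def create_child_request(self):
        child = FakeRequest()
        # Only attributes named py_* are carried over to the child.
        for name, value in vars(self).items():
            if name.startswith("py_"):
                setattr(child, name, value)
        self.child_requests.append(child)
        return child


parent = FakeRequest()
parent.py_copied = 1
parent.not_copied = 2
child = parent.create_child_request()

# Attributes set after the children exist are never copied automatically,
# regardless of their prefix, so they have to be set on each child explicitly.
parent.py_total_input_len_cp = 7
missing_before_fix = not hasattr(child, "py_total_input_len_cp")
for c in parent.child_requests:
    c.py_total_input_len_cp = parent.py_total_input_len_cp

print(hasattr(child, "py_copied"), hasattr(child, "not_copied"))
print(missing_before_fix, child.py_total_input_len_cp)
```

Because the children are created before the attribute is assigned, the explicit loop is needed even when the attribute follows the py_ naming convention.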

🧹 Nitpick comments (5)

tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

440-449: Decode-time KV allocation correctly gated on active Helix rank

Marking req.py_helix_is_inactive_rank on non-last CP ranks when has_cp_helix() and skipping add_token there ensures only the active Helix rank allocates decode-time KV cache, which matches the design.

You might consider using mapping.is_last_cp_rank() (and/or setting this flag once at request construction) for slightly clearer intent, but the current logic is functionally sound.

examples/llm-api/quickstart_advanced.py (1)

71-76: cp_size flag and context_parallel_size wiring are consistent

The new --cp_size argument and its use as context_parallel_size=args.cp_size in the LLM constructor align with the new CP plumbing. The change is self-contained and doesn’t affect existing callers.

Optionally, you might extend the help string for --cp_size to mention that multi-CP currently implies Helix in this flow, so users know what they’re opting into.

Also applies to: 261-264

tests/integration/defs/disaggregated/test_disaggregated.py (1)

154-274: New DeepSeek V3 Lite bf16 Helix disaggregated test wiring looks consistent

The new config entry and test_disaggregated_deepseek_v3_lite_bf16_tllm_gen_helix follow the same symlink + run_disaggregated_test pattern as the existing DeepSeek tests, and the test_desc string matches the key added to config_map, so the wiring looks correct.

If you want to silence Ruff’s ARG001 warning, you could rename disaggregated_test_root to _disaggregated_test_root in the new test (or add a # noqa: ARG001), but that’s cosmetic and consistent with the rest of this file.

Also applies to: 1915-1933
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
12-13: Remove duplicate LlmRequest import

LlmRequest is imported twice in this file (here and again at line 62). You can safely drop the earlier import and keep the one that also brings in get_draft_token_length:
-from .llm_request import LlmRequest
-
 import torch
This keeps imports minimal without changing behavior.
tensorrt_llm/commands/serve.py (1)
5-5: Drop the duplicate gc import
Line 2 already imports gc, so this second import triggers Ruff F811 (Redefinition of unused gc). Please drop the extra line to keep lint happy.
-import gc

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 9b2abb8 and 50436a1.

📒 Files selected for processing (18)

cpp/tensorrt_llm/kernels/mlaKernels.cu (4 hunks)
cpp/tensorrt_llm/kernels/mlaKernels.h (1 hunks)
cpp/tensorrt_llm/thop/attentionOp.cpp (2 hunks)
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp (6 hunks)
examples/llm-api/quickstart_advanced.py (2 hunks)
tensorrt_llm/_torch/attention_backend/trtllm.py (9 hunks)
tensorrt_llm/_torch/distributed/communicator.py (2 hunks)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (12 hunks)
tensorrt_llm/_torch/modules/attention.py (10 hunks)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (3 hunks)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
tensorrt_llm/_torch/pyexecutor/model_engine.py (5 hunks)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1 hunks)
tensorrt_llm/commands/serve.py (7 hunks)
tensorrt_llm/llmapi/disagg_utils.py (1 hunks)
tensorrt_llm/mapping.py (1 hunks)
tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2_gentp1cp2_deepseek_v3_lite_bf16_tllm_gen.yaml (1 hunks)
tests/integration/defs/disaggregated/test_disaggregated.py (2 hunks)

🧰 Additional context used

🧠 Learnings (27)

📓 Common learnings

Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.

📚 Learning: 2025-08-14T15:43:23.107Z

Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/_torch/attention_backend/trtllm.py
tensorrt_llm/_torch/modules/attention.py

📚 Learning: 2025-08-14T15:38:01.771Z

Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/_torch/attention_backend/trtllm.py

📚 Learning: 2025-09-29T15:14:28.503Z

Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/llmapi/disagg_utils.py
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py

📚 Learning: 2025-08-14T21:04:50.248Z

Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp
cpp/tensorrt_llm/kernels/mlaKernels.cu
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/attention_backend/trtllm.py

📚 Learning: 2025-08-15T06:46:53.813Z

Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/attention_backend/trtllm.py

📚 Learning: 2025-08-09T20:57:04.084Z

Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp

📚 Learning: 2025-08-14T06:36:40.701Z

Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/llmapi/disagg_utils.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py

📚 Learning: 2025-08-26T06:07:02.166Z

Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
tensorrt_llm/_torch/pyexecutor/model_engine.py

📚 Learning: 2025-09-23T14:58:05.372Z

Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.

Applied to files:

tensorrt_llm/llmapi/disagg_utils.py
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
tensorrt_llm/_torch/pyexecutor/resource_manager.py

📚 Learning: 2025-08-19T12:45:11.997Z

Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Weight-Decomposed Low-Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/attention_backend/trtllm.py

📚 Learning: 2025-09-02T13:42:44.885Z

Learnt from: pcastonguay
Repo: NVIDIA/TensorRT-LLM PR: 7455
File: tensorrt_llm/_torch/pyexecutor/py_executor.py:1852-1860
Timestamp: 2025-09-02T13:42:44.885Z
Learning: In MPI communication within TensorRT-LLM pipeline parallelism, different communication types (tokens, logits, termination sync) must use disjoint tag namespaces to avoid message routing collisions when using the same source/destination patterns.

Applied to files:

tensorrt_llm/_torch/distributed/communicator.py

📚 Learning: 2025-07-28T17:06:08.621Z

Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

tests/integration/defs/disaggregated/test_disaggregated.py
tensorrt_llm/_torch/pyexecutor/model_engine.py

📚 Learning: 2025-09-09T09:40:45.658Z

Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

tests/integration/defs/disaggregated/test_disaggregated.py

📚 Learning: 2025-08-01T15:14:45.673Z

Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

tests/integration/defs/disaggregated/test_disaggregated.py

📚 Learning: 2025-08-08T22:03:40.707Z

Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

cpp/tensorrt_llm/kernels/mlaKernels.cu

📚 Learning: 2025-09-23T15:01:00.070Z

Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.

Applied to files:

cpp/tensorrt_llm/kernels/mlaKernels.cu

📚 Learning: 2025-08-21T02:39:12.009Z

Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

cpp/tensorrt_llm/kernels/mlaKernels.cu

📚 Learning: 2025-08-15T06:46:54.897Z

Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

cpp/tensorrt_llm/kernels/mlaKernels.cu
tensorrt_llm/_torch/pyexecutor/resource_manager.py

📚 Learning: 2025-09-23T15:12:38.312Z

Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.

Applied to files:

tensorrt_llm/mapping.py

📚 Learning: 2025-08-21T09:41:49.347Z

Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

tensorrt_llm/_torch/pyexecutor/resource_manager.py

📚 Learning: 2025-08-20T06:48:45.368Z

Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.

Applied to files:

tensorrt_llm/_torch/pyexecutor/resource_manager.py

📚 Learning: 2025-08-06T13:58:07.506Z

Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

tensorrt_llm/_torch/pyexecutor/model_engine.py

📚 Learning: 2025-08-26T09:37:10.463Z

Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

tensorrt_llm/commands/serve.py

📚 Learning: 2025-08-26T09:37:10.463Z

Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

tensorrt_llm/commands/serve.py

📚 Learning: 2025-08-27T14:23:55.566Z

Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.566Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.

Applied to files:

tensorrt_llm/_torch/modules/attention.py

🧬 Code graph analysis (12)

tensorrt_llm/_torch/distributed/communicator.py (2)

tensorrt_llm/mapping.py (3)

Mapping (351-515)

rank (187-188)

rank (191-198)

tensorrt_llm/llmapi/llm_args.py (2)

world_size (459-460)

world_size (469-473)

tests/integration/defs/disaggregated/test_disaggregated.py (2)

tests/integration/defs/conftest.py (4)

disaggregated_test_root (2618-2623)

disaggregated_example_root (285-290)

llm_venv (702-719)

deepseek_v3_model_root (616-631)

tests/integration/defs/local_venv.py (1)

get_working_directory (43-49)

cpp/tensorrt_llm/kernels/mlaKernels.cu (1)

cpp/tensorrt_llm/kernels/mlaKernels.h (2)

helix_position_offsets (109-134)

helix_is_inactive_rank (112-113)

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (2)

tensorrt_llm/_torch/distributed/communicator.py (5)

tp_size (64-65)

has_pp (52-53)

cp_size (56-57)

rank (40-41)

rank (457-458)

tensorrt_llm/mapping.py (3)

has_pp (258-259)

rank (187-188)

rank (191-198)

tensorrt_llm/mapping.py (1)

tensorrt_llm/_torch/distributed/communicator.py (2)

cp_size (56-57)

cp_config (108-109)

tensorrt_llm/_torch/pyexecutor/resource_manager.py (4)

tensorrt_llm/runtime/model_runner.py (1)

mapping (824-825)

tensorrt_llm/_torch/distributed/communicator.py (3)

has_cp_helix (104-105)

cp_rank (68-69)

cp_size (56-57)

tensorrt_llm/mapping.py (2)

has_cp_helix (233-235)

cp_rank (534-535)

tensorrt_llm/_torch/device_mesh.py (1)

cp_rank (84-86)

tensorrt_llm/_torch/models/modeling_deepseekv3.py (3)

tensorrt_llm/_torch/distributed/communicator.py (8)

cp_size (56-57)

cp_rank (68-69)

tp_size (64-65)

world_size (44-45)

rank (40-41)

rank (457-458)

cp_config (108-109)

pp_size (60-61)

tensorrt_llm/mapping.py (4)

cp_rank (534-535)

Mapping (351-515)

rank (187-188)

rank (191-198)

tensorrt_llm/_torch/model_config.py (1)

ModelConfig (75-616)

tensorrt_llm/_torch/pyexecutor/model_engine.py (4)

tensorrt_llm/_torch/pyexecutor/llm_request.py (5)

LlmRequest (437-663)

append (101-127)

append (195-212)

cached_tokens (569-570)

cached_tokens (573-576)

tensorrt_llm/mapping.py (3)

CpType (24-32)

has_cp_helix (233-235)

cp_rank (534-535)

tensorrt_llm/_torch/distributed/communicator.py (3)

cp_size (56-57)

has_cp_helix (104-105)

cp_rank (68-69)

tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

is_warmup (344-345)

is_warmup (348-353)

tensorrt_llm/commands/serve.py (3)

tensorrt_llm/runtime/model_runner.py (1)

mapping (824-825)

tensorrt_llm/mapping.py (1)

CpType (24-32)

tensorrt_llm/_torch/distributed/communicator.py (4)

cp_config (108-109)

tp_size (64-65)

pp_size (60-61)

cp_size (56-57)

examples/llm-api/quickstart_advanced.py (1)

tensorrt_llm/_torch/distributed/communicator.py (1)

cp_size (56-57)

tensorrt_llm/_torch/attention_backend/trtllm.py (3)

cpp/tensorrt_llm/kernels/mlaKernels.h (2)

helix_position_offsets (109-134)

helix_is_inactive_rank (112-113)

tensorrt_llm/_torch/attention_backend/flashinfer.py (1)

cached_token_lens (116-118)

tensorrt_llm/_torch/attention_backend/interface.py (2)

seq_lens_kv (226-227)

seq_lens_kv (230-237)

tensorrt_llm/_torch/modules/attention.py (3)

cpp/tensorrt_llm/kernels/helixKernels.h (1)

tensorrt_llm (26-46)

tensorrt_llm/mapping.py (1)

CpType (24-32)

cpp/tensorrt_llm/kernels/mlaKernels.h (1)

helix_position_offsets (109-134)

🪛 Ruff (0.14.5)

tensorrt_llm/_torch/distributed/communicator.py

349-349: f-string without any placeholders

Remove extraneous f prefix

(F541)

367-367: f-string without any placeholders

Remove extraneous f prefix

(F541)

tests/integration/defs/disaggregated/test_disaggregated.py

1919-1919: Unused function argument: disaggregated_test_root

(ARG001)

tensorrt_llm/_torch/models/modeling_deepseekv3.py

1561-1561: Loop control variable idx not used within loop body

Rename unused idx to _idx

(B007)

1588-1588: f-string without any placeholders

Remove extraneous f prefix

(F541)

1660-1660: f-string without any placeholders

Remove extraneous f prefix

(F541)

tensorrt_llm/commands/serve.py

5-5: Redefinition of unused gc from line 2

Remove definition: gc

(F811)

130-131: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

130-131: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (8)

tensorrt_llm/llmapi/disagg_utils.py (1)

199-201: Including PP and CP in instance_num_ranks is consistent with mapping/world-size logic

Using tensor_parallel_size * pipeline_parallel_size * context_parallel_size makes the per-instance rank count match the full mapping used later in split_world_comm, while preserving backward compatibility when PP/CP are omitted (default 1). Looks good.
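
The rank-count reasoning above can be sketched as follows. This is only an illustration: the function name `instance_num_ranks` and the plain-dict config are assumptions, not the actual code in disagg_utils.py; only the three key names come from the comment.

```python
def instance_num_ranks(cfg: dict) -> int:
    """Ranks needed by one server instance: TP * PP * CP, each defaulting to 1."""
    tp = cfg.get("tensor_parallel_size", 1)
    pp = cfg.get("pipeline_parallel_size", 1)
    cp = cfg.get("context_parallel_size", 1)
    return tp * pp * cp


# Hypothetical configs, loosely mirroring a ctx TP2 / gen TP1+CP2 setup.
ctx_cfg = {"tensor_parallel_size": 2}
gen_cfg = {"tensor_parallel_size": 1, "context_parallel_size": 2}

ctx_ranks = instance_num_ranks(ctx_cfg)
gen_ranks = instance_num_ranks(gen_cfg)
legacy_ranks = instance_num_ranks({})  # no sizes given: falls back to 1
```

Because every factor defaults to 1, configs that never mention PP or CP produce the same count as before.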

tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2_gentp1cp2_deepseek_v3_lite_bf16_tllm_gen.yaml (1)

1-32: Helix disaggregated gen config is internally consistent

Context and generation sections use TP/CP sizes in a way that matches the updated disaggregation logic (context_parallel_size only on the generation side for Helix decode-only). No issues spotted.

tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

441-513: py_helix_is_inactive_rank flag wiring is consistent

Initializing self.py_helix_is_inactive_rank = False alongside other py_* fields and relying on create_child_request’s py_ copying is exactly what the resource manager needs for per-request Helix inactivity tracking. No changes requested.

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (1)

316-320: CP-aware attachment of Python-only request metadata looks good

Extending the condition to (tp_size > 1 or has_pp or cp_size > 1) ensures CP-only topologies also receive Python-side attributes on non-root ranks, without affecting single-rank setups. This is the right direction.
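
A minimal sketch of why the extra `cp_size > 1` term matters. Both helper names are hypothetical, and the "old" condition is an assumption inferred from the word "extending" in the comment above.

```python
def should_attach_py_metadata(tp_size: int, has_pp: bool, cp_size: int) -> bool:
    """True when more than one rank cooperates on a request (TP, PP, or CP)."""
    return tp_size > 1 or has_pp or cp_size > 1


def old_condition(tp_size: int, has_pp: bool) -> bool:
    """Assumed pre-change condition, which ignored CP."""
    return tp_size > 1 or has_pp


# A CP-only generation server: TP1, no PP, CP2.
cp_only_new = should_attach_py_metadata(tp_size=1, has_pp=False, cp_size=2)
cp_only_old = old_condition(tp_size=1, has_pp=False)

# A single-rank setup is unaffected by the change.
single_rank = should_attach_py_metadata(tp_size=1, has_pp=False, cp_size=1)
```

Under these assumptions, only the CP-only topology changes behavior: it was skipped before and is covered now.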

cpp/tensorrt_llm/thop/attentionOp.cpp (1)

181-238: No changes needed; concerns are not supported by the codebase

This review comment identifies issues that do not exist in the actual code:

Backward compatibility: All current Python call sites already pass exactly 2 tensors in mla_tensor_params (lines 481 and 1771 in tensorrt_llm/_torch/attention_backend/trtllm.py). The size() == 2 check enforces the existing pattern consistently, not a breaking change.

Uninitialized helix fields: The MlaParams struct explicitly initializes both helix_position_offsets{nullptr} and helix_is_inactive_rank{nullptr} as member initializers, ensuring they are always zero-initialized regardless of constructor form. The suggested MlaParams<T> mla_params{} is redundant.

API design: The required parameter without defaults aligns with TensorRT-LLM's intentional design for consistency in tensor parameter collections (documented in learnings from PR 6904).

Likely an incorrect or invalid review comment.

cpp/tensorrt_llm/kernels/mlaKernels.h (1)

108-112: Helix inactivity flag plumbed cleanly into MLA params

The new helix_is_inactive_rank pointer mirrors the existing helix_position_offsets pattern and keeps the struct’s nullable‑by‑default contract; no issues spotted here.

cpp/tensorrt_llm/thop/dsv3RopeOp.cpp (1)

45-70: Helix inactivity flag is correctly threaded through MLA rope generation

The new helix_is_inactive_rank pointer is plumbed cleanly from mla_tensor_params into MlaRopeGenArgs and then into MlaParams, with a strict but reasonable contract (mla_tensor_params.size() == 2). Optional handling via has_value() ? data_ptr : nullptr matches the existing pattern for helix_position_offsets. No functional issues spotted.

Also applies to: 88-111, 139-168, 277-283

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

545-566: Warmup gating by cp_type now correctly excludes only ULYSSES/STAR

The updated warmup logic:
cp_type = self.mapping.cp_config.get('cp_type', None)
if cp_type is not None:
    if cp_type in [CpType.ULYSSES, CpType.STAR]:
        return
means HELIX (and other non‑ULYSSES/STAR cp types) still run warmup, which is what you want for Helix CUDA graph / torch.compile specialization. This looks consistent with the new Helix integration and doesn’t affect non‑CP runs.
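
The gating can be exercised in isolation with a stand-in enum. The `CpType` class below is a toy replacement for the one in tensorrt_llm/mapping.py (its full member set is not shown in this thread), and `should_skip_warmup` is a hypothetical helper name.

```python
from enum import Enum, auto


class CpType(Enum):
    """Toy stand-in; the real enum lives in tensorrt_llm/mapping.py."""
    ULYSSES = auto()
    STAR = auto()
    HELIX = auto()


def should_skip_warmup(cp_config: dict) -> bool:
    """Skip warmup only for ULYSSES and STAR; None and HELIX fall through."""
    cp_type = cp_config.get("cp_type", None)
    return cp_type in (CpType.ULYSSES, CpType.STAR)


skip_no_cp = should_skip_warmup({})
skip_helix = should_skip_warmup({"cp_type": CpType.HELIX})
skip_star = should_skip_warmup({"cp_type": CpType.STAR})
skip_ulysses = should_skip_warmup({"cp_type": CpType.ULYSSES})
```

A missing `cp_type` yields `None`, which is not in the tuple, so non-CP runs keep warming up as before.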

coderabbitai · 2025-11-21T18:05:10Z

tensorrt_llm/_torch/attention_backend/trtllm.py

+        if self.helix_is_inactive_rank is not None and len(
+                self.helix_is_inactive_rank):
+            # If helix is inactive, attend to the previously cached tokens only.
+            # This gets further complicated with multiple requests as each request might
+            # have a different active helix rank.
+            assert cached_token_lens is not None, "cached_token_lens should be set for helix"
+            kv_lens = cached_token_lens
+            helix_is_inactive_rank_cpu = torch.tensor(
+                self.helix_is_inactive_rank,
+                dtype=torch.bool,
+                device='cpu',
+            )
+            active_rank = ~helix_is_inactive_rank_cpu
+            kv_lens[active_rank] += self.seq_lens_kv[active_rank]
+        else:
+            kv_lens = cached_token_lens + self.seq_lens_kv if cached_token_lens is not None else self.seq_lens_kv


⚠️ Potential issue | 🟠 Major

Preserve cached token counts when gating inactive Helix ranks
kv_lens = cached_token_lens keeps both names pointing to the same tensor, so the in-place kv_lens[active_rank] += … also mutates cached_token_lens. Downstream code (e.g., prepare_context_mla_with_cached_kv) expects cached_token_lens to still represent only the pre-existing cache, so the extra in-place additions inflate every consumer (num_ctx_cached_tokens, max_ctx_cached_token_len, etc.). On Helix-enabled runs this corrupts scheduler state and can mis-size allocations or loop bounds.
Grab a detached copy before modifying:

-            kv_lens = cached_token_lens
+            kv_lens = cached_token_lens.clone()
             kv_lens[active_rank] += self.seq_lens_kv[active_rank]

🤖 Prompt for AI Agents

In tensorrt_llm/_torch/attention_backend/trtllm.py around lines 855-870, kv_lens is set to reference cached_token_lens and then mutated in-place, which corrupts cached_token_lens for downstream consumers; fix by making kv_lens a copy (e.g., kv_lens = cached_token_lens.clone() or cached_token_lens.detach().clone()) so subsequent kv_lens[active_rank] += ... mutates only the local copy and leaves cached_token_lens unchanged, preserving expected cached counts.
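
The aliasing hazard described in this comment can be reproduced without torch. The sketch below uses plain Python lists as a stand-in for tensors (`list(...)` plays the role of `.clone()`); the function names are hypothetical, and each list index represents one request in the batch.

```python
def kv_lens_aliased(cached, new, active):
    """Buggy variant: kv is the same object as cached, so cached is mutated."""
    kv = cached
    for i, is_active in enumerate(active):
        if is_active:
            kv[i] += new[i]
    return kv


def kv_lens_copied(cached, new, active):
    """Fixed variant: work on a copy, leaving cached untouched."""
    kv = list(cached)  # analogous to cached_token_lens.clone()
    for i, is_active in enumerate(active):
        if is_active:
            kv[i] += new[i]
    return kv


new_tokens = [1, 1]
active = [False, True]  # this rank is active only for the second request

cached_a = [4, 3]
kv_a = kv_lens_aliased(cached_a, new_tokens, active)

cached_b = [4, 3]
kv_b = kv_lens_copied(cached_b, new_tokens, active)
```

Both variants return the same `kv` values, but only the copied variant leaves the cached counts at their pre-update values for later consumers.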

coderabbitai · 2025-11-21T18:05:10Z

tensorrt_llm/_torch/distributed/communicator.py

+        self.create_cp_comm()
+        # Repurpose CP ranks to TP for Helix so that the right comms are created.
+        mapping_with_helix = None
+        if self.mapping.cp_size > 1:
+            print(f"[MPIDist::__init__] Repurposing CP ranks to TP for Helix.")
+            mapping_with_helix = copy.deepcopy(self.mapping)
+            mapping_without_helix = Mapping(
+                world_size=self.mapping.world_size,
+                rank=self.mapping.rank,
+                gpus_per_node=self.mapping.gpus_per_node,
+                cp_size=1,
+                cp_config={},
+                tp_size=self.mapping.tp_size * self.mapping.cp_size,
+                pp_size=self.mapping.pp_size,
+                moe_ep_size=self.mapping.moe_ep_size,
+                enable_attention_dp=self.mapping.enable_attention_dp)
+            self.mapping = mapping_without_helix


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Avoid deep‑copying Mapping and consider reusing more fields in the Helix repurposed mapping

The temporary repurposing of CP into TP is a good approach for Helix, but two details are worth tightening:

Deep copy of mapping is unnecessary and may desync shared state

You currently do:

mapping_with_helix = copy.deepcopy(self.mapping)
...
self.mapping = mapping_with_helix

This makes MPIDist.mapping refer to a new Mapping instance, different from the original object passed into the engine, which could be surprising if other components rely on identity or mutate the original mapping later.

Since you don’t mutate the original mapping object here, a plain alias is enough:

-        mapping_with_helix = None
-        if self.mapping.cp_size > 1:
-            ...
-            mapping_with_helix = copy.deepcopy(self.mapping)
+        mapping_with_helix = None
+        if self.mapping.cp_size > 1:
+            ...
+            mapping_with_helix = self.mapping

mapping_without_helix only forwards a subset of Mapping kwargs

Today you only pass world_size, rank, gpus_per_node, {cp,tp,pp}_size, moe_ep_size, and enable_attention_dp. If future code in Mapping starts deriving TP/PP groups from other fields (e.g., pp_partition, moe_cluster_size, moe_tp_size, attn_tp_size, attn_cp_size), mapping_without_helix may no longer faithfully represent the original layout.

Consider either:

adding the missing fields when constructing Mapping(...), or

adding a helper on Mapping (e.g., mapping.repurpose_for_helix_tp()) that centralizes this logic.

That keeps communicator initialization robust to future Mapping changes.

I'll verify the concerns raised in this review comment by investigating the codebase.

🏁 Script executed:

#!/bin/bash
# Find the Mapping class definition to understand all available fields
rg -n "class Mapping" --type=py

Length of output: 101

🏁 Script executed:

#!/bin/bash
# Look at the complete context around lines 345-361
sed -n '330,380p' tensorrt_llm/_torch/distributed/communicator.py

Length of output: 1847

🏁 Script executed:

#!/bin/bash
# Check how mapping_with_helix is used after creation
rg -n "mapping_with_helix" tensorrt_llm/_torch/distributed/communicator.py -A5 -B2

Length of output: 1008

🏁 Script executed:

#!/bin/bash
# Check how mapping_without_helix is used
rg -n "mapping_without_helix" tensorrt_llm/_torch/distributed/communicator.py -A3 -B2

Length of output: 620

🏁 Script executed:

#!/bin/bash
# Check if self.mapping is reassigned or mutated after line 361
sed -n '345,450p' tensorrt_llm/_torch/distributed/communicator.py

Length of output: 3687

🏁 Script executed:

#!/bin/bash
# Get the Mapping class definition and __init__ method to see all fields
sed -n '351,450p' tensorrt_llm/mapping.py

Length of output: 1446

🏁 Script executed:

#!/bin/bash
# Look at the complete Mapping class definition including constructor
ast-grep --pattern 'class Mapping { $$$ }'

Length of output: 0

🏁 Script executed:

#!/bin/bash
# Use rg to find the __init__ method of Mapping class with full signature
rg -A 40 "class Mapping\(MappingBase\)" tensorrt_llm/mapping.py

Length of output: 572

🏁 Script executed:

#!/bin/bash
# Find the __init__ method in Mapping class
rg -n "def __init__" tensorrt_llm/mapping.py | head -5

Length of output: 132

🏁 Script executed:

#!/bin/bash
# Get the full __init__ method of Mapping, starting from line search
sed -n '450,600p' tensorrt_llm/mapping.py

Length of output: 5541

🏁 Script executed:

#!/bin/bash
# Search for Mapping's __init__ with full parameter list
rg -A 60 "class Mapping\(MappingBase\):" tensorrt_llm/mapping.py | grep -A 50 "def __init__"

Length of output: 0

🏁 Script executed:

#!/bin/bash
# Find all __init__ in mapping.py and show which line each one is on
rg -n "def __init__" tensorrt_llm/mapping.py

Length of output: 132

Forward all Mapping constructor parameters when creating mapping_without_helix

The review identifies two valid concerns:

Deep copy is unnecessary: Since the original mapping is never mutated, a simple alias (mapping_with_helix = self.mapping) is sufficient instead of copy.deepcopy(self.mapping).

Missing constructor parameters: The Mapping class accepts 15 parameters, but mapping_without_helix only forwards 9 of them. Missing parameters are:

pp_partition

moe_cluster_size

moe_tp_size

attn_tp_size

attn_cp_size

enable_lm_head_tp_in_adp

Omitting these fields could cause incorrect communication group initialization or future incompatibility as the code evolves. Forward all parameters from the original mapping to ensure the temporary mapping_without_helix has the same configuration.

🧰 Tools

🪛 Ruff (0.14.5)

349-349: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents

In tensorrt_llm/_torch/distributed/communicator.py around lines 345 to 361, replace the unnecessary deep copy and ensure mapping_without_helix forwards all Mapping constructor parameters from the original mapping: set mapping_with_helix = self.mapping (no deepcopy) and construct mapping_without_helix by passing every field from self.mapping but with cp_size=1 and tp_size=self.mapping.tp_size * self.mapping.cp_size, explicitly including pp_partition, moe_cluster_size, moe_tp_size, attn_tp_size, attn_cp_size, enable_lm_head_tp_in_adp (and any other parameters the Mapping constructor expects) so the temporary mapping preserves all original settings except the repurposed CP/TP sizes.
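
One way to avoid forgetting fields is to derive the temporary mapping from the original and override only what changes. The sketch below shows this with a toy frozen dataclass and `dataclasses.replace`. `ToyMapping` and `repurpose_cp_as_tp` are hypothetical names. The real `Mapping` is not claimed to be a dataclass, may validate sizes against `world_size`, and has more fields than shown here.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class ToyMapping:
    """Toy stand-in for Mapping with a small, assumed subset of its fields."""
    world_size: int = 1
    rank: int = 0
    tp_size: int = 1
    pp_size: int = 1
    cp_size: int = 1
    cp_config: dict = field(default_factory=dict)
    moe_ep_size: int = 1
    moe_tp_size: int = 1
    enable_attention_dp: bool = False


def repurpose_cp_as_tp(m: ToyMapping) -> ToyMapping:
    """Fold CP into TP; every field not named here is carried over unchanged."""
    return replace(m, tp_size=m.tp_size * m.cp_size, cp_size=1, cp_config={})


original = ToyMapping(world_size=4, rank=1, tp_size=2, cp_size=2,
                      cp_config={"cp_type": "HELIX"}, moe_tp_size=2)
repurposed = repurpose_cp_as_tp(original)
```

Since `replace` builds a new object, the original stays intact and can be restored by plain reassignment, with no deep copy needed.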

coderabbitai · 2025-11-21T18:05:11Z

tensorrt_llm/_torch/distributed/communicator.py

+        mapping_with_helix = None
+        if self.mapping.cp_size > 1:
+            print(f"[MPIDist::__init__] Repurposing CP ranks to TP for Helix.")
+            mapping_with_helix = copy.deepcopy(self.mapping)
+            mapping_without_helix = Mapping(
+                world_size=self.mapping.world_size,
+                rank=self.mapping.rank,
+                gpus_per_node=self.mapping.gpus_per_node,
+                cp_size=1,
+                cp_config={},
+                tp_size=self.mapping.tp_size * self.mapping.cp_size,
+                pp_size=self.mapping.pp_size,
+                moe_ep_size=self.mapping.moe_ep_size,
+                enable_attention_dp=self.mapping.enable_attention_dp)
+            self.mapping = mapping_without_helix
        self.create_tp_comm()
        self.create_pp_comm()
-        self.create_cp_comm()
+
+        # Restore the original mapping.
+        if mapping_with_helix is not None:
+            print(f"[MPIDist::__init__] Restoring original mapping.")
+            self.mapping = mapping_with_helix


⚠️ Potential issue | 🟡 Minor

Replace bare print(f"...") with logger and drop unused f prefix

Using print(f"...") here both trips Ruff (F541) and is inconsistent with the rest of this module’s logging style. Suggest switching to the existing logger and removing the unnecessary f:

-        if self.mapping.cp_size > 1:
-            print(f"[MPIDist::__init__] Repurposing CP ranks to TP for Helix.")
+        if self.mapping.cp_size > 1:
+            logger.info("[MPIDist::__init__] Repurposing CP ranks to TP for Helix.")
@@
-        # Restore the original mapping.
-        if mapping_with_helix is not None:
-            print(f"[MPIDist::__init__] Restoring original mapping.")
-            self.mapping = mapping_with_helix
+        # Restore the original mapping.
+        if mapping_with_helix is not None:
+            logger.info("[MPIDist::__init__] Restoring original mapping.")
+            self.mapping = mapping_with_helix

This keeps logs consistent and satisfies the linter.

📝 Committable suggestion


Suggested change

-        mapping_with_helix = None
-        if self.mapping.cp_size > 1:
-            print(f"[MPIDist::__init__] Repurposing CP ranks to TP for Helix.")
-            mapping_with_helix = copy.deepcopy(self.mapping)
-            mapping_without_helix = Mapping(
-                world_size=self.mapping.world_size,
-                rank=self.mapping.rank,
-                gpus_per_node=self.mapping.gpus_per_node,
-                cp_size=1,
-                cp_config={},
-                tp_size=self.mapping.tp_size * self.mapping.cp_size,
-                pp_size=self.mapping.pp_size,
-                moe_ep_size=self.mapping.moe_ep_size,
-                enable_attention_dp=self.mapping.enable_attention_dp)
-            self.mapping = mapping_without_helix
-        self.create_tp_comm()
-        self.create_pp_comm()
-        self.create_cp_comm()
-        # Restore the original mapping.
-        if mapping_with_helix is not None:
-            print(f"[MPIDist::__init__] Restoring original mapping.")
-            self.mapping = mapping_with_helix
+        mapping_with_helix = None
+        if self.mapping.cp_size > 1:
+            logger.info("[MPIDist::__init__] Repurposing CP ranks to TP for Helix.")
+            mapping_with_helix = copy.deepcopy(self.mapping)
+            mapping_without_helix = Mapping(
+                world_size=self.mapping.world_size,
+                rank=self.mapping.rank,
+                gpus_per_node=self.mapping.gpus_per_node,
+                cp_size=1,
+                cp_config={},
+                tp_size=self.mapping.tp_size * self.mapping.cp_size,
+                pp_size=self.mapping.pp_size,
+                moe_ep_size=self.mapping.moe_ep_size,
+                enable_attention_dp=self.mapping.enable_attention_dp)
+            self.mapping = mapping_without_helix
+        self.create_tp_comm()
+        self.create_pp_comm()
+        # Restore the original mapping.
+        if mapping_with_helix is not None:
+            logger.info("[MPIDist::__init__] Restoring original mapping.")
+            self.mapping = mapping_with_helix

🧰 Tools

🪛 Ruff (0.14.5)

349-349: f-string without any placeholders. Remove extraneous f prefix. (F541)

367-367: f-string without any placeholders. Remove extraneous f prefix. (F541)

🤖 Prompt for AI Agents

In tensorrt_llm/_torch/distributed/communicator.py around lines 347 to 368, replace the two bare print(f"...") calls with the module logger (e.g., logger.info or logger.debug) and remove the unused f-string prefix; keep the same message text but pass it as a normal string to logger (e.g., logger.info("[MPIDist::__init__] Repurposing CP ranks to TP for Helix.") and logger.info("[MPIDist::__init__] Restoring original mapping.")), ensuring the module's logger is used consistently and no unused f-strings remain.

syuoni · 2025-11-25T01:42:52Z

tensorrt_llm/_torch/models/modeling_deepseekv3.py

+            model_config.mapping = Mapping(
+                world_size=model_config.mapping.world_size,
+                rank=model_config.mapping.rank,
+                gpus_per_node=model_config.mapping.gpus_per_node,
+                cp_size=1,
+                cp_config={},
+                tp_size=original_tp_size * original_cp_size,
+                pp_size=model_config.mapping.pp_size,
+                moe_ep_size=model_config.mapping.moe_ep_size,
+                enable_attention_dp=model_config.mapping.enable_attention_dp)


This logic appears multiple times; maybe we can wrap it as a method of Mapping, e.g. repurpose_kvp_to_tp?
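One way such a helper could look is sketched below. `ToyMapping` is a hypothetical stand-in that models only the fields visible in the hunk above, not the real `Mapping` class, and the method name simply follows the suggestion; the default values are assumptions.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class ToyMapping:
    """Hypothetical stand-in for Mapping; only the fields from the hunk are modelled."""
    world_size: int
    rank: int
    gpus_per_node: int
    tp_size: int = 1
    pp_size: int = 1
    cp_size: int = 1
    cp_config: dict = field(default_factory=dict)
    moe_ep_size: int = 1
    enable_attention_dp: bool = False

    def repurpose_kvp_to_tp(self) -> "ToyMapping":
        """Return a copy in which the CP (KVP) ranks are folded into TP.

        All other fields are carried over unchanged, as in the repeated
        Mapping(...) construction this helper would replace.
        """
        return replace(
            self,
            cp_size=1,
            cp_config={},
            tp_size=self.tp_size * self.cp_size,
        )


with_cp = ToyMapping(world_size=8, rank=3, gpus_per_node=8, tp_size=4, cp_size=2)
without_cp = with_cp.repurpose_kvp_to_tp()
print(without_cp.tp_size, without_cp.cp_size)  # 8 1
print(with_cp.tp_size, with_cp.cp_size)        # 4 2 (original is untouched)
```

Returning a new object rather than mutating in place keeps the CP-aware mapping available to callers that still need it.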

syuoni · 2025-11-25T01:46:35Z

tensorrt_llm/commands/serve.py

+@click.option("--cp_size",
+              type=int,
+              default=1,
+              help='Context parallelism size.')


Please also add cp_size to trtllm-bench and trtllm-eval

syuoni · 2025-11-25T01:50:30Z

tensorrt_llm/_torch/distributed/communicator.py

+        if self.mapping.cp_size > 1:
+            logger.info(
+                f"[MPIDist::__init__] Repurposing CP ranks to TP for Helix.")
+            mapping_with_helix = copy.deepcopy(self.mapping)


Looks like mapping_with_helix is the same as mapping_with_cp; could we unify the naming?

syuoni · 2025-11-25T01:58:16Z

tensorrt_llm/_torch/models/modeling_deepseekv3.py

+            print(
+                f"[DeepseekV3ForCausalLM::__init__] Repurposing KVP ranks to TP while keeping other details the same."
+            )
+            self.mapping_with_cp = copy.deepcopy(model_config.mapping)


If I understand correctly, the only difference between mapping_with_cp and mapping is the repurposed tp_size and cp_size. If so, is it possible to unify the two mapping objects into one (instead of duplicating them)?

For example, we can use a subclass HelixMapping which has a flag indicating whether it's "repurposed", and this flag affects the values accessed via mapping.tp_size and mapping.cp_size (probably two properties).
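A rough sketch of that idea follows. `BaseMapping` is a hypothetical stand-in for `Mapping`, and the attribute name `repurposed` is an assumption; the real class has many more fields and may not expose `tp_size` and `cp_size` as properties.

```python
class BaseMapping:
    """Hypothetical stand-in for Mapping; stores the configured sizes."""

    def __init__(self, tp_size: int, cp_size: int):
        self._tp_size = tp_size
        self._cp_size = cp_size

    @property
    def tp_size(self) -> int:
        return self._tp_size

    @property
    def cp_size(self) -> int:
        return self._cp_size


class HelixMapping(BaseMapping):
    """One object, two views: the flag decides which sizes are reported."""

    def __init__(self, tp_size: int, cp_size: int, repurposed: bool = False):
        super().__init__(tp_size, cp_size)
        self.repurposed = repurposed

    @property
    def tp_size(self) -> int:
        # When repurposed, CP ranks are counted as additional TP ranks.
        return self._tp_size * self._cp_size if self.repurposed else self._tp_size

    @property
    def cp_size(self) -> int:
        return 1 if self.repurposed else self._cp_size


attn_view = HelixMapping(tp_size=4, cp_size=2)
ffn_view = HelixMapping(tp_size=4, cp_size=2, repurposed=True)
print(attn_view.tp_size, attn_view.cp_size)  # 4 2
print(ffn_view.tp_size, ffn_view.cp_size)    # 8 1
```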

chuangz0

looks good to me for disagg part

MatthiasKohl

mainly minor things, overall LGTM

MatthiasKohl · 2025-11-25T17:27:32Z

tensorrt_llm/_torch/distributed/communicator.py

+        if self.mapping.cp_size > 1:
+            logger.info(
+                f"[MPIDist::__init__] Repurposing CP ranks to TP for Helix.")
+            mapping_with_helix = copy.deepcopy(self.mapping)


agreed, +1 to update this to mapping_with_cp here as everything else in mapping refers to it as CP

MatthiasKohl · 2025-11-25T17:33:00Z

tensorrt_llm/_torch/models/modeling_deepseekv3.py

+            print(
+                f"[DeepseekV3ForCausalLM::__init__] Repurposing KVP ranks to TP while keeping other details the same."
+            )
+            self.mapping_with_cp = copy.deepcopy(model_config.mapping)


IIRC, the issue was that the mapping is being passed around quite a bit, and then modules are set up depending on the values in the mapping. So using a sub-class + a repurposed flag may still be quite tricky to get right because it's hard to set the flag at the right time during __init__ of those sub-modules.
If we could easily set the flag, we could have also easily just updated the model_config.mapping or some other mapping object in place here, but unfortunately, it's not that easy.
If you have a suggestion that passes the integration tests, I think we'd be happy to use it!
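The timing concern can be illustrated with a toy example. Both classes below are hypothetical and assume a sub-module that reads `tp_size` once during `__init__`; they are not the actual TRTLLM modules.

```python
class FlaggedMapping:
    """Hypothetical mapping whose reported tp_size depends on a mutable flag."""

    def __init__(self, tp_size: int, cp_size: int):
        self._tp_size = tp_size
        self._cp_size = cp_size
        self.repurposed = False

    @property
    def tp_size(self) -> int:
        return self._tp_size * self._cp_size if self.repurposed else self._tp_size


class ToyShardedLayer:
    """Hypothetical sub-module that snapshots tp_size while it is constructed."""

    def __init__(self, mapping: FlaggedMapping, hidden_size: int = 1024):
        # The shard width is fixed here; later flag changes do not update it.
        self.shard_width = hidden_size // mapping.tp_size


mapping = FlaggedMapping(tp_size=2, cp_size=2)
early = ToyShardedLayer(mapping)   # flag off: 1024 // 2
mapping.repurposed = True
late = ToyShardedLayer(mapping)    # flag on: 1024 // 4
print(early.shard_width, late.shard_width)  # 512 256
```

Each layer ends up with whatever view was active when it was built, so the flag would have to be toggled at exactly the right points in the construction order.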

MatthiasKohl · 2025-11-25T17:34:22Z

tensorrt_llm/_torch/modules/attention.py

        self.num_heads = num_attention_heads
        self.num_key_value_heads = num_key_value_heads
        self.num_key_value_groups = self.num_heads // self.num_key_value_heads
+        assert self.num_heads == self.num_key_value_heads, "num_heads must be equal to num_key_value_heads"


I believe we need to remove this again, because latest main has some cases where num_heads and num_key_value_heads are different for DSA, but I'm not 100% sure.
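To make the concern concrete, here is a small sketch of how the new assert interacts with the group computation in the hunk. The head counts are illustrative values only, not taken from any particular model config.

```python
def num_key_value_groups(num_heads: int, num_key_value_heads: int) -> int:
    """Query heads per KV head, mirroring the integer division in the hunk."""
    return num_heads // num_key_value_heads


def passes_equal_heads_assert(num_heads: int, num_key_value_heads: int) -> bool:
    """True when the added assert would hold."""
    return num_heads == num_key_value_heads


# Illustrative head counts (assumed for this example).
configs = {"equal heads": (16, 16), "fewer KV heads": (16, 4)}
for name, (heads, kv_heads) in configs.items():
    print(name,
          num_key_value_groups(heads, kv_heads),
          passes_equal_heads_assert(heads, kv_heads))
```

Any configuration with more than one query head per KV head would trip the assert, which is why it may need to be dropped if such cases exist on main.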

MatthiasKohl · 2025-11-25T17:37:17Z

tests/unittest/_torch/modules/test_mla_helix.py

+    # all_scenarios[15],
+    # all_scenarios[21],
+    # all_scenarios[22],
+    all_scenarios[-1],


do we only want to test the small ctx_len ultimately or is this left over from debugging?

laikhtewari · 2025-11-25T20:38:08Z

Where is usage documented? I don't see any docs in the changed files list

Signed-off-by: Balaram Buddharaju <[email protected]> add ds-lite tllm-gen based disagg test Signed-off-by: Matthias Jouanneaux <[email protected]> initial support for helix parallelism Signed-off-by: Matthias Jouanneaux <[email protected]> fixed mapping tests, added working MLA module test, added disagg test for helix (WIP) Signed-off-by: Matthias Jouanneaux <[email protected]> Helix MLA module test: added more scenarios, removed unnecessary code Signed-off-by: Matthias Jouanneaux <[email protected]> MLA Helix test: restricting number of tests, better output Signed-off-by: Matthias Jouanneaux <[email protected]> test MLA helix: remove OOM test scenario Signed-off-by: Matthias Jouanneaux <[email protected]> test MLA helix: fix scenario max position embeddings Signed-off-by: Matthias Jouanneaux <[email protected]> test Helix MLA: try to fix NaNs Signed-off-by: Matthias Jouanneaux <[email protected]> added all-to-all impl Signed-off-by: Matthias Jouanneaux <[email protected]> fix thop lib Signed-off-by: Matthias Jouanneaux <[email protected]> fix alltoall Signed-off-by: Matthias Jouanneaux <[email protected]> attention MLA: remove kv heads (unused), improve heads naming, fix tests Signed-off-by: Matthias Jouanneaux <[email protected]> test Helix MLA: minor fixes Signed-off-by: Matthias Jouanneaux <[email protected]> test Helix MLA: disable numeric test Signed-off-by: Matthias Jouanneaux <[email protected]> test Helix MLA: add TODOs to MLA module Signed-off-by: Matthias Jouanneaux <[email protected]> test Helix MLA: fix MLA module Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> 
debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> fully working MLA test Signed-off-by: Matthias Jouanneaux <[email protected]> attempt to make latent cache work Signed-off-by: Matthias Jouanneaux <[email protected]> debugging numerical issue Signed-off-by: Matthias Jouanneaux <[email protected]> debugging numerical issue Signed-off-by: Matthias Jouanneaux <[email protected]> debugging numerical issue Signed-off-by: Matthias Jouanneaux <[email protected]> debugging numerical issue Signed-off-by: Matthias Jouanneaux <[email protected]> debugging numerical issue Signed-off-by: Matthias Jouanneaux <[email protected]> adding additional test for further numerical debugging Signed-off-by: Matthias Jouanneaux <[email protected]> fixing tests & correction Signed-off-by: Matthias Jouanneaux <[email protected]> remove debug output from tests Signed-off-by: Matthias Jouanneaux <[email protected]> fix tests Signed-off-by: Matthias Jouanneaux <[email protected]> further debugging with multiple sequences Signed-off-by: Matthias Jouanneaux <[email protected]> further debugging with multiple sequences Signed-off-by: Matthias Jouanneaux <[email protected]> further debugging with multiple sequences Signed-off-by: Matthias Jouanneaux <[email protected]> fixed multiple 
sequences tests Signed-off-by: Matthias Jouanneaux <[email protected]> automated review comments Signed-off-by: Matthias Jouanneaux <[email protected]> debugging of latent cache Signed-off-by: Matthias Jouanneaux <[email protected]> debugging of latent cache Signed-off-by: Matthias Jouanneaux <[email protected]> further debugging of pe values Signed-off-by: Matthias Jouanneaux <[email protected]> further debugging of latent cache Signed-off-by: Matthias Jouanneaux <[email protected]> fixed latent cache, remove flaky test Signed-off-by: Matthias Jouanneaux <[email protected]> better reporting Signed-off-by: Matthias Jouanneaux <[email protected]> better reporting Signed-off-by: Matthias Jouanneaux <[email protected]> finalized test scenarios Signed-off-by: Matthias Jouanneaux <[email protected]> better perf measurements, added graph support Signed-off-by: Matthias Jouanneaux <[email protected]> added helix post process kernel Signed-off-by: Matthias Jouanneaux <[email protected]> added unit test, minor fix for helix kernel Signed-off-by: Matthias Jouanneaux <[email protected]> fixing helix kernels Signed-off-by: Matthias Jouanneaux <[email protected]> better tests, minor fixes Signed-off-by: Matthias Jouanneaux <[email protected]> better tests, minor fixes Signed-off-by: Matthias Jouanneaux <[email protected]> debugging helix test Signed-off-by: Matthias Jouanneaux <[email protected]> debugging helix test Signed-off-by: Matthias Jouanneaux <[email protected]> debugging helix test Signed-off-by: Matthias Jouanneaux <[email protected]> fixed helix post process kernel: main kernel had perf issue/flaw Signed-off-by: Matthias Jouanneaux <[email protected]> fixed helix post process test Signed-off-by: Matthias Jouanneaux <[email protected]> added helix full layer test Signed-off-by: Matthias Jouanneaux <[email protected]> fix full layer helix test/bench Signed-off-by: Matthias Jouanneaux <[email protected]> added correct mapping to ds helix Signed-off-by: Matthias 
Jouanneaux <[email protected]> further improvements for fp8 init Signed-off-by: Matthias Jouanneaux <[email protected]> debugging quantization config Signed-off-by: Matthias Jouanneaux <[email protected]> better debug output Signed-off-by: Matthias Jouanneaux <[email protected]> fixes for fp8 Signed-off-by: Matthias Jouanneaux <[email protected]> fix fp8 runs Signed-off-by: Matthias Jouanneaux <[email protected]> attempt to fix fp8 context Signed-off-by: Matthias Jouanneaux <[email protected]> fix context phase: just randomly gen kv cache values. fix scenario sizes Signed-off-by: Matthias Jouanneaux <[email protected]> fix tp size config in helix layer test Signed-off-by: Matthias Jouanneaux <[email protected]> minor changes for test get trtllm-serve working with BF16 for gen with cp - v_b_proj weight loading needs to be revisited $ CUDA_VISIBLE_DEVICES=0,1 trtllm-serve /home/scratch.trt_llm_data/llm-models/DeepSeek-V3-Lite/bf16/ --host localhost --port 8002 --cp_size 2 --extra_llm_api_options ./gen_extra-llm-api-config.yaml end-to-end test in disagg works $ pytest tests/integration/defs/disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_bf16_tllm_gen_helix -s -v Switch to contiguous block dist among CP rank save changes to _merge_requests() undo changes to prepare_inputs() Raise exception for blocks fewer than num_cp_ranks save intermediate changes attempt to fix attention tests Signed-off-by: Matthias Jouanneaux <[email protected]> save changes for minimal test save minor dev comments added helix inactive rank option to MLA kernels Signed-off-by: Matthias Jouanneaux <[email protected]> pass the right seq_lens_kv - test with seqlen 64 works $ pytest tests/unittest/_torch/modules/test_mla_helix_expt.py -s -v is_inactive_helix at request level cp_allgather for position_id helix: make inactive rank a bool tensor Signed-off-by: Matthias Jouanneaux <[email protected]> undo mapping changes to modeling_deepseek Failed attempt to replace 
model_config.mapping fill in helix_is_inactive for each request update position_id logic better way to package mapping - repurpose comms creation too save disagg gen-only benchmark test prep for integration test improvements to position_id, num_cached_tokens_per_seq and tokens_per_block changes to save blocks at prefill changes to save blocks at decode add changes to read KV from disk updates to save and read KV blocks for all layers over-allocate at prefill to get cache transmission right prune saved KV cache files updates to avoid over-allocation on gen side in disagg Revert "over-allocate at prefill to get cache transmission right" This reverts commit af7d000. save disagg configs for DSV3 - currently goes OOM verifying tests on 8 GPUs helix: added (working) DS R1 8-GPU integration test Signed-off-by: Matthias Jouanneaux <[email protected]> helix: added large prompt + ds lite config using large prompt Signed-off-by: Matthias Jouanneaux <[email protected]> save intermediate changes for fixes fix debug printing Signed-off-by: Matthias Jouanneaux <[email protected]> Mention cache_transceiver_config.max_tokens_in_buffer for disagg servers save initial changes to benchmarking script added mjoux specific submit script, tighter timeouts, better defaults Signed-off-by: Matthias Jouanneaux <[email protected]> helix slurm: increase timeouts slightly, use deepgemm moe backend for smaller models Signed-off-by: Matthias Jouanneaux <[email protected]> helix slurm: add dataset caching path Signed-off-by: Matthias Jouanneaux <[email protected]> fix padding when input_len is divisible by tokens_per_block save changes to test varying prompt len fix_kvcache_split Signed-off-by: Chuang Zhu <[email protected]> avoid fabric memory and print send and recv sizes auto-determine transceiver size Signed-off-by: Matthias Jouanneaux <[email protected]> remove verbose print output Signed-off-by: Matthias Jouanneaux <[email protected]> attempt to fix DS R1 run Signed-off-by: Matthias 
Jouanneaux <[email protected]> helix slurm: fix parameters for DS R1 up to 256K tokens Signed-off-by: Matthias Jouanneaux <[email protected]> minor updates to reduce memory footprint and bring back warmup enable cudagraph and add some debug prints ugly hack to get results with 512k updates to benchmark 1M seqlen updates to benchmark 2M seqlen updates for passing down moe properly minor changes to get nsys profiles test helix layer: support for slurm call, support for fp4 Signed-off-by: Matthias Jouanneaux <[email protected]> test helix layer: added sbatch script Signed-off-by: Matthias Jouanneaux <[email protected]> add minimal cache transmission test for 1M seqlen minor bug fix changes to benchmark 4M seqlen skip launch/wait of context servers when TRTLLM_DISAGG_BENCHMARK_GEN_ONLY=1 remove hacks; skip profiling; gpu_mem_frac test helix layer: fix nvfp4 config to fit high perf mode Signed-off-by: Matthias Jouanneaux <[email protected]> helix single layer: improved timing, added arg parsing, added output parsing Signed-off-by: Matthias Jouanneaux <[email protected]> helix single layer: add dense option Signed-off-by: Matthias Jouanneaux <[email protected]> helix slurm: fix gen_only config, support EP config, add submit script for multiple configs, remove build_wheel by default for array benchmarking Signed-off-by: Matthias Jouanneaux <[email protected]> helix slurm: added parse script for results Signed-off-by: Matthias Jouanneaux <[email protected]> helix single layer: fixed test, added config submit script, improved parsing Signed-off-by: Matthias Jouanneaux <[email protected]> helix single layer: fix segment for sbatch script Signed-off-by: Matthias Jouanneaux <[email protected]> helix: fixed TP-only runs (removed hack to make higher seq len work), improved sbatch scripts Signed-off-by: Matthias Jouanneaux <[email protected]> helix: fix high node count runs, move back to e2e mode, improve parse script Signed-off-by: Matthias Jouanneaux <[email protected]> longer 
prompt for DSV3 Lite & DSR1 FP4 integration test disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_bf16_tllm_gen_helix disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_fp8_tllm_gen_helix disaggregated/test_disaggregated.py::test_disaggregated_deepseek_r1_fp4_tllm_gen_helix helix: added initial README for testing/benchmarking Signed-off-by: Matthias Jouanneaux <[email protected]> helix slurm: remove references to internal clusters Signed-off-by: Matthias Jouanneaux <[email protected]> minor updates to README minor updates helix: improve transpose/split for alltoall Signed-off-by: Matthias Jouanneaux <[email protected]> Revert "helix: improve transpose/split for alltoall" This reverts commit c8b24b9. helix: improve alltoall perf Signed-off-by: Matthias Jouanneaux <[email protected]> [https://nvbugs/5495789][feat] Optionally disable server GC and worker GC (NVIDIA#7995) Signed-off-by: Tailing Yuan <[email protected]> save changes for custom logging redo cherry-pick of attention.py save more changes for build and pipe-cleaning save more changes clean up - 1 clean up - 2 reuse mla_tensor_params instead of using helix_tensor_params undo all_tp_rank_num_tokens update test_disaggregated.py updates to dsv3RopeOp more cleanup save fp8 disagg test [https://nvbugs/5637012][fix] Fix helix unit tests Signed-off-by: Balaram Buddharaju <[email protected]> minor updates to attention.py updates to test - seqlen 64 works get integration test working

brb-nv changed the title ~~User/brb/integrate helix on main redo mr~~ [None][feat] Integrate helix parallelism Nov 20, 2025

brb-nv mentioned this pull request Nov 20, 2025

[None][feat] Integrate helix on main #8894

Closed

1 task

brb-nv commented Nov 21, 2025

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (outdated, resolved)

brb-nv force-pushed the user/brb/integrate-helix-on-main-redo-mr branch 2 times, most recently from 812dfb9 to 50436a1 Compare November 21, 2025 17:43

brb-nv marked this pull request as ready for review November 21, 2025 17:51

brb-nv requested review from a team as code owners November 21, 2025 17:51

brb-nv requested review from MatthiasKohl, Shixiaowei02, hlu1, laikhtewari and syuoni November 21, 2025 17:51

coderabbitai bot reviewed Nov 21, 2025

brb-nv force-pushed the user/brb/integrate-helix-on-main-redo-mr branch from 6f7ffc7 to ec20a04 Compare November 22, 2025 04:04

brb-nv requested a review from a team as a code owner November 23, 2025 01:17

brb-nv requested a review from chuangz0 November 23, 2025 01:17

brb-nv force-pushed the user/brb/integrate-helix-on-main-redo-mr branch 4 times, most recently from ec9faa5 to 7eabb38 Compare November 23, 2025 03:50

syuoni reviewed Nov 25, 2025

chuangz0 approved these changes Nov 25, 2025

MatthiasKohl suggested changes Nov 25, 2025

brb-nv added 4 commits November 25, 2025 12:43

update test prompt

303ba14

formatting

d624dd1

remove hardcoding

3d11205

brb-nv force-pushed the user/brb/integrate-helix-on-main-redo-mr branch from 7eabb38 to 3d11205 Compare November 25, 2025 20:43
