[None][feat] Integrate helix parallelism #9342
812dfb9 to 50436a1
📝 Walkthrough
This pull request implements context parallelism support with Helix configuration across the TensorRT-LLM inference stack. It adds per-rank inactivity tracking (
Changes
Sequence Diagram(s)
sequenceDiagram
participant Request
participant ResourceMgr as Resource<br/>Manager
participant ModelEngine
participant AttentionBE as Attention<br/>Backend
participant MLAKernel as MLA<br/>Kernel
Request->>ResourceMgr: prepare_resources()
activate ResourceMgr
alt cp_size > 1 and not last rank
ResourceMgr->>ResourceMgr: mark py_helix_is_inactive_rank=true
ResourceMgr->>ResourceMgr: skip KV cache allocation
else active rank
ResourceMgr->>ResourceMgr: allocate KV cache normally
end
deactivate ResourceMgr
Request->>ModelEngine: forward pass (generation)
activate ModelEngine
alt helix_is_inactive_rank[batch]==true
ModelEngine->>ModelEngine: fix past_seen_token_num<br/>(no increment)
ModelEngine->>ModelEngine: skip token processing
else active
ModelEngine->>ModelEngine: increment past_seen_token_num
ModelEngine->>AttentionBE: plan() with helix params
end
deactivate ModelEngine
AttentionBE->>AttentionBE: adjust kv_lens planning<br/>(exclude inactive ranks)
AttentionBE->>MLAKernel: mla_rope_generation<br/>(helix_is_inactive_rank)
activate MLAKernel
alt helix_is_inactive_rank[batch]==true
MLAKernel->>MLAKernel: skip token processing
MLAKernel->>MLAKernel: skip K/V updates
else active
MLAKernel->>MLAKernel: apply rope & assign QKV
MLAKernel->>MLAKernel: update K/V cache
end
deactivate MLAKernel
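The resource-manager branch of the diagram can be illustrated with a small sketch. It assumes that Helix is the active CP mode whenever `cp_size > 1` and that the last CP rank is the active one, as the diagram states. `ToyRequest`, `prepare_decode_step`, and the `allocate` callback are hypothetical stand-ins rather than TensorRT-LLM APIs; only the flag name `py_helix_is_inactive_rank` comes from the diagram.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a generation request."""
    request_id: int
    py_helix_is_inactive_rank: bool = False


def prepare_decode_step(requests: List[ToyRequest], cp_size: int, cp_rank: int,
                        allocate: Callable[[ToyRequest], None]) -> None:
    """Allocate decode-time KV cache only on the last (active) CP rank."""
    # Assumption: with cp_size > 1 the CP mode is Helix and the last rank is active.
    is_active_rank = cp_size <= 1 or cp_rank == cp_size - 1
    for req in requests:
        req.py_helix_is_inactive_rank = not is_active_rank
        if is_active_rank:
            allocate(req)  # hypothetical: reserves KV cache for the new token


def demo(cp_size: int, cp_rank: int) -> List[int]:
    """Return the ids of the requests that received a KV allocation."""
    allocated: List[int] = []
    reqs = [ToyRequest(0), ToyRequest(1)]
    prepare_decode_step(reqs, cp_size, cp_rank,
                        lambda r: allocated.append(r.request_id))
    return allocated
```

Calling `demo(2, 0)` models an inactive rank (no allocations), while `demo(2, 1)` models the active rank.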
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
Areas requiring extra attention:
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
tensorrt_llm/mapping.py (1)
455-491: Don’t silently override `cp_config` to HELIX for any `cp_size > 1`
This block in `Mapping.__init__`:

    #################################################################
    # TODO: Remove this hardcoding and obtain cp_config from llm_args.
    if cp_size > 1:
        cp_config = {"cp_type": CpType.HELIX}
    #################################################################

has broad side effects:
- Any caller that provides a non-Helix `cp_config` (e.g. STAR or ULYSSES) with `cp_size > 1` now gets that configuration silently discarded and treated as HELIX.
- Code that branches on `cp_config["cp_type"]` (e.g. `_merge_requests` in `executor_request_queue.py`, STAR attention paths, etc.) will never see `CpType.STAR`/`ULYSSES` once `cp_size > 1`, effectively breaking those CP modes.
- Additional `cp_config` fields (like STAR’s `block_size`/`cp_anchor_size`, or future Helix parameters) are lost.
If the intent is “for now we only support Helix when `cp_size > 1`”, it’s safer to:
- Only inject a default when `cp_config` is missing; and
- Fail fast on conflicting configs instead of overriding them:

    # Temporary default until cp_config is fully plumbed from llm_args.
    if cp_size > 1:
        if cp_config is None:
            cp_config = {"cp_type": CpType.HELIX}
        elif cp_config.get("cp_type") != CpType.HELIX:
            raise ValueError(
                f"Only CpType.HELIX is currently supported when cp_size > 1; got {cp_config.get('cp_type')!r}"
            )

That keeps Helix as the only supported multi-CP mode in this PR, but avoids surprising behavior for existing STAR/ULYSSES configs and makes future extension to other CP types straightforward.
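To make the suggested behavior easy to try in isolation, here is a self-contained sketch of the same default-or-fail-fast logic. The `CpType` enum below is a stand-in with arbitrary values (the real one lives in `tensorrt_llm/mapping.py`), and `resolve_cp_config` is a hypothetical helper name, not a function in the codebase.

```python
from enum import Enum, auto
from typing import Optional


class CpType(Enum):
    """Stand-in enum for illustration; member values are arbitrary."""
    ULYSSES = auto()
    STAR = auto()
    HELIX = auto()


def resolve_cp_config(cp_size: int, cp_config: Optional[dict]) -> Optional[dict]:
    """Default to Helix only when no config is given; reject conflicting configs."""
    if cp_size <= 1:
        return cp_config  # nothing to enforce without context parallelism
    if cp_config is None:
        return {"cp_type": CpType.HELIX}
    if cp_config.get("cp_type") != CpType.HELIX:
        raise ValueError(
            "Only CpType.HELIX is currently supported when cp_size > 1; "
            f"got {cp_config.get('cp_type')!r}")
    return cp_config  # caller-provided Helix config, extra fields preserved
```

With this shape, a caller passing a STAR config and `cp_size=2` gets an immediate `ValueError`, and any extra keys on a Helix config survive untouched.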
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
1568-1623: Tighten `helix_is_inactive_rank` initialization guard; verify warmup dummy request semantics for Helix
The new Helix logic is mostly sound, but there is one definite initialization bug and one edge case to confirm:
`helix_is_inactive_rank` initialization guard is incorrect
The current initialization at line 1568:

    helix_is_inactive_rank = [] if self.mapping.cp_size > 1 else None

initializes to an empty list for all CP configurations with `cp_size > 1`, but `has_cp_helix()` returns `True` only when both `cp_size > 1` and `cp_type == CpType.HELIX`. For non-Helix CP types (e.g., regular CP or other variants), this creates an empty list that never gets populated, diverging from the `None` state that downstream consumers expect when Helix is disabled.
Fix: Change line 1568 to:

    helix_is_inactive_rank = [] if self.mapping.has_cp_helix() else None

Warmup + Helix: `past_seen_token_num` override semantics need verification
During warmup, you correctly skip the `position_id` computation (line 1605), but `past_seen_token_num` is unconditionally overridden based on `request.orig_prompt_len` (lines 1608–1612) whenever Helix is active. This is fed into `num_cached_tokens_per_seq`, which becomes part of `KVCacheParams`. For dummy warmup requests, ensure:
- `orig_prompt_len` is consistently initialized for all dummy request types created during warmup, and
- the resulting KV cache index values remain within valid bounds on inactive Helix ranks.
Per-request inactivity flag wiring looks correct
The per-beam append pattern (lines 1572–1619) produces a `helix_is_inactive_rank` list with length equal to the total batch size (sum of beam widths), which matches the attention backend's `[batch_size]` expectation.
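The guard and the per-beam append pattern can be checked with a toy model. This is only a sketch: `ToyRequest` and `collect_inactive_flags` are hypothetical names, and the boolean `has_cp_helix` argument stands in for the result of `self.mapping.has_cp_helix()`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyRequest:
    """Hypothetical stand-in carrying only the fields this sketch needs."""
    beam_width: int
    py_helix_is_inactive_rank: bool = False


def collect_inactive_flags(requests: List[ToyRequest],
                           has_cp_helix: bool) -> Optional[List[bool]]:
    """Return one flag per beam, or None when Helix is not in use."""
    if not has_cp_helix:
        return None  # non-Helix CP types keep the "disabled" state
    flags: List[bool] = []
    for req in requests:
        for _ in range(req.beam_width):
            flags.append(req.py_helix_is_inactive_rank)
    return flags
```

For two requests with beam widths 1 and 3, the returned list has four entries, which is the sum of beam widths described above.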
2526-2537: Behavioral inconsistency confirmed: ULYSSES passes warmup checks but fails at runtime
The review concern is valid. I found that the change introduces a systemic breaking behavior across three PyExecutor methods:
- model_engine.py._prepare_inputs (line 2536): Raises for non-STAR/HELIX
- executor_request_queue.py._merge_requests (line 725): Raises for non-STAR/HELIX
- py_executor.py._update_request_states (line 2072): Raises for non-STAR/HELIX
The critical inconsistency:
- Warmup check (model_engine.py line 564) accepts ULYSSES and returns early
- Runtime execution (line 2536) raises `NotImplementedError` if ULYSSES reaches `_prepare_inputs`
- This means if someone configures PyExecutor with `cp_type=ULYSSES`, it will pass initialization but crash during inference
ULYSSES is defined in the CpType enum and explicitly referenced at line 564, indicating it was intended to be handled. However, no fallback path exists in the three runtime dispatch methods, and no test coverage was found for ULYSSES with PyExecutor. The previous behavior would have silently fallen through to the default `_prepare_tp_inputs` path.
While there's no evidence that existing code uses ULYSSES with PyExecutor, the enum inclusion and warmup-time acceptance create an expectation of support that the runtime contradicts.
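Whichever behavior is ultimately chosen for ULYSSES, one way to keep warmup and runtime from drifting apart is to route both through a single support check. The sketch below is an assumption about how that could look, not the PR's code: `SUPPORTED_CP_TYPES`, `check_cp_type_supported`, `warmup`, and `prepare_inputs` are hypothetical names, and `CpType` is a stand-in enum.

```python
from enum import Enum, auto


class CpType(Enum):
    """Stand-in enum for illustration; member values are arbitrary."""
    ULYSSES = auto()
    STAR = auto()
    HELIX = auto()


# Assumption: only these CP types have a runtime path in this flow.
SUPPORTED_CP_TYPES = frozenset({CpType.STAR, CpType.HELIX})


def check_cp_type_supported(cp_type: CpType) -> None:
    """Single source of truth used by both warmup and runtime dispatch."""
    if cp_type not in SUPPORTED_CP_TYPES:
        raise NotImplementedError(
            f"CP type {cp_type.name} is not supported in this flow")


def warmup(cp_type: CpType) -> str:
    check_cp_type_supported(cp_type)  # fails at init time, not mid-inference
    return "warmup-ok"


def prepare_inputs(cp_type: CpType) -> str:
    check_cp_type_supported(cp_type)
    return f"inputs-for-{cp_type.name}"
```

With a shared check, an unsupported type is rejected during warmup instead of surfacing later as a crash during inference.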
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
1446-1451: Fix TP sharding after restoring the original mapping
During `DeepseekV3ForCausalLM.__init__` we repurpose CP ranks into TP by installing a temporary `Mapping` (`tp_size = tp * cp`). All decoder/MTP modules capture that object via `self.mapping`. Later we restore `model_config.mapping` back to the original CP-aware mapping. Here in `DeepseekV3MTP.forward`, the chunking uses `self.model_config.mapping.tp_size`/`tp_rank`, which now point to the restored mapping and no longer match the row-parallel tensors created with the repurposed mapping. On Helix runs (`cp_size > 1`) this leaves each rank feeding the wrong slice (or no slice) into `eh_proj`, breaking generation.
Use the same mapping object that the layer captured during init. A minimal fix:

    - tp_size = self.model_config.mapping.tp_size
    - tp_rank = self.model_config.mapping.tp_rank
    + tp_size = self.mapping.tp_size
    + tp_rank = self.mapping.tp_rank

That keeps the MTP sharding consistent with the repurposed TP groups.
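A toy example shows why the two mappings disagree. It uses plain lists instead of tensors and assumes `tp=2`, `cp=2`, so the repurposed mapping has `tp_size=4`; the specific rank numbers and the `ToyMapping`/`shard` names are illustrative assumptions, not values taken from the PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ToyMapping:
    """Hypothetical stand-in exposing only tp_size and tp_rank."""
    tp_size: int
    tp_rank: int


def shard(values: List[int], mapping: ToyMapping) -> List[int]:
    """Return this rank's contiguous slice of an evenly divisible list."""
    chunk = len(values) // mapping.tp_size
    start = mapping.tp_rank * chunk
    return values[start:start + chunk]


hidden = list(range(8))

# Mapping captured by the layer at init: CP ranks repurposed into TP (2 * 2 = 4).
captured = ToyMapping(tp_size=4, tp_rank=3)
# Restored CP-aware mapping for the same process (assumed tp_rank=1 of 2).
restored = ToyMapping(tp_size=2, tp_rank=1)

expected_slice = shard(hidden, captured)  # [6, 7]
actual_slice = shard(hidden, restored)    # [4, 5, 6, 7]
```

The slice computed from the restored mapping is twice as wide as the weights sharded with the captured mapping expect, which is the mismatch the comment describes.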
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (1)
645-705: Helix merge: avoid hardcoded `tokens_per_block` and ensure `total_input_len_cp` is available on children
Two points in the Helix path:
Hardcoded `tokens_per_block=32`

    elif cp_type == CpType.HELIX:
        return self._merge_helix_requests(
            new_requests, tokens_per_block=32)
        # tokens_per_block=cp_config['tokens_per_block'])

- This ignores any configured Helix block size (e.g. via `cp_config['tokens_per_block']` or KV cache config) and makes the behavior fragile if someone changes the configured block size away from 32.
- It also repeats a TODO you already noted to remove this hardcoding.
Suggestion:
- Prefer pulling from config with a safe default + assert, e.g.:

    tokens_per_block = cp_config.get('tokens_per_block', 32)
    assert tokens_per_block > 0
    return self._merge_helix_requests(new_requests, tokens_per_block=tokens_per_block)

or at minimum assert that a configured value, if present, matches 32 so misconfigurations fail loudly instead of silently diverging.
`total_input_len_cp` not propagated to child requests

    req = executor_request_to_llm_request(...)
    req.total_input_len_cp = input_len
    req_with_children.append(req)
    if req.child_requests:
        req_with_children.extend(req.child_requests)

- `executor_request_to_llm_request` creates child requests via `LlmRequest.create_child_request`, which only copies attributes whose names start with `py_`.
- As a result, `total_input_len_cp` exists only on the parent; any downstream code that expects this attribute on every `LlmRequest` (including children when `num_return_sequences > 1`) will not find it.
Possible fix:
- Either rename to follow the `py_` convention so it’s auto-copied:

    req.py_total_input_len_cp = input_len
    for child in req.child_requests:
        child.py_total_input_len_cp = input_len

- Or, if you deliberately want a non-`py_` attribute, explicitly set it on children in this loop.
This will keep Helix metadata consistent across parent and child requests and future-proof the code against differing `tokens_per_block` configs.

    #!/bin/bash
    # Check how Helix-related fields are used so they stay consistent.
    rg -n "total_input_len_cp" -C3
    rg -n "tokens_per_block" tensorrt_llm/_torch/pyexecutor -C3

Also applies to: 710-723
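The `py_` copying rule described above can be reproduced with a toy class. This is a simplified model, not the real `LlmRequest`: `ToyRequest` is hypothetical, and it assumes the attributes are set on the parent before the child is created (in the reviewed code the children already exist when the attribute is assigned, which is why the suggested fix sets it on each child explicitly).

```python
class ToyRequest:
    """Hypothetical request whose children inherit only py_-prefixed attributes."""

    def __init__(self, request_id: int):
        self.request_id = request_id
        self.child_requests = []

    def create_child_request(self, child_id: int) -> "ToyRequest":
        child = ToyRequest(child_id)
        # Copy only attributes following the py_ naming convention.
        for name, value in vars(self).items():
            if name.startswith("py_"):
                setattr(child, name, value)
        self.child_requests.append(child)
        return child


parent = ToyRequest(0)
parent.total_input_len_cp = 128      # not copied: no py_ prefix
parent.py_total_input_len_cp = 128   # copied: follows the convention
child = parent.create_child_request(1)

child_has_plain_attr = hasattr(child, "total_input_len_cp")
child_py_value = child.py_total_input_len_cp
```

Here the child ends up with `py_total_input_len_cp` but not `total_input_len_cp`, mirroring the gap the comment points out.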
🧹 Nitpick comments (5)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
440-449: Decode-time KV allocation correctly gated on active Helix rank
Marking `req.py_helix_is_inactive_rank` on non-last CP ranks when `has_cp_helix()` and skipping `add_token` there ensures only the active Helix rank allocates decode-time KV cache, which matches the design.
You might consider using `mapping.is_last_cp_rank()` (and/or setting this flag once at request construction) for slightly clearer intent, but the current logic is functionally sound.
examples/llm-api/quickstart_advanced.py (1)
71-76: `cp_size` flag and `context_parallel_size` wiring are consistent
The new `--cp_size` argument and its use as `context_parallel_size=args.cp_size` in the `LLM` constructor align with the new CP plumbing. The change is self-contained and doesn’t affect existing callers.
Optionally, you might extend the help string for `--cp_size` to mention that multi-CP currently implies Helix in this flow, so users know what they’re opting into.
Also applies to: 261-264
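As a sketch of the optional help-string suggestion, the snippet below defines the flag with `argparse` and forwards it as `context_parallel_size`. The default of 1, the help wording, and the helper names `build_parser` and `llm_kwargs_from_args` are assumptions for illustration; the snippet builds a keyword dictionary rather than constructing an actual `LLM`.

```python
import argparse
from typing import Dict, List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Toy quickstart argument parser")
    parser.add_argument(
        "--cp_size",
        type=int,
        default=1,  # assumed default: context parallelism disabled
        help="Context parallel size. In this flow, values greater than 1 "
        "currently imply Helix parallelism.")
    return parser


def llm_kwargs_from_args(argv: List[str]) -> Dict[str, int]:
    """Map parsed CLI arguments onto the keyword used by the LLM constructor."""
    args = build_parser().parse_args(argv)
    return {"context_parallel_size": args.cp_size}
```

For example, `llm_kwargs_from_args(["--cp_size", "2"])` yields a dictionary that could be unpacked into the constructor call.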
tests/integration/defs/disaggregated/test_disaggregated.py (1)
154-274: New DeepSeek V3 Lite bf16 Helix disaggregated test wiring looks consistent
The new config entry and `test_disaggregated_deepseek_v3_lite_bf16_tllm_gen_helix` follow the same symlink + `run_disaggregated_test` pattern as the existing DeepSeek tests, and the `test_desc` string matches the key added to `config_map`, so the wiring looks correct.
If you want to silence Ruff’s ARG001 warning, you could rename `disaggregated_test_root` to `_disaggregated_test_root` in the new test (or add a `# noqa: ARG001`), but that’s cosmetic and consistent with the rest of this file.
Also applies to: 1915-1933
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
12-13: Remove duplicate `LlmRequest` import
`LlmRequest` is imported twice in this file (here and again at line 62). You can safely drop the earlier import and keep the one that also brings in `get_draft_token_length`:

    -from .llm_request import LlmRequest
    -
     import torch

This keeps imports minimal without changing behavior.
tensorrt_llm/commands/serve.py (1)
5-5: Drop the duplicate `gc` import
Line 2 already imports `gc`, so this second import triggers Ruff F811 (Redefinition of unused gc). Please drop the extra line to keep lint happy.

    -import gc

📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
cpp/tensorrt_llm/kernels/mlaKernels.cu (4 hunks)
cpp/tensorrt_llm/kernels/mlaKernels.h (1 hunks)
cpp/tensorrt_llm/thop/attentionOp.cpp (2 hunks)
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp (6 hunks)
examples/llm-api/quickstart_advanced.py (2 hunks)
tensorrt_llm/_torch/attention_backend/trtllm.py (9 hunks)
tensorrt_llm/_torch/distributed/communicator.py (2 hunks)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (12 hunks)
tensorrt_llm/_torch/modules/attention.py (10 hunks)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (3 hunks)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
tensorrt_llm/_torch/pyexecutor/model_engine.py (5 hunks)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1 hunks)
tensorrt_llm/commands/serve.py (7 hunks)
tensorrt_llm/llmapi/disagg_utils.py (1 hunks)
tensorrt_llm/mapping.py (1 hunks)
tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2_gentp1cp2_deepseek_v3_lite_bf16_tllm_gen.yaml (1 hunks)
tests/integration/defs/disaggregated/test_disaggregated.py (2 hunks)
🧰 Additional context used
🧠 Learnings (27)
📓 Common learnings
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.
📚 Learning: 2025-08-14T15:43:23.107Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.
Applied to files:
cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/_torch/attention_backend/trtllm.py
tensorrt_llm/_torch/modules/attention.py
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.
Applied to files:
cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/_torch/attention_backend/trtllm.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.
Applied to files:
cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/llmapi/disagg_utils.py
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.
Applied to files:
cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/llmapi/disagg_utils.py
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
cpp/tensorrt_llm/thop/attentionOp.cpp
cpp/tensorrt_llm/kernels/mlaKernels.cu
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/attention_backend/trtllm.py
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Applied to files:
cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/attention_backend/trtllm.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/thop/attentionOp.cpp
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
Applied to files:
cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/llmapi/disagg_utils.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.
Applied to files:
cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.
Applied to files:
tensorrt_llm/llmapi/disagg_utils.py
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/attention_backend/trtllm.py
📚 Learning: 2025-09-02T13:42:44.885Z
Learnt from: pcastonguay
Repo: NVIDIA/TensorRT-LLM PR: 7455
File: tensorrt_llm/_torch/pyexecutor/py_executor.py:1852-1860
Timestamp: 2025-09-02T13:42:44.885Z
Learning: In MPI communication within TensorRT-LLM pipeline parallelism, different communication types (tokens, logits, termination sync) must use disjoint tag namespaces to avoid message routing collisions when using the same source/destination patterns.
Applied to files:
tensorrt_llm/_torch/distributed/communicator.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/disaggregated/test_disaggregated.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
tests/integration/defs/disaggregated/test_disaggregated.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
tests/integration/defs/disaggregated/test_disaggregated.py
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/mlaKernels.cu
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.
Applied to files:
cpp/tensorrt_llm/kernels/mlaKernels.cu
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Applied to files:
cpp/tensorrt_llm/kernels/mlaKernels.cu
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
cpp/tensorrt_llm/kernels/mlaKernels.cu
tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.
Applied to files:
tensorrt_llm/mapping.py
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.
Applied to files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.
Applied to files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.
Applied to files:
tensorrt_llm/commands/serve.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.
Applied to files:
tensorrt_llm/commands/serve.py
📚 Learning: 2025-08-27T14:23:55.566Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.566Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.
Applied to files:
tensorrt_llm/_torch/modules/attention.py
🧬 Code graph analysis (12)
tensorrt_llm/_torch/distributed/communicator.py (2)
  tensorrt_llm/mapping.py (3)
    Mapping(351-515), rank(187-188), rank(191-198)
  tensorrt_llm/llmapi/llm_args.py (2)
    world_size(459-460), world_size(469-473)
tests/integration/defs/disaggregated/test_disaggregated.py (2)
  tests/integration/defs/conftest.py (4)
    disaggregated_test_root(2618-2623), disaggregated_example_root(285-290), llm_venv(702-719), deepseek_v3_model_root(616-631)
  tests/integration/defs/local_venv.py (1)
    get_working_directory(43-49)
cpp/tensorrt_llm/kernels/mlaKernels.cu (1)
  cpp/tensorrt_llm/kernels/mlaKernels.h (2)
    helix_position_offsets(109-134), helix_is_inactive_rank(112-113)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (2)
  tensorrt_llm/_torch/distributed/communicator.py (5)
    tp_size(64-65), has_pp(52-53), cp_size(56-57), rank(40-41), rank(457-458)
  tensorrt_llm/mapping.py (3)
    has_pp(258-259), rank(187-188), rank(191-198)
tensorrt_llm/mapping.py (1)
  tensorrt_llm/_torch/distributed/communicator.py (2)
    cp_size(56-57), cp_config(108-109)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (4)
  tensorrt_llm/runtime/model_runner.py (1)
    mapping(824-825)
  tensorrt_llm/_torch/distributed/communicator.py (3)
    has_cp_helix(104-105), cp_rank(68-69), cp_size(56-57)
  tensorrt_llm/mapping.py (2)
    has_cp_helix(233-235), cp_rank(534-535)
  tensorrt_llm/_torch/device_mesh.py (1)
    cp_rank(84-86)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (3)
  tensorrt_llm/_torch/distributed/communicator.py (8)
    cp_size(56-57), cp_rank(68-69), tp_size(64-65), world_size(44-45), rank(40-41), rank(457-458), cp_config(108-109), pp_size(60-61)
  tensorrt_llm/mapping.py (4)
    cp_rank(534-535), Mapping(351-515), rank(187-188), rank(191-198)
  tensorrt_llm/_torch/model_config.py (1)
    ModelConfig(75-616)
tensorrt_llm/_torch/pyexecutor/model_engine.py (4)
  tensorrt_llm/_torch/pyexecutor/llm_request.py (5)
    LlmRequest(437-663), append(101-127), append(195-212), cached_tokens(569-570), cached_tokens(573-576)
  tensorrt_llm/mapping.py (3)
    CpType(24-32), has_cp_helix(233-235), cp_rank(534-535)
  tensorrt_llm/_torch/distributed/communicator.py (3)
    cp_size(56-57), has_cp_helix(104-105), cp_rank(68-69)
  tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
    is_warmup(344-345), is_warmup(348-353)
tensorrt_llm/commands/serve.py (3)
  tensorrt_llm/runtime/model_runner.py (1)
    mapping(824-825)
  tensorrt_llm/mapping.py (1)
    CpType(24-32)
  tensorrt_llm/_torch/distributed/communicator.py (4)
    cp_config(108-109), tp_size(64-65), pp_size(60-61), cp_size(56-57)
examples/llm-api/quickstart_advanced.py (1)
  tensorrt_llm/_torch/distributed/communicator.py (1)
    cp_size(56-57)
tensorrt_llm/_torch/attention_backend/trtllm.py (3)
  cpp/tensorrt_llm/kernels/mlaKernels.h (2)
    helix_position_offsets(109-134), helix_is_inactive_rank(112-113)
  tensorrt_llm/_torch/attention_backend/flashinfer.py (1)
cached_token_lens(116-118)tensorrt_llm/_torch/attention_backend/interface.py (2)
seq_lens_kv(226-227)seq_lens_kv(230-237)
tensorrt_llm/_torch/modules/attention.py (3)
cpp/tensorrt_llm/kernels/helixKernels.h (1)
tensorrt_llm(26-46)tensorrt_llm/mapping.py (1)
CpType(24-32)cpp/tensorrt_llm/kernels/mlaKernels.h (1)
helix_position_offsets(109-134)
🪛 Ruff (0.14.5)
tensorrt_llm/_torch/distributed/communicator.py
349-349: f-string without any placeholders
Remove extraneous f prefix
(F541)
367-367: f-string without any placeholders
Remove extraneous f prefix
(F541)
tests/integration/defs/disaggregated/test_disaggregated.py
1919-1919: Unused function argument: disaggregated_test_root
(ARG001)
tensorrt_llm/_torch/models/modeling_deepseekv3.py
1561-1561: Loop control variable idx not used within loop body
Rename unused idx to _idx
(B007)
1588-1588: f-string without any placeholders
Remove extraneous f prefix
(F541)
1660-1660: f-string without any placeholders
Remove extraneous f prefix
(F541)
tensorrt_llm/commands/serve.py
5-5: Redefinition of unused gc from line 2
Remove definition: gc
(F811)
130-131: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
130-131: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tensorrt_llm/llmapi/disagg_utils.py (1)
199-201: Including PP and CP in `instance_num_ranks` is consistent with mapping/world-size logic
Using `tensor_parallel_size * pipeline_parallel_size * context_parallel_size` makes the per-instance rank count match the full mapping used later in `split_world_comm`, while preserving backward compatibility when PP/CP are omitted (default 1). Looks good.
tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2_gentp1cp2_deepseek_v3_lite_bf16_tllm_gen.yaml (1)
1-32: Helix disaggregated gen config is internally consistent
Context and generation sections use TP/CP sizes in a way that matches the updated disaggregation logic (`context_parallel_size` only on the generation side for Helix decode-only). No issues spotted.
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
441-513: `py_helix_is_inactive_rank` flag wiring is consistent
Initializing `self.py_helix_is_inactive_rank = False` alongside other `py_*` fields and relying on `create_child_request`'s `py_` copying is exactly what the resource manager needs for per-request Helix inactivity tracking. No changes requested.
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (1)
316-320: CP-aware attachment of Python-only request metadata looks good
Extending the condition to `(tp_size > 1 or has_pp or cp_size > 1)` ensures CP-only topologies also receive Python-side attributes on non-root ranks, without affecting single-rank setups. This is the right direction.
cpp/tensorrt_llm/thop/attentionOp.cpp (1)
181-238: No changes needed; concerns are not supported by the codebase
This review comment identifies issues that do not exist in the actual code:
- Backward compatibility: All current Python call sites already pass exactly 2 tensors in `mla_tensor_params` (lines 481 and 1771 in `tensorrt_llm/_torch/attention_backend/trtllm.py`). The `size() == 2` check enforces the existing pattern consistently, not a breaking change.
- Uninitialized helix fields: The `MlaParams` struct explicitly initializes both `helix_position_offsets{nullptr}` and `helix_is_inactive_rank{nullptr}` as member initializers, ensuring they are always zero-initialized regardless of constructor form. The suggested `MlaParams<T> mla_params{}` is redundant.
- API design: The required parameter without defaults aligns with TensorRT-LLM's intentional design for consistency in tensor parameter collections (documented in learnings from PR 6904).
Likely an incorrect or invalid review comment.
cpp/tensorrt_llm/kernels/mlaKernels.h (1)
108-112: Helix inactivity flag plumbed cleanly into MLA params
The new `helix_is_inactive_rank` pointer mirrors the existing `helix_position_offsets` pattern and keeps the struct's nullable-by-default contract; no issues spotted here.
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp (1)
45-70: Helix inactivity flag is correctly threaded through MLA rope generation
The new `helix_is_inactive_rank` pointer is plumbed cleanly from `mla_tensor_params` into `MlaRopeGenArgs` and then into `MlaParams`, with a strict but reasonable contract (`mla_tensor_params.size() == 2`). Optional handling via `has_value() ? data_ptr : nullptr` matches the existing pattern for `helix_position_offsets`. No functional issues spotted.
Also applies to: 88-111, 139-168, 277-283
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
545-566: Warmup gating by cp_type now correctly excludes only ULYSSES/STAR
The updated warmup logic:
cp_type = self.mapping.cp_config.get('cp_type', None)
if cp_type is not None:
    if cp_type in [CpType.ULYSSES, CpType.STAR]:
        return
means HELIX (and other non-ULYSSES/STAR cp types) still run warmup, which is what you want for Helix CUDA graph / torch.compile specialization. This looks consistent with the new Helix integration and doesn't affect non-CP runs.
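The warmup gating described in the last comment above can be sketched in isolation. This is a minimal illustration, not the code from `model_engine.py`: the `CpType` enum below is a hypothetical stand-in that lists only the three members named in this review, and `should_skip_warmup` is an invented helper name.

```python
from enum import IntEnum, auto


class CpType(IntEnum):
    """Hypothetical stand-in for the CpType enum in tensorrt_llm/mapping.py."""
    ULYSSES = auto()
    STAR = auto()
    HELIX = auto()


def should_skip_warmup(cp_config: dict) -> bool:
    """Return True only for the CP types that the review says skip warmup."""
    cp_type = cp_config.get("cp_type", None)
    if cp_type is None:
        # No context parallelism configured: warmup runs as usual.
        return False
    return cp_type in (CpType.ULYSSES, CpType.STAR)


skip_plain = should_skip_warmup({})
skip_helix = should_skip_warmup({"cp_type": CpType.HELIX})
skip_ulysses = should_skip_warmup({"cp_type": CpType.ULYSSES})
```

With this shape, a Helix configuration falls through to the normal warmup path, which matches the reviewer's reading of the change.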
| if self.helix_is_inactive_rank is not None and len( | ||
| self.helix_is_inactive_rank): | ||
| # If helix is inactive, attend to the previously cached tokens only. | ||
| # This gets further complicated with multiple requests as each request might | ||
| # have a different active helix rank. | ||
| assert cached_token_lens is not None, "cached_token_lens should be set for helix" | ||
| kv_lens = cached_token_lens | ||
| helix_is_inactive_rank_cpu = torch.tensor( | ||
| self.helix_is_inactive_rank, | ||
| dtype=torch.bool, | ||
| device='cpu', | ||
| ) | ||
| active_rank = ~helix_is_inactive_rank_cpu | ||
| kv_lens[active_rank] += self.seq_lens_kv[active_rank] | ||
| else: | ||
| kv_lens = cached_token_lens + self.seq_lens_kv if cached_token_lens is not None else self.seq_lens_kv |
Preserve cached token counts when gating inactive Helix ranks
kv_lens = cached_token_lens keeps both names pointing to the same tensor, so the in-place kv_lens[active_rank] += … also mutates cached_token_lens. Downstream code (e.g., prepare_context_mla_with_cached_kv) expects cached_token_lens to still represent only the pre-existing cache, so the extra in-place additions inflate every consumer (num_ctx_cached_tokens, max_ctx_cached_token_len, etc.). On Helix-enabled runs this corrupts scheduler state and can mis-size allocations or loop bounds.
Grab a detached copy before modifying:
- kv_lens = cached_token_lens
+ kv_lens = cached_token_lens.clone()
kv_lens[active_rank] += self.seq_lens_kv[active_rank]
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/attention_backend/trtllm.py around lines 855-870, kv_lens
is set to reference cached_token_lens and then mutated in-place, which corrupts
cached_token_lens for downstream consumers; fix by making kv_lens a copy (e.g.,
kv_lens = cached_token_lens.clone() or cached_token_lens.detach().clone()) so
subsequent kv_lens[active_rank] += ... mutates only the local copy and leaves
cached_token_lens unchanged, preserving expected cached counts.
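The aliasing problem flagged in this comment can be reproduced without torch. The sketch below uses plain Python lists as stand-ins for the tensors, so `list(...)` plays the role of `.clone()` and the loop plays the role of the masked in-place add; the function names are invented for illustration.

```python
def plan_kv_lens_aliased(cached_token_lens, seq_lens_kv, inactive):
    """Buggy variant: kv_lens is just another name for cached_token_lens."""
    kv_lens = cached_token_lens  # no copy, both names share one object
    for i, is_inactive in enumerate(inactive):
        if not is_inactive:
            kv_lens[i] += seq_lens_kv[i]  # also mutates cached_token_lens
    return kv_lens


def plan_kv_lens_copied(cached_token_lens, seq_lens_kv, inactive):
    """Fixed variant: copy first, then mutate only the local copy."""
    kv_lens = list(cached_token_lens)  # stand-in for cached_token_lens.clone()
    for i, is_inactive in enumerate(inactive):
        if not is_inactive:
            kv_lens[i] += seq_lens_kv[i]
    return kv_lens


new_tokens = [1, 1, 1]
inactive_ranks = [False, True, False]

buggy_cached = [10, 20, 30]
buggy_kv_lens = plan_kv_lens_aliased(buggy_cached, new_tokens, inactive_ranks)

fixed_cached = [10, 20, 30]
fixed_kv_lens = plan_kv_lens_copied(fixed_cached, new_tokens, inactive_ranks)
```

Both variants return the same `kv_lens`, but only the copied variant leaves the cached counts at their original values for later consumers.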
| self.create_cp_comm() | ||
| # Repurpose CP ranks to TP for Helix so that the right comms are created. | ||
| mapping_with_helix = None | ||
| if self.mapping.cp_size > 1: | ||
| print(f"[MPIDist::__init__] Repurposing CP ranks to TP for Helix.") | ||
| mapping_with_helix = copy.deepcopy(self.mapping) | ||
| mapping_without_helix = Mapping( | ||
| world_size=self.mapping.world_size, | ||
| rank=self.mapping.rank, | ||
| gpus_per_node=self.mapping.gpus_per_node, | ||
| cp_size=1, | ||
| cp_config={}, | ||
| tp_size=self.mapping.tp_size * self.mapping.cp_size, | ||
| pp_size=self.mapping.pp_size, | ||
| moe_ep_size=self.mapping.moe_ep_size, | ||
| enable_attention_dp=self.mapping.enable_attention_dp) | ||
| self.mapping = mapping_without_helix |
🧩 Analysis chain
Avoid deep-copying Mapping and consider reusing more fields in the Helix repurposed mapping
The temporary repurposing of CP into TP is a good approach for Helix, but two details are worth tightening:
- Deep copy of `mapping` is unnecessary and may desync shared state
  You currently do:
  mapping_with_helix = copy.deepcopy(self.mapping) ... self.mapping = mapping_with_helix
  This makes `MPIDist.mapping` refer to a new `Mapping` instance, different from the original object passed into the engine, which could be surprising if other components rely on identity or mutate the original mapping later.
  Since you don't mutate the original `mapping` object here, a plain alias is enough:
  - mapping_with_helix = None
  - if self.mapping.cp_size > 1:
  -     ...
  -     mapping_with_helix = copy.deepcopy(self.mapping)
  + mapping_with_helix = None
  + if self.mapping.cp_size > 1:
  +     ...
  +     mapping_with_helix = self.mapping
- `mapping_without_helix` only forwards a subset of Mapping kwargs
  Today you only pass `world_size`, `rank`, `gpus_per_node`, `{cp,tp,pp}_size`, `moe_ep_size`, and `enable_attention_dp`. If future code in `Mapping` starts deriving TP/PP groups from other fields (e.g., `pp_partition`, `moe_cluster_size`, `moe_tp_size`, `attn_tp_size`, `attn_cp_size`), `mapping_without_helix` may no longer faithfully represent the original layout.
  Consider either:
  - adding the missing fields when constructing `Mapping(...)`, or
  - adding a helper on `Mapping` (e.g., `mapping.repurpose_for_helix_tp()`) that centralizes this logic.
  That keeps communicator initialization robust to future Mapping changes.
I'll verify the concerns raised in this review comment by investigating the codebase.
---
🏁 Script executed:
```shell
#!/bin/bash
# Find the Mapping class definition to understand all available fields
rg -n "class Mapping" --type=py
```
Length of output: 101
🏁 Script executed:
```shell
#!/bin/bash
# Look at the complete context around lines 345-361
sed -n '330,380p' tensorrt_llm/_torch/distributed/communicator.py
```
Length of output: 1847
🏁 Script executed:
```shell
#!/bin/bash
# Check how mapping_with_helix is used after creation
rg -n "mapping_with_helix" tensorrt_llm/_torch/distributed/communicator.py -A5 -B2
```
Length of output: 1008
🏁 Script executed:
```shell
#!/bin/bash
# Check how mapping_without_helix is used
rg -n "mapping_without_helix" tensorrt_llm/_torch/distributed/communicator.py -A3 -B2
```
Length of output: 620
🏁 Script executed:
```shell
#!/bin/bash
# Check if self.mapping is reassigned or mutated after line 361
sed -n '345,450p' tensorrt_llm/_torch/distributed/communicator.py
```
Length of output: 3687
---
🏁 Script executed:
```shell
#!/bin/bash
# Get the Mapping class definition and __init__ method to see all fields
sed -n '351,450p' tensorrt_llm/mapping.py
```
Length of output: 1446
🏁 Script executed:
```shell
#!/bin/bash
# Look at the complete Mapping class definition including constructor
ast-grep --pattern 'class Mapping {
$$$
}'
```
Length of output: 0
🏁 Script executed:
```shell
#!/bin/bash
# Use rg to find the __init__ method of Mapping class with full signature
rg -A 40 "class Mapping\(MappingBase\)" tensorrt_llm/mapping.py
```
Length of output: 572
---
🏁 Script executed:
```shell
#!/bin/bash
# Find the __init__ method in Mapping class
rg -n "def __init__" tensorrt_llm/mapping.py | head -5
```
Length of output: 132
🏁 Script executed:
```shell
#!/bin/bash
# Get the full __init__ method of Mapping, starting from line search
sed -n '450,600p' tensorrt_llm/mapping.py
```
Length of output: 5541
🏁 Script executed:
```shell
#!/bin/bash
# Search for Mapping's __init__ with full parameter list
rg -A 60 "class Mapping\(MappingBase\):" tensorrt_llm/mapping.py | grep -A 50 "def __init__"
```
Length of output: 0
🏁 Script executed:
```shell
#!/bin/bash
# Find all __init__ in mapping.py and show which line each one is on
rg -n "def __init__" tensorrt_llm/mapping.py
```
Length of output: 132
Forward all Mapping constructor parameters when creating mapping_without_helix
The review identifies two valid concerns:
- Deep copy is unnecessary: Since the original mapping is never mutated, a simple alias (`mapping_with_helix = self.mapping`) is sufficient instead of `copy.deepcopy(self.mapping)`.
- Missing constructor parameters: The Mapping class accepts 15 parameters, but mapping_without_helix only forwards 9 of them. Missing parameters are:
  - pp_partition
  - moe_cluster_size
  - moe_tp_size
  - attn_tp_size
  - attn_cp_size
  - enable_lm_head_tp_in_adp
Omitting these fields could cause incorrect communication group initialization or future incompatibility as the code evolves. Forward all parameters from the original mapping to ensure the temporary mapping_without_helix has the same configuration.
🧰 Tools
🪛 Ruff (0.14.5)
349-349: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/distributed/communicator.py around lines 345 to 361,
replace the unnecessary deep copy and ensure mapping_without_helix forwards all
Mapping constructor parameters from the original mapping: set mapping_with_helix
= self.mapping (no deepcopy) and construct mapping_without_helix by passing
every field from self.mapping but with cp_size=1 and
tp_size=self.mapping.tp_size * self.mapping.cp_size, explicitly including
pp_partition, moe_cluster_size, moe_tp_size, attn_tp_size, attn_cp_size,
enable_lm_head_tp_in_adp (and any other parameters the Mapping constructor
expects) so the temporary mapping preserves all original settings except the
repurposed CP/TP sizes.
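One way to avoid silently dropping constructor arguments is to derive the repurposed mapping from the original and override only the fields that change. The sketch below assumes a dataclass-style container; `ToyMapping` and `repurpose_cp_to_tp` are hypothetical names, and the real `Mapping` class is not a dataclass, so this shows the pattern rather than a drop-in change.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class ToyMapping:
    """Hypothetical, simplified stand-in for tensorrt_llm.mapping.Mapping."""
    world_size: int = 1
    rank: int = 0
    gpus_per_node: int = 8
    cp_size: int = 1
    cp_config: dict = field(default_factory=dict)
    tp_size: int = 1
    pp_size: int = 1
    moe_ep_size: int = -1
    moe_tp_size: int = -1
    enable_attention_dp: bool = False


def repurpose_cp_to_tp(mapping: ToyMapping) -> ToyMapping:
    """Fold CP ranks into TP; every field not named here is carried over."""
    return replace(
        mapping,
        cp_size=1,
        cp_config={},
        tp_size=mapping.tp_size * mapping.cp_size,
    )


original = ToyMapping(world_size=4, tp_size=2, cp_size=2,
                      cp_config={"cp_type": "HELIX"}, moe_tp_size=2)
repurposed = repurpose_cp_to_tp(original)
```

Because `replace` copies every field it is not told to change, a field added to the container later (here `moe_tp_size`) is preserved without touching the helper, and the original object keeps its CP settings for the restore step.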
| mapping_with_helix = None | ||
| if self.mapping.cp_size > 1: | ||
| print(f"[MPIDist::__init__] Repurposing CP ranks to TP for Helix.") | ||
| mapping_with_helix = copy.deepcopy(self.mapping) | ||
| mapping_without_helix = Mapping( | ||
| world_size=self.mapping.world_size, | ||
| rank=self.mapping.rank, | ||
| gpus_per_node=self.mapping.gpus_per_node, | ||
| cp_size=1, | ||
| cp_config={}, | ||
| tp_size=self.mapping.tp_size * self.mapping.cp_size, | ||
| pp_size=self.mapping.pp_size, | ||
| moe_ep_size=self.mapping.moe_ep_size, | ||
| enable_attention_dp=self.mapping.enable_attention_dp) | ||
| self.mapping = mapping_without_helix | ||
| self.create_tp_comm() | ||
| self.create_pp_comm() | ||
| self.create_cp_comm() | ||
|
|
||
| # Restore the original mapping. | ||
| if mapping_with_helix is not None: | ||
| print(f"[MPIDist::__init__] Restoring original mapping.") | ||
| self.mapping = mapping_with_helix |
Replace bare print(f"...") with logger and drop unused f prefix
Using print(f"...") here both trips Ruff (F541) and is inconsistent with the rest of this module’s logging style. Suggest switching to the existing logger and removing the unnecessary f:
- if self.mapping.cp_size > 1:
- print(f"[MPIDist::__init__] Repurposing CP ranks to TP for Helix.")
+ if self.mapping.cp_size > 1:
+ logger.info("[MPIDist::__init__] Repurposing CP ranks to TP for Helix.")
@@
- # Restore the original mapping.
- if mapping_with_helix is not None:
- print(f"[MPIDist::__init__] Restoring original mapping.")
- self.mapping = mapping_with_helix
+ # Restore the original mapping.
+ if mapping_with_helix is not None:
+ logger.info("[MPIDist::__init__] Restoring original mapping.")
+ self.mapping = mapping_with_helix
This keeps logs consistent and satisfies the linter.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| mapping_with_helix = None | |
| if self.mapping.cp_size > 1: | |
| print(f"[MPIDist::__init__] Repurposing CP ranks to TP for Helix.") | |
| mapping_with_helix = copy.deepcopy(self.mapping) | |
| mapping_without_helix = Mapping( | |
| world_size=self.mapping.world_size, | |
| rank=self.mapping.rank, | |
| gpus_per_node=self.mapping.gpus_per_node, | |
| cp_size=1, | |
| cp_config={}, | |
| tp_size=self.mapping.tp_size * self.mapping.cp_size, | |
| pp_size=self.mapping.pp_size, | |
| moe_ep_size=self.mapping.moe_ep_size, | |
| enable_attention_dp=self.mapping.enable_attention_dp) | |
| self.mapping = mapping_without_helix | |
| self.create_tp_comm() | |
| self.create_pp_comm() | |
| self.create_cp_comm() | |
| # Restore the original mapping. | |
| if mapping_with_helix is not None: | |
| print(f"[MPIDist::__init__] Restoring original mapping.") | |
| self.mapping = mapping_with_helix | |
| mapping_with_helix = None | |
| if self.mapping.cp_size > 1: | |
| logger.info("[MPIDist::__init__] Repurposing CP ranks to TP for Helix.") | |
| mapping_with_helix = copy.deepcopy(self.mapping) | |
| mapping_without_helix = Mapping( | |
| world_size=self.mapping.world_size, | |
| rank=self.mapping.rank, | |
| gpus_per_node=self.mapping.gpus_per_node, | |
| cp_size=1, | |
| cp_config={}, | |
| tp_size=self.mapping.tp_size * self.mapping.cp_size, | |
| pp_size=self.mapping.pp_size, | |
| moe_ep_size=self.mapping.moe_ep_size, | |
| enable_attention_dp=self.mapping.enable_attention_dp) | |
| self.mapping = mapping_without_helix | |
| self.create_tp_comm() | |
| self.create_pp_comm() | |
| # Restore the original mapping. | |
| if mapping_with_helix is not None: | |
| logger.info("[MPIDist::__init__] Restoring original mapping.") | |
| self.mapping = mapping_with_helix |
🧰 Tools
🪛 Ruff (0.14.5)
349-349: f-string without any placeholders
Remove extraneous f prefix
(F541)
367-367: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/distributed/communicator.py around lines 347 to 368,
replace the two bare print(f"...") calls with the module logger (e.g.,
logger.info or logger.debug) and remove the unused f-string prefix; keep the
same message text but pass it as a normal string to logger (e.g.,
logger.info("[MPIDist::__init__] Repurposing CP ranks to TP for Helix.") and
logger.info("[MPIDist::__init__] Restoring original mapping.")), ensuring the
module's logger is used consistently and no unused f-strings remain.
6f7ffc7 to
ec20a04
Compare
ec9faa5 to
7eabb38
Compare
| model_config.mapping = Mapping( | ||
| world_size=model_config.mapping.world_size, | ||
| rank=model_config.mapping.rank, | ||
| gpus_per_node=model_config.mapping.gpus_per_node, | ||
| cp_size=1, | ||
| cp_config={}, | ||
| tp_size=original_tp_size * original_cp_size, | ||
| pp_size=model_config.mapping.pp_size, | ||
| moe_ep_size=model_config.mapping.moe_ep_size, | ||
| enable_attention_dp=model_config.mapping.enable_attention_dp) |
This logic appears multiple times, maybe we can wrap it as a method of Mapping like repurpose_kvp_to_tp?
| @click.option("--cp_size", | ||
| type=int, | ||
| default=1, | ||
| help='Context parallelism size.') |
Please also add cp_size to trtllm-bench and trtllm-eval
| if self.mapping.cp_size > 1: | ||
| logger.info( | ||
| f"[MPIDist::__init__] Repurposing CP ranks to TP for Helix.") | ||
| mapping_with_helix = copy.deepcopy(self.mapping) |
Looks like mapping_with_helix is the same as mapping_with_cp; could we unify the naming?
agreed, +1 to update this to mapping_with_cp here as everything else in mapping refers to it as CP
| print( | ||
| f"[DeepseekV3ForCausalLM::__init__] Repurposing KVP ranks to TP while keeping other details the same." | ||
| ) | ||
| self.mapping_with_cp = copy.deepcopy(model_config.mapping) |
If I understand correctly, the only difference between mapping_with_cp and mapping is about the repurposed tp_size and cp_size. If so, is it possible to unify the two mapping objects to one (instead of the duplication)?
For example, we can use a subclass HelixMapping which has a flag indicating whether it's "repurposed", and this flag affects the values accessed via mapping.tp_size and mapping.cp_size (probably two properties).
IIRC, the issue was that the mapping is being passed around quite a bit, and then modules are set up depending on the values in the mapping. So using a sub-class + a repurposed flag may still be quite tricky to get right because it's hard to set the flag at the right time during __init__ of those sub-modules.
If we could easily set the flag, we could have also easily just updated the model_config.mapping or some other mapping object in place here, but unfortunately, it's not that easy.
If you have a suggestion which is passing integration tests, I think we'd be happy to use that !
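To make the trade-off in this thread concrete, here is a rough sketch of the proposed subclass with a "repurposed" flag, along with the timing issue raised in the reply. Everything below is hypothetical (`ToyMapping`, `ToyHelixMapping`, `repurposed_as_tp`, `ToyModule`); it is not code from this PR and only illustrates why the flag has to be set at the right moment during sub-module `__init__`.

```python
from contextlib import contextmanager


class ToyMapping:
    """Hypothetical minimal mapping that exposes sizes as properties."""

    def __init__(self, tp_size: int, cp_size: int):
        self._tp_size = tp_size
        self._cp_size = cp_size

    @property
    def tp_size(self) -> int:
        return self._tp_size

    @property
    def cp_size(self) -> int:
        return self._cp_size


class ToyHelixMapping(ToyMapping):
    """Same object, two views: the flag decides what tp_size/cp_size report."""

    def __init__(self, tp_size: int, cp_size: int):
        super().__init__(tp_size, cp_size)
        self.repurposed = False

    @property
    def tp_size(self) -> int:
        if self.repurposed:
            return self._tp_size * self._cp_size
        return self._tp_size

    @property
    def cp_size(self) -> int:
        return 1 if self.repurposed else self._cp_size

    @contextmanager
    def repurposed_as_tp(self):
        """Temporarily report CP ranks as TP, then restore the previous view."""
        previous = self.repurposed
        self.repurposed = True
        try:
            yield self
        finally:
            self.repurposed = previous


class ToyModule:
    """Sub-module that reads the mapping once, at construction time."""

    def __init__(self, mapping: ToyMapping):
        self.tp_size = mapping.tp_size  # cached; later flag changes are not seen


mapping = ToyHelixMapping(tp_size=2, cp_size=4)
with mapping.repurposed_as_tp():
    built_inside = ToyModule(mapping)
built_outside = ToyModule(mapping)
```

The two modules end up with different `tp_size` values even though they share one mapping object, which is the difficulty described above: correctness depends on which sub-modules are constructed while the flag is set.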
chuangz0
left a comment
looks good to me for disagg part
MatthiasKohl
left a comment
mainly minor things, overall LGTM
| self.num_heads = num_attention_heads | ||
| self.num_key_value_heads = num_key_value_heads | ||
| self.num_key_value_groups = self.num_heads // self.num_key_value_heads | ||
| assert self.num_heads == self.num_key_value_heads, "num_heads must be equal to num_key_value_heads" |
I believe we need to remove this again, because latest main has some cases where num_heads and num_key_value_heads are different for DSA, but I'm not 100% sure.
| # all_scenarios[15], | ||
| # all_scenarios[21], | ||
| # all_scenarios[22], | ||
| all_scenarios[-1], |
do we only want to test the small ctx_len ultimately or is this left over from debugging?
Where is usage documented? I don't see any docs in the changed files list.
Signed-off-by: Balaram Buddharaju <[email protected]> add ds-lite tllm-gen based disagg test Signed-off-by: Matthias Jouanneaux <[email protected]> initial support for helix parallelism Signed-off-by: Matthias Jouanneaux <[email protected]> fixed mapping tests, added working MLA module test, added disagg test for helix (WIP) Signed-off-by: Matthias Jouanneaux <[email protected]> Helix MLA module test: added more scenarios, removed unnecessary code Signed-off-by: Matthias Jouanneaux <[email protected]> MLA Helix test: restricting number of tests, better output Signed-off-by: Matthias Jouanneaux <[email protected]> test MLA helix: remove OOM test scenario Signed-off-by: Matthias Jouanneaux <[email protected]> test MLA helix: fix scenario max position embeddings Signed-off-by: Matthias Jouanneaux <[email protected]> test Helix MLA: try to fix NaNs Signed-off-by: Matthias Jouanneaux <[email protected]> added all-to-all impl Signed-off-by: Matthias Jouanneaux <[email protected]> fix thop lib Signed-off-by: Matthias Jouanneaux <[email protected]> fix alltoall Signed-off-by: Matthias Jouanneaux <[email protected]> attention MLA: remove kv heads (unused), improve heads naming, fix tests Signed-off-by: Matthias Jouanneaux <[email protected]> test Helix MLA: minor fixes Signed-off-by: Matthias Jouanneaux <[email protected]> test Helix MLA: disable numeric test Signed-off-by: Matthias Jouanneaux <[email protected]> test Helix MLA: add TODOs to MLA module Signed-off-by: Matthias Jouanneaux <[email protected]> test Helix MLA: fix MLA module Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> 
debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> debugging Signed-off-by: Matthias Jouanneaux <[email protected]> fully working MLA test Signed-off-by: Matthias Jouanneaux <[email protected]> attempt to make latent cache work Signed-off-by: Matthias Jouanneaux <[email protected]> debugging numerical issue Signed-off-by: Matthias Jouanneaux <[email protected]> debugging numerical issue Signed-off-by: Matthias Jouanneaux <[email protected]> debugging numerical issue Signed-off-by: Matthias Jouanneaux <[email protected]> debugging numerical issue Signed-off-by: Matthias Jouanneaux <[email protected]> debugging numerical issue Signed-off-by: Matthias Jouanneaux <[email protected]> adding additional test for further numerical debugging Signed-off-by: Matthias Jouanneaux <[email protected]> fixing tests & correction Signed-off-by: Matthias Jouanneaux <[email protected]> remove debug output from tests Signed-off-by: Matthias Jouanneaux <[email protected]> fix tests Signed-off-by: Matthias Jouanneaux <[email protected]> further debugging with multiple sequences Signed-off-by: Matthias Jouanneaux <[email protected]> further debugging with multiple sequences Signed-off-by: Matthias Jouanneaux <[email protected]> further debugging with multiple sequences Signed-off-by: Matthias Jouanneaux <[email protected]> fixed multiple 
sequences tests Signed-off-by: Matthias Jouanneaux <[email protected]> automated review comments Signed-off-by: Matthias Jouanneaux <[email protected]> debugging of latent cache Signed-off-by: Matthias Jouanneaux <[email protected]> debugging of latent cache Signed-off-by: Matthias Jouanneaux <[email protected]> further debugging of pe values Signed-off-by: Matthias Jouanneaux <[email protected]> further debugging of latent cache Signed-off-by: Matthias Jouanneaux <[email protected]> fixed latent cache, remove flaky test Signed-off-by: Matthias Jouanneaux <[email protected]> better reporting Signed-off-by: Matthias Jouanneaux <[email protected]> better reporting Signed-off-by: Matthias Jouanneaux <[email protected]> finalized test scenarios Signed-off-by: Matthias Jouanneaux <[email protected]> better perf measurements, added graph support Signed-off-by: Matthias Jouanneaux <[email protected]> added helix post process kernel Signed-off-by: Matthias Jouanneaux <[email protected]> added unit test, minor fix for helix kernel Signed-off-by: Matthias Jouanneaux <[email protected]> fixing helix kernels Signed-off-by: Matthias Jouanneaux <[email protected]> better tests, minor fixes Signed-off-by: Matthias Jouanneaux <[email protected]> better tests, minor fixes Signed-off-by: Matthias Jouanneaux <[email protected]> debugging helix test Signed-off-by: Matthias Jouanneaux <[email protected]> debugging helix test Signed-off-by: Matthias Jouanneaux <[email protected]> debugging helix test Signed-off-by: Matthias Jouanneaux <[email protected]> fixed helix post process kernel: main kernel had perf issue/flaw Signed-off-by: Matthias Jouanneaux <[email protected]> fixed helix post process test Signed-off-by: Matthias Jouanneaux <[email protected]> added helix full layer test Signed-off-by: Matthias Jouanneaux <[email protected]> fix full layer helix test/bench Signed-off-by: Matthias Jouanneaux <[email protected]> added correct mapping to ds helix Signed-off-by: Matthias 
Jouanneaux <[email protected]> further improvements for fp8 init Signed-off-by: Matthias Jouanneaux <[email protected]> debugging quantization config Signed-off-by: Matthias Jouanneaux <[email protected]> better debug output Signed-off-by: Matthias Jouanneaux <[email protected]> fixes for fp8 Signed-off-by: Matthias Jouanneaux <[email protected]> fix fp8 runs Signed-off-by: Matthias Jouanneaux <[email protected]> attempt to fix fp8 context Signed-off-by: Matthias Jouanneaux <[email protected]> fix context phase: just randomly gen kv cache values. fix scenario sizes Signed-off-by: Matthias Jouanneaux <[email protected]> fix tp size config in helix layer test Signed-off-by: Matthias Jouanneaux <[email protected]> minor changes for test get trtllm-serve working with BF16 for gen with cp - v_b_proj weight loading needs to be revisited $ CUDA_VISIBLE_DEVICES=0,1 trtllm-serve /home/scratch.trt_llm_data/llm-models/DeepSeek-V3-Lite/bf16/ --host localhost --port 8002 --cp_size 2 --extra_llm_api_options ./gen_extra-llm-api-config.yaml end-to-end test in disagg works $ pytest tests/integration/defs/disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_bf16_tllm_gen_helix -s -v Switch to contiguous block dist among CP rank save changes to _merge_requests() undo changes to prepare_inputs() Raise exception for blocks fewer than num_cp_ranks save intermediate changes attempt to fix attention tests Signed-off-by: Matthias Jouanneaux <[email protected]> save changes for minimal test save minor dev comments added helix inactive rank option to MLA kernels Signed-off-by: Matthias Jouanneaux <[email protected]> pass the right seq_lens_kv - test with seqlen 64 works $ pytest tests/unittest/_torch/modules/test_mla_helix_expt.py -s -v is_inactive_helix at request level cp_allgather for position_id helix: make inactive rank a bool tensor Signed-off-by: Matthias Jouanneaux <[email protected]> undo mapping changes to modeling_deepseek Failed attempt to replace 
model_config.mapping fill in helix_is_inactive for each request update position_id logic better way to package mapping - repurpose comms creation too save disagg gen-only benchmark test prep for integration test improvements to position_id, num_cached_tokens_per_seq and tokens_per_block changes to save blocks at prefill changes to save blocks at decode add changes to read KV from disk updates to save and read KV blocks for all layers over-allocate at prefill to get cache transmission right prune saved KV cache files updates to avoid over-allocation on gen side in disagg Revert "over-allocate at prefill to get cache transmission right" This reverts commit af7d000. save disagg configs for DSV3 - currently goes OOM verifying tests on 8 GPUs helix: added (working) DS R1 8-GPU integration test Signed-off-by: Matthias Jouanneaux <[email protected]> helix: added large prompt + ds lite config using large prompt Signed-off-by: Matthias Jouanneaux <[email protected]> save intermediate changes for fixes fix debug printing Signed-off-by: Matthias Jouanneaux <[email protected]> Mention cache_transceiver_config.max_tokens_in_buffer for disagg servers save initial changes to benchmarking script added mjoux specific submit script, tighter timeouts, better defaults Signed-off-by: Matthias Jouanneaux <[email protected]> helix slurm: increase timeouts slightly, use deepgemm moe backend for smaller models Signed-off-by: Matthias Jouanneaux <[email protected]> helix slurm: add dataset caching path Signed-off-by: Matthias Jouanneaux <[email protected]> fix padding when input_len is divisible by tokens_per_block save changes to test varying prompt len fix_kvcache_split Signed-off-by: Chuang Zhu <[email protected]> avoid fabric memory and print send and recv sizes auto-determine transceiver size Signed-off-by: Matthias Jouanneaux <[email protected]> remove verbose print output Signed-off-by: Matthias Jouanneaux <[email protected]> attempt to fix DS R1 run Signed-off-by: Matthias 
Jouanneaux <[email protected]> helix slurm: fix parameters for DS R1 up to 256K tokens Signed-off-by: Matthias Jouanneaux <[email protected]> minor updates to reduce memory footprint and bring back warmup enable cudagraph and add some debug prints ugly hack to get results with 512k updates to benchmark 1M seqlen updates to benchmark 2M seqlen updates for passing down moe properly minor changes to get nsys profiles test helix layer: support for slurm call, support for fp4 Signed-off-by: Matthias Jouanneaux <[email protected]> test helix layer: added sbatch script Signed-off-by: Matthias Jouanneaux <[email protected]> add minimal cache transmission test for 1M seqlen minor bug fix changes to benchmark 4M seqlen skip launch/wait of context servers when TRTLLM_DISAGG_BENCHMARK_GEN_ONLY=1 remove hacks; skip profiling; gpu_mem_frac test helix layer: fix nvfp4 config to fit high perf mode Signed-off-by: Matthias Jouanneaux <[email protected]> helix single layer: improved timing, added arg parsing, added output parsing Signed-off-by: Matthias Jouanneaux <[email protected]> helix single layer: add dense option Signed-off-by: Matthias Jouanneaux <[email protected]> helix slurm: fix gen_only config, support EP config, add submit script for multiple configs, remove build_wheel by default for array benchmarking Signed-off-by: Matthias Jouanneaux <[email protected]> helix slurm: added parse script for results Signed-off-by: Matthias Jouanneaux <[email protected]> helix single layer: fixed test, added config submit script, improved parsing Signed-off-by: Matthias Jouanneaux <[email protected]> helix single layer: fix segment for sbatch script Signed-off-by: Matthias Jouanneaux <[email protected]> helix: fixed TP-only runs (removed hack to make higher seq len work), improved sbatch scripts Signed-off-by: Matthias Jouanneaux <[email protected]> helix: fix high node count runs, move back to e2e mode, improve parse script Signed-off-by: Matthias Jouanneaux <[email protected]> longer 
prompt for DSV3 Lite & DSR1 FP4 integration test disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_bf16_tllm_gen_helix disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_fp8_tllm_gen_helix disaggregated/test_disaggregated.py::test_disaggregated_deepseek_r1_fp4_tllm_gen_helix helix: added initial README for testing/benchmarking Signed-off-by: Matthias Jouanneaux <[email protected]> helix slurm: remove references to internal clusters Signed-off-by: Matthias Jouanneaux <[email protected]> minor updates to README minor updates helix: improve transpose/split for alltoall Signed-off-by: Matthias Jouanneaux <[email protected]> Revert "helix: improve transpose/split for alltoall" This reverts commit c8b24b9. helix: improve alltoall perf Signed-off-by: Matthias Jouanneaux <[email protected]> [https://nvbugs/5495789][feat] Optionally disable server GC and worker GC (NVIDIA#7995) Signed-off-by: Tailing Yuan <[email protected]> save changes for custom logging redo cherry-pick of attention.py save more changes for build and pipe-cleaning save more changes clean up - 1 clean up - 2 reuse mla_tensor_params instead of using helix_tensor_params undo all_tp_rank_num_tokens update test_disaggregated.py updates to dsv3RopeOp more cleanup save fp8 disagg test [https://nvbugs/5637012][fix] Fix helix unit tests Signed-off-by: Balaram Buddharaju <[email protected]> minor updates to attention.py updates to test - seqlen 64 works get integration test working
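The commit log above mentions switching to a contiguous block distribution among CP ranks and raising an exception when there are fewer blocks than CP ranks. A minimal sketch of one way such a split could look is given below. The function name `split_blocks_contiguously`, its parameters, and the choice to hand leftover blocks to the lowest ranks are all assumptions made for illustration. This is not the code from this PR.

```python
from typing import List


def split_blocks_contiguously(num_blocks: int, num_cp_ranks: int) -> List[range]:
    """Assign each CP rank one contiguous range of KV cache block indices.

    Hypothetical helper for illustration only; not the function used in this PR.
    """
    if num_cp_ranks < 1:
        raise ValueError("num_cp_ranks must be at least 1")
    if num_blocks < num_cp_ranks:
        # Mirrors the commit note "Raise exception for blocks fewer than num_cp_ranks".
        raise ValueError(
            f"need at least {num_cp_ranks} blocks, got {num_blocks}")

    base, extra = divmod(num_blocks, num_cp_ranks)
    ranges = []
    start = 0
    for rank in range(num_cp_ranks):
        # Assumption: the first `extra` ranks each take one additional block.
        size = base + (1 if rank < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    for rank, blocks in enumerate(split_blocks_contiguously(10, 4)):
        print(rank, list(blocks))
```

With 10 blocks and 4 ranks, this sketch gives ranks 0 and 1 three blocks each and ranks 2 and 3 two blocks each, so every rank owns a single unbroken span of the sequence.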
7eabb38 to 3d11205
Description
This MR integrates helix parallelism, an experimental feature, into TRT-LLM.
Background:
Changes in this MR:
Most changes in this MR enforce this:
resource_manager.py.
seq_len_kv in trtllm.py which is also adjusted accordingly.
"Repurposing" attn CP ranks to FFN TP ranks can make things quite messy. To keep this readable,
modeling_deepseekv3.py and pass mapping without cp to the rest.
communicator.py to obtain the right TP groups.
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message.
See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
Release Notes
New Features
--cp_size command-line argument for configuring context parallel size (default: 1)
Tests