[Bug]: RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Failed, NCCL error ../tensorrt_llm/runtime/ncclCommunicator.cpp:90 'invalid usage (run with NCCL_DEBUG=WARN for details)' #9263

### System Info

I'm running Dynamo with the TensorRT-LLM backend, with tensor_parallel_size=2, pipeline_parallel_size=1, and gpus_per_node=1. Both the prefill and decode processes fail to start.

Environment info:
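
To make the configuration concrete, here is a minimal sketch of how these settings could place two ranks on one GPU. It assumes each rank selects its local device as `rank % gpus_per_node`; that rule is an assumption for illustration and is not taken from the TensorRT-LLM source. `rank_to_device` and `shared_devices` are hypothetical helper names. Note that the logged engine args further down show tensor_parallel_size 1 and pipeline_parallel_size 2, which gives the same world size of 2.

```python
from collections import Counter


def rank_to_device(tensor_parallel_size: int,
                   pipeline_parallel_size: int,
                   gpus_per_node: int) -> dict[int, int]:
    """Map every rank to a local CUDA device index.

    Assumption (not confirmed from TensorRT-LLM source): a rank uses
    device `rank % gpus_per_node`.
    """
    world_size = tensor_parallel_size * pipeline_parallel_size
    return {rank: rank % gpus_per_node for rank in range(world_size)}


def shared_devices(mapping: dict[int, int]) -> list[int]:
    """Return device indices that more than one rank would use."""
    counts = Counter(mapping.values())
    return sorted(dev for dev, n in counts.items() if n > 1)


# Values from this report: world size 2, gpus_per_node=1.
mapping = rank_to_device(tensor_parallel_size=2,
                         pipeline_parallel_size=1,
                         gpus_per_node=1)
print(mapping)
print(shared_devices(mapping))
```

Under that assumption both ranks resolve to device 0, which would match the `cudaDev 0` reported for rank 0 and rank 1 in the NCCL log. With `gpus_per_node=2` the same helper reports no shared device.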
```
TensorRT-LLM version: 1.1.0rc5
GPU: A100
cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
```
Error logs:
```
[2025-11-18 09:33:24] WARNING __init__.py:56: dynamo.nixl_connect: Failed to load CuPy for GPU acceleration, utilizing numpy to provide CPU based operations.
2025-11-18T09:33:24.094006Z  WARN dynamo_runtime::config: DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS is deprecated and no longer used. System health is now determined by endpoints that register with health check payloads. Please update your configuration to register health check payloads directly on endpoints.
2025-11-18T09:33:24.519528Z  WARN dynamo_runtime::config: DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS is deprecated and no longer used. System health is now determined by endpoints that register with health check payloads. Please update your configuration to register health check payloads directly on endpoints.
2025-11-18T09:33:24.520302Z  INFO dynamo_runtime::system_status_server: [spawn_system_status_server] binding to: 0.0.0.0:9090
2025-11-18T09:33:24.520339Z  INFO dynamo_runtime::system_status_server: [spawn_system_status_server] system status server bound to: 0.0.0.0:9090
2025-11-18T09:33:24.520365Z  INFO dynamo_runtime::distributed: System status server started successfully on 0.0.0.0:9090
2025-11-18T09:33:24.520753Z  INFO main.worker: Signal handlers set up for graceful shutdown
2025-11-18T09:33:24.521962Z  INFO main.init: Initializing the worker with config: Config(namespace=test-trt, component=tensorrt_llm_next, endpoint=generate, model_path=Qwen/Qwen3-0.6B, served_model_name=Qwen/Qwen3-0.6B, tensor_parallel_size=1, pipeline_parallel_size=1, expert_parallel_size=None, kv_block_size=32, gpus_per_node=None, max_batch_size=2048, max_num_tokens=8192, max_seq_len=None, max_beam_width=1, free_gpu_memory_fraction=None, extra_engine_args=engine_configs/prefill.yaml, override_engine_args=, migration_limit=0, publish_events_and_metrics=False, disaggregation_mode=DisaggregationMode.PREFILL, disaggregation_strategy=DisaggregationStrategy.DECODE_FIRST, next_endpoint=, encode_endpoint=, modality=text, allowed_local_media_path=, max_file_size_mb=50, reasoning_parser=None, tool_call_parser=None, dump_config_to=None,custom_jinja_template=None
2025-11-18T09:33:24.525393Z  INFO main.init: TensorRT-LLM engine args: {'model': 'Qwen/Qwen3-0.6B', 'scheduler_config': SchedulerConfig(capacity_scheduler_policy=<CapacitySchedulerPolicy.GUARANTEED_NO_EVICT: 'GUARANTEED_NO_EVICT'>, context_chunking_policy=None, dynamic_batch_config=DynamicBatchConfig(enable_batch_size_tuning=True, enable_max_num_tokens_tuning=False, dynamic_batch_moving_average_window=128)), 'tensor_parallel_size': 1, 'pipeline_parallel_size': 2, 'moe_expert_parallel_size': 1, 'backend': 'pytorch', 'skip_tokenizer_init': True, 'build_config': BuildConfig(max_input_len=1024, max_seq_len=None, opt_batch_size=8, max_batch_size=2048, max_beam_width=1, max_num_tokens=8192, opt_num_tokens=None, max_prompt_embedding_table_size=0, kv_cache_type=None, gather_context_logits=False, gather_generation_logits=False, strongly_typed=True, force_num_profiles=None, profiling_verbosity='layer_names_only', enable_debug_output=False, max_draft_len=0, speculative_decoding_mode=<SpeculativeDecodingMode.NONE: 1>, use_refit=False, input_timing_cache=None, output_timing_cache='model.cache', lora_config=LoraConfig(lora_dir=[], lora_ckpt_source='hf', max_lora_rank=64, lora_target_modules=[], trtllm_modules_to_hf_modules={}, max_loras=None, max_cpu_loras=None, swap_gate_up_proj_lora_b_weight=True), auto_parallel_config=AutoParallelConfig(world_size=1, gpus_per_node=8, cluster_key=None, cluster_info=None, sharding_cost_model=<CostModel.ALPHA_BETA: 'alpha_beta'>, comm_cost_model=<CostModel.ALPHA_BETA: 'alpha_beta'>, enable_pipeline_parallelism=False, enable_shard_unbalanced_shape=False, enable_shard_dynamic_shape=False, enable_reduce_scatter=True, builder_flags=None, debug_mode=False, infer_shape=True, validation_mode=False, same_buffer_io={}, same_spec_io={}, sharded_io_allowlist=[], fill_weights=False, parallel_config_cache=None, profile_cache=None, dump_path=None, debug_outputs=[]), weight_sparsity=False, weight_streaming=False, plugin_config=PluginConfig(_dtype='float16', 
_bert_attention_plugin='auto', _gpt_attention_plugin='auto', _gemm_plugin=None, _explicitly_disable_gemm_plugin=False, _gemm_swiglu_plugin=None, _fp8_rowwise_gemm_plugin=None, _qserve_gemm_plugin=None, _identity_plugin=None, _nccl_plugin='auto', _lora_plugin=None, _dora_plugin=False, _weight_only_groupwise_quant_matmul_plugin=None, _weight_only_quant_matmul_plugin=None, _smooth_quant_plugins=True, _smooth_quant_gemm_plugin=None, _layernorm_quantization_plugin=None, _rmsnorm_quantization_plugin=None, _quantize_per_token_plugin=False, _quantize_tensor_plugin=False, _moe_plugin='auto', _mamba_conv1d_plugin='auto', _low_latency_gemm_plugin=None, _low_latency_gemm_swiglu_plugin=None, _gemm_allreduce_plugin=None, _context_fmha=True, _bert_context_fmha_fp32_acc=False, _paged_kv_cache=None, _remove_input_padding=True, _norm_quant_fusion=False, _reduce_fusion=False, _user_buffer=False, _tokens_per_block=32, _use_paged_context_fmha=True, _use_fp8_context_fmha=True, _fuse_fp4_quant=False, _multiple_profiles=False, _paged_state=True, _streamingllm=False, _manage_weights=False, _use_fused_mlp=True, _pp_reduce_scatter=False), use_strip_plan=False, max_encoder_input_len=1024, dry_run=False, visualize_network=None, monitor_memory=False, use_mrope=False), 'kv_cache_config': {'free_gpu_memory_fraction': 0.8}, 'gpus_per_node': 1, 'max_num_tokens': 8192, 'max_seq_len': None, 'max_beam_width': 1, 'max_batch_size': 2048, 'return_perf_metrics': False, 'enable_attention_dp': False, 'trust_remote_code': True, 'enable_chunked_prefill': True, 'disable_overlap_scheduler': True, 'cache_transceiver_config': {'backend': 'DEFAULT'}}
2025-11-18T09:33:27.122539Z  INFO main.init: Initializing NIXL Connect.
2025-11-18 09:33:27 NIXL INFO    _api.py:361 Backend UCX was instantiated
2025-11-18 09:33:27 NIXL INFO    _api.py:251 Initialized NIXL agent: aeaf719c06814f99953f6cf3dc3c5794
[11/18/2025-09:33:27] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: info
[11/18/2025-09:33:27] [TRT-LLM] [I] Using LLM with PyTorch backend
[11/18/2025-09:33:27] [TRT-LLM] [I] Set nccl_plugin to None.
[11/18/2025-09:33:27] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
[11/18/2025-09:33:27] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: info
[11/18/2025-09:33:27] [TRT-LLM] [I] start MpiSession with 2 workers
rank 0 using MpiCommSession to bind to external MPI processes
[11/18/2025-09:33:29] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
[11/18/2025-09:33:29] [TRT-LLM] [I] Generating a new HMAC key for server worker_init_status_queue
[11/18/2025-09:33:29] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
[11/18/2025-09:33:29] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue
[11/18/2025-09:33:29] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_queue
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[11/18/2025-09:33:30] [TRT-LLM] [RANK 0] [I] PyTorchConfig(extra_resource_managers={}, use_cuda_graph=True, cuda_graph_batch_sizes=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 64, 128], cuda_graph_max_batch_size=128, cuda_graph_padding_enabled=False, disable_overlap_scheduler=True, moe_max_num_tokens=None, moe_load_balancer=None, attention_dp_enable_balance=False, attention_dp_time_out_iters=50, attention_dp_batching_wait_iters=10, batch_wait_timeout_ms=0, attn_backend='TRTLLM', moe_backend='CUTLASS', moe_disable_finalize_fusion=False, enable_mixed_sampler=False, sampler_type=<SamplerType.auto: 'auto'>, kv_cache_dtype='auto', mamba_ssm_cache_dtype='auto', enable_iter_perf_stats=False, enable_iter_req_stats=False, print_iter_log=False, torch_compile_enabled=False, torch_compile_fullgraph=True, torch_compile_inductor_enabled=False, torch_compile_piecewise_cuda_graph=False, torch_compile_piecewise_cuda_graph_num_tokens=None, torch_compile_enable_userbuffers=True, torch_compile_max_num_streams=1, enable_autotuner=True, enable_layerwise_nvtx_marker=False, load_format=<LoadFormat.AUTO: 0>, enable_min_latency=False, allreduce_strategy='AUTO', stream_interval=1, force_dynamic_quantization=False, mm_encoder_only=False, _limit_torch_cuda_mem_fraction=True)
[11/18/2025-09:33:30] [TRT-LLM] [RANK 0] [I] ATTENTION RUNTIME FEATURES:  AttentionRuntimeFeatures(chunked_prefill=True, cache_reuse=True, has_speculative_draft_tokens=False, chunk_size=8192, chunked_prefill_buffer_batch_size=4)
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO Bootstrap: Using eth0:10.0.133.59<0>
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO cudaDriverVersion 12090
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO cudaDriverVersion 12090
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO NCCL version 2.27.5+cuda12.9
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO Bootstrap: Using eth0:10.0.133.59<0>
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO NCCL version 2.27.5+cuda12.9
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so.
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO NET/IB : No device found.
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO NET/IB : Using [RO]; OOB eth0:10.0.133.59<0>
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.133.59<0>
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO Initialized NET plugin Socket
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO Assigned NET plugin Socket to comm
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO Using network Socket
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO ncclCommInitRank comm 0x7a12257c8780 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 100000 commId 0xaa9becea043c4e6c - Init START
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so.
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO NET/IB : No device found.
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO NET/IB : Using [RO]; OOB eth0:10.0.133.59<0>
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.133.59<0>
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO Initialized NET plugin Socket
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO Assigned NET plugin Socket to comm
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO Using network Socket
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO ncclCommInitRank comm 0x397e3800 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 100000 commId 0xaa9becea043c4e6c - Init START
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO RAS client listening socket at ::1<28028>
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO RAS client listening socket at ::1<28028>
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO Bootstrap timings total 0.000994 (create 0.000044, send 0.000122, recv 0.000272, ring 0.000038, delay 0.000000)
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO Bootstrap timings total 0.011595 (create 0.000042, send 0.000138, recv 0.010765, ring 0.000036, delay 0.000000)

[2025-11-18 09:33:30] test-trt-prefill-worker-0-0:42:982 [0] init.cc:737 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 100000
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO init.cc:1449 -> 5
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO init.cc:1832 -> 5

[2025-11-18 09:33:30] test-trt-prefill-worker-0-0:43:43 [0] init.cc:737 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 100000
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO init.cc:1449 -> 5
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO init.cc:1832 -> 5
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO init.cc:1858 -> 5
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO init.cc:1858 -> 5
[11/18/2025-09:33:30] [TRT-LLM] [RANK 1] [E] Failed to initialize executor on rank 1: [TensorRT-LLM][ERROR] Assertion failed: Failed, NCCL error ../tensorrt_llm/runtime/ncclCommunicator.cpp:90 'invalid usage (run with NCCL_DEBUG=WARN for details)'
 (../tensorrt_llm/runtime/ncclCommunicator.cpp:90)
1       0x79bcb3a2a0f4 tensorrt_llm::common::throwRuntimeError(char const*, int, char const*) + 76
2       0x79bc639171ec tensorrt_llm::runtime::NcclCommunicator::createComm(int, int, tensorrt_llm::mpi::MpiComm const&) + 316
3       0x79bcb3dd0e3e torch_ext::NcclCommunicatorOp::NcclCommunicatorOp(long, long) + 110
4       0x79bcb3dd2ab6 std::_Function_handler<void (std::vector<c10::IValue, std::allocator<c10::IValue> >&), torch::class_<torch_ext::NcclCommunicatorOp>::defineMethod<torch::class_<torch_ext::NcclCommunicatorOp>::def<long, long>(torch::detail::types<void, long, long>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::initializer_list<torch::arg>)::{lambda(c10::tagged_capsule<torch_ext::NcclCommunicatorOp>, long, long)#1}>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, torch::class_<torch_ext::NcclCommunicatorOp>::def<long, long>(torch::detail::types<void, long, long>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::initializer_list<torch::arg>)::{lambda(c10::tagged_capsule<torch_ext::NcclCommunicatorOp>, long, long)#1}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::initializer_list<torch::arg>)::{lambda(std::vector<c10::IValue, std::allocator<c10::IValue> >&)#1}>::_M_invoke(std::_Any_data const&, std::vector<c10::IValue, std::allocator<c10::IValue> >&) + 118
5       0x79bec4e29444 /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so(+0x95d444) [0x79bec4e29444]
6       0x79bec4e29cc5 /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so(+0x95dcc5) [0x79bec4e29cc5]
7       0x79bec4e2731e /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so(+0x95b31e) [0x79bec4e2731e]
8       0x79bec4e2a090 /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so(+0x95e090) [0x79bec4e2a090]
9       0x79bec485037d /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so(+0x38437d) [0x79bec485037d]
10            0x581a6f python3() [0x581a6f]
11            0x5492f5 _PyObject_MakeTpCall + 117
12            0x54d017 python3() [0x54d017]
13            0x5a31ea python3() [0x5a31ea]
14            0x5492f5 _PyObject_MakeTpCall + 117
15            0x5d68bf _PyEval_EvalFrameDefault + 2783
16            0x54ab42 _PyObject_Call_Prepend + 194
17            0x59da4f python3() [0x59da4f]
18            0x599513 python3() [0x599513]
19            0x5492f5 _PyObject_MakeTpCall + 117
20            0x5d68bf _PyEval_EvalFrameDefault + 2783
21            0x54ac0a _PyObject_Call_Prepend + 394
22            0x59da4f python3() [0x59da4f]
23            0x599513 python3() [0x599513]
24            0x5493be _PyObject_MakeTpCall + 318
25            0x5d68bf _PyEval_EvalFrameDefault + 2783
26            0x54ac0a _PyObject_Call_Prepend + 394
27            0x59da4f python3() [0x59da4f]
28            0x599513 python3() [0x599513]
29            0x5493be _PyObject_MakeTpCall + 318
30            0x5d68bf _PyEval_EvalFrameDefault + 2783
31            0x54cea2 python3() [0x54cea2]
32            0x5db1e4 _PyEval_EvalFrameDefault + 21508
33            0x5d4dab PyEval_EvalCode + 347
34            0x5d2bac python3() [0x5d2bac]
35            0x5818ed python3() [0x5818ed]
36            0x549cf5 PyObject_Vectorcall + 53
37            0x5d68bf _PyEval_EvalFrameDefault + 2783
38            0x6bc192 python3() [0x6bc192]
39            0x6bbdc2 Py_RunMain + 562
40            0x6bba2d Py_BytesMain + 45
41      0x79bed11c21ca /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x79bed11c21ca]
42      0x79bed11c228b __libc_start_main + 139
43            0x656a35 _start + 37
[11/18/2025-09:33:30] [TRT-LLM] [RANK 0] [E] Failed to initialize executor on rank 0: [TensorRT-LLM][ERROR] Assertion failed: Failed, NCCL error ../tensorrt_llm/runtime/ncclCommunicator.cpp:90 'invalid usage (run with NCCL_DEBUG=WARN for details)'
 (../tensorrt_llm/runtime/ncclCommunicator.cpp:90)
1       0x7a14a782a0f4 tensorrt_llm::common::throwRuntimeError(char const*, int, char const*) + 76
2       0x7a14577171ec tensorrt_llm::runtime::NcclCommunicator::createComm(int, int, tensorrt_llm::mpi::MpiComm const&) + 316
3       0x7a14a7bd0e3e torch_ext::NcclCommunicatorOp::NcclCommunicatorOp(long, long) + 110
4       0x7a14a7bd2ab6 std::_Function_handler<void (std::vector<c10::IValue, std::allocator<c10::IValue> >&), torch::class_<torch_ext::NcclCommunicatorOp>::defineMethod<torch::class_<torch_ext::NcclCommunicatorOp>::def<long, long>(torch::detail::types<void, long, long>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::initializer_list<torch::arg>)::{lambda(c10::tagged_capsule<torch_ext::NcclCommunicatorOp>, long, long)#1}>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, torch::class_<torch_ext::NcclCommunicatorOp>::def<long, long>(torch::detail::types<void, long, long>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::initializer_list<torch::arg>)::{lambda(c10::tagged_capsule<torch_ext::NcclCommunicatorOp>, long, long)#1}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::initializer_list<torch::arg>)::{lambda(std::vector<c10::IValue, std::allocator<c10::IValue> >&)#1}>::_M_invoke(std::_Any_data const&, std::vector<c10::IValue, std::allocator<c10::IValue> >&) + 118
5       0x7a16b8744444 /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so(+0x95d444) [0x7a16b8744444]
6       0x7a16b8744cc5 /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so(+0x95dcc5) [0x7a16b8744cc5]
7       0x7a16b874231e /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so(+0x95b31e) [0x7a16b874231e]
8       0x7a16b8745090 /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so(+0x95e090) [0x7a16b8745090]
9       0x7a16b816b37d /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so(+0x38437d) [0x7a16b816b37d]
10            0x581a6f python3() [0x581a6f]
11            0x5492f5 _PyObject_MakeTpCall + 117
12            0x54d017 python3() [0x54d017]
13            0x5a31ea python3() [0x5a31ea]
14            0x5492f5 _PyObject_MakeTpCall + 117
15            0x5d68bf _PyEval_EvalFrameDefault + 2783
16            0x54ab42 _PyObject_Call_Prepend + 194
17            0x59da4f python3() [0x59da4f]
18            0x599513 python3() [0x599513]
19            0x5492f5 _PyObject_MakeTpCall + 117
20            0x5d68bf _PyEval_EvalFrameDefault + 2783
21            0x54ac0a _PyObject_Call_Prepend + 394
22            0x59da4f python3() [0x59da4f]
23            0x599513 python3() [0x599513]
24            0x5493be _PyObject_MakeTpCall + 318
25            0x5d68bf _PyEval_EvalFrameDefault + 2783
26            0x54ac0a _PyObject_Call_Prepend + 394
27            0x59da4f python3() [0x59da4f]
28            0x599513 python3() [0x599513]
29            0x5493be _PyObject_MakeTpCall + 318
30            0x5d68bf _PyEval_EvalFrameDefault + 2783
31            0x54cea2 python3() [0x54cea2]
32            0x6f7a8c python3() [0x6f7a8c]
33            0x6b862c python3() [0x6b862c]
34      0x7a16c4b19aa4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7a16c4b19aa4]
35      0x7a16c4ba6c6c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c6c) [0x7a16c4ba6c6c]
[11/18/2025-09:33:30] [TRT-LLM] [RANK 1] [E] Traceback (most recent call last):
  File "/opt/dynamo/venv/lib/python3.12/site-packages/tensorrt_llm/executor/worker.py", line 852, in worker_main
    worker: GenerationExecutorWorker = worker_cls(
                                       ^^^^^^^^^^^
  File "/opt/dynamo/venv/lib/python3.12/site-packages/tensorrt_llm/executor/worker.py", line 177, in __init__
    self.engine = _create_py_executor(
                  ^^^^^^^^^^^^^^^^^^^^
  File "/opt/dynamo/venv/lib/python3.12/site-packages/tensorrt_llm/executor/worker.py", line 149, in _create_py_executor
    _executor = create_executor(**args)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dynamo/venv/lib/python3.12/site-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 262, in create_py_executor
    model_engine = PyTorchModelEngine(
                   ^^^^^^^^^^^^^^^^^^^
  File "/opt/dynamo/venv/lib/python3.12/site-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 292, in __init__
    init_pp_comm(mapping)
  File "/opt/dynamo/venv/lib/python3.12/site-packages/tensorrt_llm/_torch/distributed/communicator.py", line 247, in init_pp_comm
    _pp_comm = PPComm(mapping)
               ^^^^^^^^^^^^^^^
  File "/opt/dynamo/venv/lib/python3.12/site-packages/tensorrt_llm/_torch/distributed/communicator.py", line 225, in __init__
    self.nccl_comm = torch.classes.trtllm.NcclCommunicatorOp(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Failed, NCCL error ../tensorrt_llm/runtime/ncclCommunicator.cpp:90 'invalid usage (run with NCCL_DEBUG=WARN for details)'
```
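
The key line in the log above is the NCCL warning `Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 100000`. As a rough way to confirm this from the `ncclCommInitRank` lines, the sketch below groups ranks by host and PCI bus ID. The regular expression is written against the line format seen in this log only; `ranks_by_gpu` is a hypothetical helper, and other NCCL versions may format these lines differently.

```python
import re
from collections import defaultdict

# Matches lines such as:
# host:42:982 [0] NCCL INFO ncclCommInitRank comm 0x... rank 0 nranks 2
#   cudaDev 0 nvmlDev 0 busId 100000 commId 0x... - Init START
INIT_RE = re.compile(
    r"^(?P<host>[^\s:]+):\d+:\d+ \[\d+\] NCCL INFO ncclCommInitRank "
    r"comm \S+ rank (?P<rank>\d+) nranks \d+ cudaDev \d+ nvmlDev \d+ "
    r"busId (?P<bus>[0-9a-fA-F]+)",
    re.MULTILINE,
)


def ranks_by_gpu(log_text: str) -> dict[tuple[str, str], list[int]]:
    """Group NCCL ranks by (host, busId) as reported at communicator init."""
    groups: dict[tuple[str, str], set[int]] = defaultdict(set)
    for m in INIT_RE.finditer(log_text):
        groups[(m.group("host"), m.group("bus"))].add(int(m.group("rank")))
    return {gpu: sorted(ranks) for gpu, ranks in groups.items()}


# Two lines copied from the log in this report.
sample = "\n".join([
    "test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO ncclCommInitRank comm "
    "0x7a12257c8780 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 100000 "
    "commId 0xaa9becea043c4e6c - Init START",
    "test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO ncclCommInitRank comm "
    "0x397e3800 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 100000 "
    "commId 0xaa9becea043c4e6c - Init START",
])

duplicates = {gpu: r for gpu, r in ranks_by_gpu(sample).items() if len(r) > 1}
print(duplicates)
```

For the two lines above, the only entry is the GPU at bus ID `100000` on `test-trt-prefill-worker-0-0`, shared by ranks 0 and 1. That is consistent with a world size of 2 running in a pod that exposes a single GPU.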

### How would you like to use TensorRT-LLM

I want to run inference of a [specific model](put Hugging Face link here). I don't know how to integrate it with TensorRT-LLM or optimize it for my use case.

**Specific questions:**
- Model:
- Use case (e.g., chatbot, batch inference, real-time serving):
- Expected throughput/latency requirements:
- Multi-GPU setup needed:


### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and checked the [documentation](https://nvidia.github.io/TensorRT-LLM/) and [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) for answers to frequently asked questions.
