Open
Labels
- Pytorch: Pytorch backend related issues
- Scale-out: Multi-GPU and distributed inference scaling issues, tensor/pipeline/data parallelism
- question: Further information is requested
- waiting for feedback
Description
System Info
I'm running Dynamo with the TensorRT-LLM backend and set tensor_parallel_size=2, pipeline_parallel_size=1, and gpus_per_node=1. Both the prefill and decode processes fail to start.
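To illustrate why this combination may be a problem, here is a minimal sketch. It assumes that each rank selects its CUDA device as `rank % gpus_per_node` and that all ranks are launched on the same host. Both are assumptions for illustration, not confirmed TensorRT-LLM behavior, and the function names are hypothetical.

```python
def local_device(rank: int, gpus_per_node: int) -> int:
    # Assumed mapping (not confirmed from the TensorRT-LLM source):
    # each rank picks CUDA device `rank % gpus_per_node`.
    return rank % gpus_per_node


def single_host_collisions(world_size: int, gpus_per_node: int) -> dict:
    """Return {device: [ranks]} for devices claimed by more than one rank,
    assuming every rank runs on the same host."""
    by_device = {}
    for rank in range(world_size):
        by_device.setdefault(local_device(rank, gpus_per_node), []).append(rank)
    return {dev: ranks for dev, ranks in by_device.items() if len(ranks) > 1}


# Values from the description: two ranks in total, one GPU per node.
tp, pp, gpus_per_node = 2, 1, 1
print(single_host_collisions(tp * pp, gpus_per_node))  # {0: [0, 1]}
```

Under these assumptions, two ranks with gpus_per_node=1 both land on device 0, while gpus_per_node=2 would give each rank its own device.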
env info:
TensorRT-LLM version: 1.1.0rc5
GPU: A100
cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
error logs:
[2025-11-18 09:33:24] WARNING __init__.py:56: dynamo.nixl_connect: Failed to load CuPy for GPU acceleration, utilizing numpy to provide CPU based operations.
2025-11-18T09:33:24.094006Z WARN dynamo_runtime::config: DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS is deprecated and no longer used. System health is now determined by endpoints that register with health check payloads. Please update your configuration to register health check payloads directly on endpoints.
2025-11-18T09:33:24.519528Z WARN dynamo_runtime::config: DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS is deprecated and no longer used. System health is now determined by endpoints that register with health check payloads. Please update your configuration to register health check payloads directly on endpoints.
2025-11-18T09:33:24.520302Z INFO dynamo_runtime::system_status_server: [spawn_system_status_server] binding to: 0.0.0.0:9090
2025-11-18T09:33:24.520339Z INFO dynamo_runtime::system_status_server: [spawn_system_status_server] system status server bound to: 0.0.0.0:9090
2025-11-18T09:33:24.520365Z INFO dynamo_runtime::distributed: System status server started successfully on 0.0.0.0:9090
2025-11-18T09:33:24.520753Z INFO main.worker: Signal handlers set up for graceful shutdown
2025-11-18T09:33:24.521962Z INFO main.init: Initializing the worker with config: Config(namespace=test-trt, component=tensorrt_llm_next, endpoint=generate, model_path=Qwen/Qwen3-0.6B, served_model_name=Qwen/Qwen3-0.6B, tensor_parallel_size=1, pipeline_parallel_size=1, expert_parallel_size=None, kv_block_size=32, gpus_per_node=None, max_batch_size=2048, max_num_tokens=8192, max_seq_len=None, max_beam_width=1, free_gpu_memory_fraction=None, extra_engine_args=engine_configs/prefill.yaml, override_engine_args=, migration_limit=0, publish_events_and_metrics=False, disaggregation_mode=DisaggregationMode.PREFILL, disaggregation_strategy=DisaggregationStrategy.DECODE_FIRST, next_endpoint=, encode_endpoint=, modality=text, allowed_local_media_path=, max_file_size_mb=50, reasoning_parser=None, tool_call_parser=None, dump_config_to=None,custom_jinja_template=None
2025-11-18T09:33:24.525393Z INFO main.init: TensorRT-LLM engine args: {'model': 'Qwen/Qwen3-0.6B', 'scheduler_config': SchedulerConfig(capacity_scheduler_policy=<CapacitySchedulerPolicy.GUARANTEED_NO_EVICT: 'GUARANTEED_NO_EVICT'>, context_chunking_policy=None, dynamic_batch_config=DynamicBatchConfig(enable_batch_size_tuning=True, enable_max_num_tokens_tuning=False, dynamic_batch_moving_average_window=128)), 'tensor_parallel_size': 1, 'pipeline_parallel_size': 2, 'moe_expert_parallel_size': 1, 'backend': 'pytorch', 'skip_tokenizer_init': True, 'build_config': BuildConfig(max_input_len=1024, max_seq_len=None, opt_batch_size=8, max_batch_size=2048, max_beam_width=1, max_num_tokens=8192, opt_num_tokens=None, max_prompt_embedding_table_size=0, kv_cache_type=None, gather_context_logits=False, gather_generation_logits=False, strongly_typed=True, force_num_profiles=None, profiling_verbosity='layer_names_only', enable_debug_output=False, max_draft_len=0, speculative_decoding_mode=<SpeculativeDecodingMode.NONE: 1>, use_refit=False, input_timing_cache=None, output_timing_cache='model.cache', lora_config=LoraConfig(lora_dir=[], lora_ckpt_source='hf', max_lora_rank=64, lora_target_modules=[], trtllm_modules_to_hf_modules={}, max_loras=None, max_cpu_loras=None, swap_gate_up_proj_lora_b_weight=True), auto_parallel_config=AutoParallelConfig(world_size=1, gpus_per_node=8, cluster_key=None, cluster_info=None, sharding_cost_model=<CostModel.ALPHA_BETA: 'alpha_beta'>, comm_cost_model=<CostModel.ALPHA_BETA: 'alpha_beta'>, enable_pipeline_parallelism=False, enable_shard_unbalanced_shape=False, enable_shard_dynamic_shape=False, enable_reduce_scatter=True, builder_flags=None, debug_mode=False, infer_shape=True, validation_mode=False, same_buffer_io={}, same_spec_io={}, sharded_io_allowlist=[], fill_weights=False, parallel_config_cache=None, profile_cache=None, dump_path=None, debug_outputs=[]), weight_sparsity=False, weight_streaming=False, plugin_config=PluginConfig(_dtype='float16', 
_bert_attention_plugin='auto', _gpt_attention_plugin='auto', _gemm_plugin=None, _explicitly_disable_gemm_plugin=False, _gemm_swiglu_plugin=None, _fp8_rowwise_gemm_plugin=None, _qserve_gemm_plugin=None, _identity_plugin=None, _nccl_plugin='auto', _lora_plugin=None, _dora_plugin=False, _weight_only_groupwise_quant_matmul_plugin=None, _weight_only_quant_matmul_plugin=None, _smooth_quant_plugins=True, _smooth_quant_gemm_plugin=None, _layernorm_quantization_plugin=None, _rmsnorm_quantization_plugin=None, _quantize_per_token_plugin=False, _quantize_tensor_plugin=False, _moe_plugin='auto', _mamba_conv1d_plugin='auto', _low_latency_gemm_plugin=None, _low_latency_gemm_swiglu_plugin=None, _gemm_allreduce_plugin=None, _context_fmha=True, _bert_context_fmha_fp32_acc=False, _paged_kv_cache=None, _remove_input_padding=True, _norm_quant_fusion=False, _reduce_fusion=False, _user_buffer=False, _tokens_per_block=32, _use_paged_context_fmha=True, _use_fp8_context_fmha=True, _fuse_fp4_quant=False, _multiple_profiles=False, _paged_state=True, _streamingllm=False, _manage_weights=False, _use_fused_mlp=True, _pp_reduce_scatter=False), use_strip_plan=False, max_encoder_input_len=1024, dry_run=False, visualize_network=None, monitor_memory=False, use_mrope=False), 'kv_cache_config': {'free_gpu_memory_fraction': 0.8}, 'gpus_per_node': 1, 'max_num_tokens': 8192, 'max_seq_len': None, 'max_beam_width': 1, 'max_batch_size': 2048, 'return_perf_metrics': False, 'enable_attention_dp': False, 'trust_remote_code': True, 'enable_chunked_prefill': True, 'disable_overlap_scheduler': True, 'cache_transceiver_config': {'backend': 'DEFAULT'}}
2025-11-18T09:33:27.122539Z INFO main.init: Initializing NIXL Connect.
2025-11-18 09:33:27 NIXL INFO _api.py:361 Backend UCX was instantiated
2025-11-18 09:33:27 NIXL INFO _api.py:251 Initialized NIXL agent: aeaf719c06814f99953f6cf3dc3c5794
[11/18/2025-09:33:27] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: info
[11/18/2025-09:33:27] [TRT-LLM] [I] Using LLM with PyTorch backend
[11/18/2025-09:33:27] [TRT-LLM] [I] Set nccl_plugin to None.
[11/18/2025-09:33:27] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
[11/18/2025-09:33:27] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: info
[11/18/2025-09:33:27] [TRT-LLM] [I] start MpiSession with 2 workers
rank 0 using MpiCommSession to bind to external MPI processes
[11/18/2025-09:33:29] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
[11/18/2025-09:33:29] [TRT-LLM] [I] Generating a new HMAC key for server worker_init_status_queue
[11/18/2025-09:33:29] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
[11/18/2025-09:33:29] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue
[11/18/2025-09:33:29] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_queue
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[11/18/2025-09:33:30] [TRT-LLM] [RANK 0] [I] PyTorchConfig(extra_resource_managers={}, use_cuda_graph=True, cuda_graph_batch_sizes=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 64, 128], cuda_graph_max_batch_size=128, cuda_graph_padding_enabled=False, disable_overlap_scheduler=True, moe_max_num_tokens=None, moe_load_balancer=None, attention_dp_enable_balance=False, attention_dp_time_out_iters=50, attention_dp_batching_wait_iters=10, batch_wait_timeout_ms=0, attn_backend='TRTLLM', moe_backend='CUTLASS', moe_disable_finalize_fusion=False, enable_mixed_sampler=False, sampler_type=<SamplerType.auto: 'auto'>, kv_cache_dtype='auto', mamba_ssm_cache_dtype='auto', enable_iter_perf_stats=False, enable_iter_req_stats=False, print_iter_log=False, torch_compile_enabled=False, torch_compile_fullgraph=True, torch_compile_inductor_enabled=False, torch_compile_piecewise_cuda_graph=False, torch_compile_piecewise_cuda_graph_num_tokens=None, torch_compile_enable_userbuffers=True, torch_compile_max_num_streams=1, enable_autotuner=True, enable_layerwise_nvtx_marker=False, load_format=<LoadFormat.AUTO: 0>, enable_min_latency=False, allreduce_strategy='AUTO', stream_interval=1, force_dynamic_quantization=False, mm_encoder_only=False, _limit_torch_cuda_mem_fraction=True)
[11/18/2025-09:33:30] [TRT-LLM] [RANK 0] [I] ATTENTION RUNTIME FEATURES: AttentionRuntimeFeatures(chunked_prefill=True, cache_reuse=True, has_speculative_draft_tokens=False, chunk_size=8192, chunked_prefill_buffer_batch_size=4)
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO Bootstrap: Using eth0:10.0.133.59<0>
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO cudaDriverVersion 12090
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO cudaDriverVersion 12090
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO NCCL version 2.27.5+cuda12.9
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO Bootstrap: Using eth0:10.0.133.59<0>
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO NCCL version 2.27.5+cuda12.9
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so.
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO NET/IB : No device found.
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO NET/IB : Using [RO]; OOB eth0:10.0.133.59<0>
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.133.59<0>
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO Initialized NET plugin Socket
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO Assigned NET plugin Socket to comm
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO Using network Socket
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO ncclCommInitRank comm 0x7a12257c8780 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 100000 commId 0xaa9becea043c4e6c - Init START
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so.
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO NET/IB : No device found.
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO NET/IB : Using [RO]; OOB eth0:10.0.133.59<0>
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.133.59<0>
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO Initialized NET plugin Socket
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO Assigned NET plugin Socket to comm
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO Using network Socket
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO ncclCommInitRank comm 0x397e3800 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 100000 commId 0xaa9becea043c4e6c - Init START
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO RAS client listening socket at ::1<28028>
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO RAS client listening socket at ::1<28028>
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO Bootstrap timings total 0.000994 (create 0.000044, send 0.000122, recv 0.000272, ring 0.000038, delay 0.000000)
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO Bootstrap timings total 0.011595 (create 0.000042, send 0.000138, recv 0.010765, ring 0.000036, delay 0.000000)
[2025-11-18 09:33:30] test-trt-prefill-worker-0-0:42:982 [0] init.cc:737 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 100000
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO init.cc:1449 -> 5
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO init.cc:1832 -> 5
[2025-11-18 09:33:30] test-trt-prefill-worker-0-0:43:43 [0] init.cc:737 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 100000
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO init.cc:1449 -> 5
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO init.cc:1832 -> 5
test-trt-prefill-worker-0-0:43:43 [0] NCCL INFO init.cc:1858 -> 5
test-trt-prefill-worker-0-0:42:982 [0] NCCL INFO init.cc:1858 -> 5
[11/18/2025-09:33:30] [TRT-LLM] [RANK 1] [E] Failed to initialize executor on rank 1: [TensorRT-LLM][ERROR] Assertion failed: Failed, NCCL error ../tensorrt_llm/runtime/ncclCommunicator.cpp:90 'invalid usage (run with NCCL_DEBUG=WARN for details)'
(../tensorrt_llm/runtime/ncclCommunicator.cpp:90)
1 0x79bcb3a2a0f4 tensorrt_llm::common::throwRuntimeError(char const*, int, char const*) + 76
2 0x79bc639171ec tensorrt_llm::runtime::NcclCommunicator::createComm(int, int, tensorrt_llm::mpi::MpiComm const&) + 316
3 0x79bcb3dd0e3e torch_ext::NcclCommunicatorOp::NcclCommunicatorOp(long, long) + 110
4 0x79bcb3dd2ab6 std::_Function_handler<void (std::vector<c10::IValue, std::allocator<c10::IValue> >&), torch::class_<torch_ext::NcclCommunicatorOp>::defineMethod<torch::class_<torch_ext::NcclCommunicatorOp>::def<long, long>(torch::detail::types<void, long, long>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::initializer_list<torch::arg>)::{lambda(c10::tagged_capsule<torch_ext::NcclCommunicatorOp>, long, long)#1}>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, torch::class_<torch_ext::NcclCommunicatorOp>::def<long, long>(torch::detail::types<void, long, long>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::initializer_list<torch::arg>)::{lambda(c10::tagged_capsule<torch_ext::NcclCommunicatorOp>, long, long)#1}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::initializer_list<torch::arg>)::{lambda(std::vector<c10::IValue, std::allocator<c10::IValue> >&)#1}>::_M_invoke(std::_Any_data const&, std::vector<c10::IValue, std::allocator<c10::IValue> >&) + 118
5 0x79bec4e29444 /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so(+0x95d444) [0x79bec4e29444]
6 0x79bec4e29cc5 /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so(+0x95dcc5) [0x79bec4e29cc5]
7 0x79bec4e2731e /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so(+0x95b31e) [0x79bec4e2731e]
8 0x79bec4e2a090 /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so(+0x95e090) [0x79bec4e2a090]
9 0x79bec485037d /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so(+0x38437d) [0x79bec485037d]
10 0x581a6f python3() [0x581a6f]
11 0x5492f5 _PyObject_MakeTpCall + 117
12 0x54d017 python3() [0x54d017]
13 0x5a31ea python3() [0x5a31ea]
14 0x5492f5 _PyObject_MakeTpCall + 117
15 0x5d68bf _PyEval_EvalFrameDefault + 2783
16 0x54ab42 _PyObject_Call_Prepend + 194
17 0x59da4f python3() [0x59da4f]
18 0x599513 python3() [0x599513]
19 0x5492f5 _PyObject_MakeTpCall + 117
20 0x5d68bf _PyEval_EvalFrameDefault + 2783
21 0x54ac0a _PyObject_Call_Prepend + 394
22 0x59da4f python3() [0x59da4f]
23 0x599513 python3() [0x599513]
24 0x5493be _PyObject_MakeTpCall + 318
25 0x5d68bf _PyEval_EvalFrameDefault + 2783
26 0x54ac0a _PyObject_Call_Prepend + 394
27 0x59da4f python3() [0x59da4f]
28 0x599513 python3() [0x599513]
29 0x5493be _PyObject_MakeTpCall + 318
30 0x5d68bf _PyEval_EvalFrameDefault + 2783
31 0x54cea2 python3() [0x54cea2]
32 0x5db1e4 _PyEval_EvalFrameDefault + 21508
33 0x5d4dab PyEval_EvalCode + 347
34 0x5d2bac python3() [0x5d2bac]
35 0x5818ed python3() [0x5818ed]
36 0x549cf5 PyObject_Vectorcall + 53
37 0x5d68bf _PyEval_EvalFrameDefault + 2783
38 0x6bc192 python3() [0x6bc192]
39 0x6bbdc2 Py_RunMain + 562
40 0x6bba2d Py_BytesMain + 45
41 0x79bed11c21ca /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x79bed11c21ca]
42 0x79bed11c228b __libc_start_main + 139
43 0x656a35 _start + 37
[11/18/2025-09:33:30] [TRT-LLM] [RANK 0] [E] Failed to initialize executor on rank 0: [TensorRT-LLM][ERROR] Assertion failed: Failed, NCCL error ../tensorrt_llm/runtime/ncclCommunicator.cpp:90 'invalid usage (run with NCCL_DEBUG=WARN for details)'
(../tensorrt_llm/runtime/ncclCommunicator.cpp:90)
1 0x7a14a782a0f4 tensorrt_llm::common::throwRuntimeError(char const*, int, char const*) + 76
2 0x7a14577171ec tensorrt_llm::runtime::NcclCommunicator::createComm(int, int, tensorrt_llm::mpi::MpiComm const&) + 316
3 0x7a14a7bd0e3e torch_ext::NcclCommunicatorOp::NcclCommunicatorOp(long, long) + 110
4 0x7a14a7bd2ab6 std::_Function_handler<void (std::vector<c10::IValue, std::allocator<c10::IValue> >&), torch::class_<torch_ext::NcclCommunicatorOp>::defineMethod<torch::class_<torch_ext::NcclCommunicatorOp>::def<long, long>(torch::detail::types<void, long, long>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::initializer_list<torch::arg>)::{lambda(c10::tagged_capsule<torch_ext::NcclCommunicatorOp>, long, long)#1}>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, torch::class_<torch_ext::NcclCommunicatorOp>::def<long, long>(torch::detail::types<void, long, long>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::initializer_list<torch::arg>)::{lambda(c10::tagged_capsule<torch_ext::NcclCommunicatorOp>, long, long)#1}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::initializer_list<torch::arg>)::{lambda(std::vector<c10::IValue, std::allocator<c10::IValue> >&)#1}>::_M_invoke(std::_Any_data const&, std::vector<c10::IValue, std::allocator<c10::IValue> >&) + 118
5 0x7a16b8744444 /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so(+0x95d444) [0x7a16b8744444]
6 0x7a16b8744cc5 /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so(+0x95dcc5) [0x7a16b8744cc5]
7 0x7a16b874231e /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so(+0x95b31e) [0x7a16b874231e]
8 0x7a16b8745090 /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so(+0x95e090) [0x7a16b8745090]
9 0x7a16b816b37d /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so(+0x38437d) [0x7a16b816b37d]
10 0x581a6f python3() [0x581a6f]
11 0x5492f5 _PyObject_MakeTpCall + 117
12 0x54d017 python3() [0x54d017]
13 0x5a31ea python3() [0x5a31ea]
14 0x5492f5 _PyObject_MakeTpCall + 117
15 0x5d68bf _PyEval_EvalFrameDefault + 2783
16 0x54ab42 _PyObject_Call_Prepend + 194
17 0x59da4f python3() [0x59da4f]
18 0x599513 python3() [0x599513]
19 0x5492f5 _PyObject_MakeTpCall + 117
20 0x5d68bf _PyEval_EvalFrameDefault + 2783
21 0x54ac0a _PyObject_Call_Prepend + 394
22 0x59da4f python3() [0x59da4f]
23 0x599513 python3() [0x599513]
24 0x5493be _PyObject_MakeTpCall + 318
25 0x5d68bf _PyEval_EvalFrameDefault + 2783
26 0x54ac0a _PyObject_Call_Prepend + 394
27 0x59da4f python3() [0x59da4f]
28 0x599513 python3() [0x599513]
29 0x5493be _PyObject_MakeTpCall + 318
30 0x5d68bf _PyEval_EvalFrameDefault + 2783
31 0x54cea2 python3() [0x54cea2]
32 0x6f7a8c python3() [0x6f7a8c]
33 0x6b862c python3() [0x6b862c]
34 0x7a16c4b19aa4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7a16c4b19aa4]
35 0x7a16c4ba6c6c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c6c) [0x7a16c4ba6c6c]
[11/18/2025-09:33:30] [TRT-LLM] [RANK 1] [E] Traceback (most recent call last):
File "/opt/dynamo/venv/lib/python3.12/site-packages/tensorrt_llm/executor/worker.py", line 852, in worker_main
worker: GenerationExecutorWorker = worker_cls(
^^^^^^^^^^^
File "/opt/dynamo/venv/lib/python3.12/site-packages/tensorrt_llm/executor/worker.py", line 177, in __init__
self.engine = _create_py_executor(
^^^^^^^^^^^^^^^^^^^^
File "/opt/dynamo/venv/lib/python3.12/site-packages/tensorrt_llm/executor/worker.py", line 149, in _create_py_executor
_executor = create_executor(**args)
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/dynamo/venv/lib/python3.12/site-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 262, in create_py_executor
model_engine = PyTorchModelEngine(
^^^^^^^^^^^^^^^^^^^
File "/opt/dynamo/venv/lib/python3.12/site-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 292, in __init__
init_pp_comm(mapping)
File "/opt/dynamo/venv/lib/python3.12/site-packages/tensorrt_llm/_torch/distributed/communicator.py", line 247, in init_pp_comm
_pp_comm = PPComm(mapping)
^^^^^^^^^^^^^^^
File "/opt/dynamo/venv/lib/python3.12/site-packages/tensorrt_llm/_torch/distributed/communicator.py", line 225, in __init__
self.nccl_comm = torch.classes.trtllm.NcclCommunicatorOp(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Failed, NCCL error ../tensorrt_llm/runtime/ncclCommunicator.cpp:90 'invalid usage (run with NCCL_DEBUG=WARN for details)'
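The final assertion only reports 'invalid usage'; the more specific hint in the log above is the earlier `NCCL WARN Duplicate GPU detected` line. As a rough sketch, a log like this can be scanned for those warnings with a regular expression. The pattern is written against the two warning lines in this log only, and the helper name is hypothetical.

```python
import re

# Pattern derived from the warning lines in this log; other NCCL versions
# may word the message differently.
DUP_RE = re.compile(
    r"NCCL WARN Duplicate GPU detected : rank (\d+) and rank (\d+) "
    r"both on CUDA device (\w+)"
)


def find_duplicate_gpu_warnings(log_text: str) -> list:
    """Return sorted, de-duplicated (rank_a, rank_b, device_id) tuples."""
    found = set()
    for match in DUP_RE.finditer(log_text):
        rank_a, rank_b = sorted((int(match.group(1)), int(match.group(2))))
        found.add((rank_a, rank_b, match.group(3)))
    return sorted(found)


sample = (
    "[2025-11-18 09:33:30] test-trt-prefill-worker-0-0:42:982 [0] init.cc:737 "
    "NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 100000\n"
    "[2025-11-18 09:33:30] test-trt-prefill-worker-0-0:43:43 [0] init.cc:737 "
    "NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 100000\n"
)
print(find_duplicate_gpu_warnings(sample))  # [(0, 1, '100000')]
```

Both ranks report the same pair, so the helper collapses them into a single entry.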