Hardware Used: Inferentia2
Training/Inference: Inference
Instance type: inf2.24xlarge and inf2.48xlarge
Release Artifacts: Ubuntu Neuron DLAMI of the latest 2.25.0 SDK release
Model Type: Flux pipeline
Description
Text encoder 2 (T5, tp_size=8) compiled successfully, but the export hangs while compiling the text encoder (CLIP). The export works when I roll the artifacts back to Neuron SDK 2.24.0.
Steps to Reproduce
With optimum-neuron installed:
optimum-cli export neuron --model black-forest-labs/FLUX.1-dev --tensor_parallel_size 8 --batch_size 1 --height 1024 --width 1024 --num_images_per_prompt 1 --sequence_length 512 --torch_dtype bfloat16 flux_dev_neuron/
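For anyone reproducing from a script rather than the CLI, here is a rough Python equivalent. This is only a sketch: it assumes optimum-neuron exposes a NeuronFluxPipeline class whose from_pretrained(export=True, ...) accepts the same shape and parallelism options as the CLI flags above; the class name and keyword arguments are assumptions and have not been verified against the 2.25.0 artifacts.

```python
# Hypothetical script-based equivalent of the optimum-cli export command above.
# Assumptions: NeuronFluxPipeline exists in optimum.neuron and its
# from_pretrained(export=True, ...) accepts the same options as the CLI flags.
import torch
from optimum.neuron import NeuronFluxPipeline  # assumed class name

pipe = NeuronFluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    export=True,                 # compile the submodels for Neuron
    tensor_parallel_size=8,      # mirrors --tensor_parallel_size 8
    batch_size=1,
    height=1024,
    width=1024,
    num_images_per_prompt=1,
    sequence_length=512,
    torch_dtype=torch.bfloat16,
)
pipe.save_pretrained("flux_dev_neuron/")
```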
Attach logs if any
/opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/parallel_layers/layers.py:485: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
(the same FutureWarning appears eight times in the log)
........................................................Completed run_backend_driver.
Compiler status PASS
(Completed run_backend_driver. and Compiler status PASS each appear eight times, interleaved, in the full log)
[Compilation Time] 243.38 seconds.
***** Compiling text_encoder *****
Using Neuron: --optlevel 2
2025-08-11 14:46:14.352544: W neuron/nrt_adaptor.cc:53] nrt_tensor_write_hugepage() is not available, will fall back to nrt_tensor_write().
2025-08-11 14:46:14.352608: W neuron/nrt_adaptor.cc:62] nrt_tensor_read_hugepage() is not available, will fall back to nrt_tensor_read().
2025-Aug-11 14:46:14.0356 5123:24399 [0] int nccl_net_ofi_create_plugin(nccl_net_ofi_plugin_t**):213 CCOM WARN NET/OFI Failed to initialize sendrecv protocol
2025-Aug-11 14:46:14.0358 5123:24399 [0] int nccl_net_ofi_create_plugin(nccl_net_ofi_plugin_t**):354 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Aug-11 14:46:14.0360 5123:24399 [0] ncclResult_t nccl_net_ofi_init_no_atexit_fini_v6(ncclDebugLogger_t):183 CCOM WARN NET/OFI Initializing plugin failed
2025-Aug-11 14:46:14.0363 5123:24399 [0] net_plugin.cc:97 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2025-Aug-11 14:48:14.0469 5123:24399 [0] include/socket.h:556 CCOM WARN Timeout waiting for RX (waited 120 sec) - retrying, [8-rank bootstrap: rank 0 receives its next rank's rootParams from the root./If a connection error or timeout occurs here, root may be unresponsive or not yet be active./-1]
2025-Aug-11 14:48:14.0469 5123:25093 [-1] bootstrap.cc:102 CCOM WARN Timeout waiting for incoming connection (waited 120 sec), [8-rank bootstrap: root accepts and receives from each rank [CommInitRankDev]./If a connection error or timeout occurs here, some other ranks may not yet be active or may be unresponsive. The root already received from 1 out of 8 ranks/-4]
2025-Aug-11 14:50:14.0572 5123:24399 [0] include/socket.h:556 CCOM WARN Timeout waiting for RX (waited 240 sec) - retrying, [8-rank bootstrap: rank 0 receives its next rank's rootParams from the root./If a connection error or timeout occurs here, root may be unresponsive or not yet be active./-1]
2025-Aug-11 14:50:14.0578 5123:25093 [-1] bootstrap.cc:102 CCOM WARN Timeout waiting for incoming connection (waited 240 sec), [8-rank bootstrap: root accepts and receives from each rank [CommInitRankDev]./If a connection error or timeout occurs here, some other ranks may not yet be active or may be unresponsive. The root already received from 1 out of 8 ranks/-4]
2025-Aug-11 14:54:14.0777 5123:24399 [0] include/socket.h:556 CCOM WARN Timeout waiting for RX (waited 480 sec) - retrying, [8-rank bootstrap: rank 0 receives its next rank's rootParams from the root./If a connection error or timeout occurs here, root may be unresponsive or not yet be active./-1]
2025-Aug-11 14:54:14.0783 5123:25093 [-1] bootstrap.cc:102 CCOM WARN Timeout waiting for incoming connection (waited 480 sec), [8-rank bootstrap: root accepts and receives from each rank [CommInitRankDev]./If a connection error or timeout occurs here, some other ranks may not yet be active or may be unresponsive. The root already received from 1 out of 8 ranks/-4]
2025-Aug-11 15:02:15.0183 5123:24399 [0] include/socket.h:556 CCOM WARN Timeout waiting for RX (waited 960 sec) - retrying, [8-rank bootstrap: rank 0 receives its next rank's rootParams from the root./If a connection error or timeout occurs here, root may be unresponsive or not yet be active./-1]
2025-Aug-11 15:02:15.0188 5123:25093 [-1] bootstrap.cc:102 CCOM WARN Timeout waiting for incoming connection (waited 960 sec), [8-rank bootstrap: root accepts and receives from each rank [CommInitRankDev]./If a connection error or timeout occurs here, some other ranks may not yet be active or may be unresponsive. The root already received from 1 out of 8 ranks/-4]