Description
Hardware Used
Trainium
Training/Inference
Training
Instance type
trn1.32xlarge
Release Artifacts
NeuronX Compiler version 2.19.8089.0+8ab9f450
Python version 3.10.12
HWM version 2.19.0.8089+8ab9f450
NumPy version 1.26.4
Running on AMI ami-0034e664023874efb
Model Type
Custom model (OpenAI's VPT model with minor changes)
Description
I'm trying to run OpenAI's VPT model with dummy data on a Trainium machine. Compilation fails with an error I couldn't resolve on my own; please see the attached log for the exact error. The same model seems to work when I run it on CPU. I followed the PyTorch NeuronX single-worker training/evaluation quick-start, but it's possible I have missed something.
Steps to Reproduce
- Set up a Trainium instance (trn1.32xlarge) in AWS EC2. I used an Ubuntu machine with Deep Learning AMI Neuron (Ubuntu 22.04) 20250718.
- Clone this repository, which contains the few changes I made: https://github.com/tlovas/VPT-XLA-Experiment/tree/tlovas/XLA-test
- Activate the environment with 'source /opt/aws_neuronx_venv_pytorch_2_7/bin/activate' and install the missing packages.
- Run the test file I created: test_policy.py.
- The model compilation should fail at this point.
Attach logs if any
2025-09-03 12:21:32.000893: 8538 ERROR ||NEURON_CC_WRAPPER||: Failed compilation with ['neuronx-cc', 'compile', '--framework=XLA', '/tmp/ubuntu/neuroncc_compile_workdir/a58081d2-90b6-4e74-9030-77537d6195b8/model.MODULE_93028970144824437+e30acd3a.hlo_module.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/a58081d2-90b6-4e74-9030-77537d6195b8/model.MODULE_93028970144824437+e30acd3a.neff', '--target=trn1', '--verbose=35']: 2025-09-03T12:21:32Z [TEN404] (_batch-norm-training.1122) Internal tensorizer error: FlattenMacroLoop:Pelican exception: Value is finalized before all edges are gone at /local/p4clients/pkgbuild-const/workspace/src/KaenaCompiler/neuronxcc/pelican/include/pelican/IR/Value.h:LINE: _users.empty() - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new. You may also be able to obtain more information using the 'XLA_IR_DEBUG' and 'XLA_HLO_DEBUG' environment variables.
2025-09-03 12:21:32.000893: 8538 ERROR ||NEURON_CC_WRAPPER||: Compilation failed for /tmp/ubuntu/neuroncc_compile_workdir/a58081d2-90b6-4e74-9030-77537d6195b8/model.MODULE_93028970144824437+e30acd3a.hlo_module.pb after 0 retries.
Traceback (most recent call last):
File "/home/ubuntu/vpt/test_policy.py", line 145, in <module>
main()
File "/home/ubuntu/vpt/test_policy.py", line 140, in main
test_vpt_with_xla(0)
File "/home/ubuntu/vpt/test_policy.py", line 88, in test_vpt_with_xla
train_vpt(agent, model, optimizer, train_loader, device)
File "/home/ubuntu/vpt/test_policy.py", line 135, in train_vpt
xm.mark_step()
File "/opt/aws_neuronx_venv_pytorch_2_7/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 1051, in mark_step
torch_xla._XLAC._xla_step_marker(
RuntimeError: Bad StatusOr access: INTERNAL: RunNeuronCCImpl: error condition error != 0: <class 'subprocess.CalledProcessError'>: Command '['neuronx-cc', 'compile', '--framework=XLA', '/tmp/ubuntu/neuroncc_compile_workdir/a58081d2-90b6-4e74-9030-77537d6195b8/model.MODULE_93028970144824437+e30acd3a.hlo_module.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/a58081d2-90b6-4e74-9030-77537d6195b8/model.MODULE_93028970144824437+e30acd3a.neff', '--target=trn1', '--verbose=35']' returned non-zero exit status 70.
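The error message above suggests the 'XLA_IR_DEBUG' and 'XLA_HLO_DEBUG' environment variables. Below is a minimal sketch of how the run could be repeated with both set. It assumes the value 1 enables them and that the script is started from the already activated environment, so that neuronx-cc stays on the PATH. The helper names with_xla_debug and rerun_with_debug are hypothetical and not part of the repository.

```python
import os
import subprocess
import sys


def with_xla_debug(env=None):
    """Return a copy of env with the two debug variables from the error message set."""
    # Copy so the caller's mapping (or os.environ) is not modified.
    new_env = dict(os.environ if env is None else env)
    # Assumption: "1" is the value that turns these variables on.
    new_env["XLA_IR_DEBUG"] = "1"
    new_env["XLA_HLO_DEBUG"] = "1"
    return new_env


def rerun_with_debug(script="test_policy.py"):
    """Re-run the test script with the current interpreter and the debug variables set."""
    # check=False: a non-zero exit code is expected while the compile error persists.
    return subprocess.run([sys.executable, script], env=with_xla_debug(), check=False)
```

Calling rerun_with_debug() from the directory that contains test_policy.py should reproduce the same failure, possibly with more detail about which part of the model produced the failing batch-norm operation.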