
Tensorizer error for custom model (OpenAI VPT) #1230

@tlovas


Hardware Used

Trainium

Training/Inference

Training

Instance type

trn1.32xlarge

Release Artifacts

NeuronX Compiler version 2.19.8089.0+8ab9f450
Python version 3.10.12
HWM version 2.19.0.8089+8ab9f450
NumPy version 1.26.4
Running on AMI ami-0034e664023874efb

Model Type

Custom model (OpenAI's VPT model with minor changes)

Description

I'm trying to get OpenAI's VPT model to run with dummy data on a Trainium machine. I'm getting a compilation error that I couldn't resolve on my own. Please see the attached log for the exact error. The same script runs on CPU without problems. I followed the PyTorch NeuronX single-worker training/evaluation quick-start, but it's possible I missed something.

Steps to Reproduce

  1. Set up a Trainium instance (trn1.32xlarge) in AWS EC2. I used an Ubuntu machine with the Deep Learning AMI Neuron (Ubuntu 22.04) 20250718.
  2. Clone this repository, which contains the few changes I made: https://github.com/tlovas/VPT-XLA-Experiment/tree/tlovas/XLA-test
  3. Activate the virtual environment with 'source /opt/aws_neuronx_venv_pytorch_2_7/bin/activate' and install the missing packages.
  4. Run the test file I created: test_policy.py.
  5. The model compilation fails at this point.
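
For anyone reproducing this, here is a minimal sketch of how step 4 could be rerun with the two debug variables named in the compiler message ('XLA_IR_DEBUG' and 'XLA_HLO_DEBUG'). Setting them to "1" is an assumption based on common torch_xla usage, and the helper names debug_env and rerun are hypothetical, not part of the repository.

```python
import os
import subprocess
import sys


def debug_env(base=None):
    """Return a copy of the environment with the XLA debug variables set."""
    env = dict(os.environ if base is None else base)
    # Assumption: the value "1" enables each variable.
    env["XLA_IR_DEBUG"] = "1"
    env["XLA_HLO_DEBUG"] = "1"
    return env


def rerun(script="test_policy.py"):
    """Run the test script in a child process with the debug variables set."""
    # check=False: the compile failure is expected, so do not raise on a
    # non-zero exit code; the caller can inspect returncode instead.
    return subprocess.run([sys.executable, script], env=debug_env(), check=False)
```

Calling rerun() from the repository root, inside the activated virtual environment, should reproduce the failure with the extra debug information in the output.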

Attach logs if any

2025-09-03 12:21:32.000893: 8538 ERROR ||NEURON_CC_WRAPPER||: Failed compilation with ['neuronx-cc', 'compile', '--framework=XLA', '/tmp/ubuntu/neuroncc_compile_workdir/a58081d2-90b6-4e74-9030-77537d6195b8/model.MODULE_93028970144824437+e30acd3a.hlo_module.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/a58081d2-90b6-4e74-9030-77537d6195b8/model.MODULE_93028970144824437+e30acd3a.neff', '--target=trn1', '--verbose=35']: 2025-09-03T12:21:32Z [TEN404] (_batch-norm-training.1122) Internal tensorizer error: FlattenMacroLoop:Pelican exception: Value is finalized before all edges are gone at /local/p4clients/pkgbuild-const/workspace/src/KaenaCompiler/neuronxcc/pelican/include/pelican/IR/Value.h:LINE: _users.empty() - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new. You may also be able to obtain more information using the 'XLA_IR_DEBUG' and 'XLA_HLO_DEBUG' environment variables.

2025-09-03 12:21:32.000893: 8538 ERROR ||NEURON_CC_WRAPPER||: Compilation failed for /tmp/ubuntu/neuroncc_compile_workdir/a58081d2-90b6-4e74-9030-77537d6195b8/model.MODULE_93028970144824437+e30acd3a.hlo_module.pb after 0 retries.
Traceback (most recent call last):
  File "/home/ubuntu/vpt/test_policy.py", line 145, in <module>
    main()
  File "/home/ubuntu/vpt/test_policy.py", line 140, in main
    test_vpt_with_xla(0)
  File "/home/ubuntu/vpt/test_policy.py", line 88, in test_vpt_with_xla
    train_vpt(agent, model, optimizer, train_loader, device)
  File "/home/ubuntu/vpt/test_policy.py", line 135, in train_vpt
    xm.mark_step()
  File "/opt/aws_neuronx_venv_pytorch_2_7/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 1051, in mark_step
    torch_xla._XLAC._xla_step_marker(
RuntimeError: Bad StatusOr access: INTERNAL: RunNeuronCCImpl: error condition error != 0: <class 'subprocess.CalledProcessError'>: Command '['neuronx-cc', 'compile', '--framework=XLA', '/tmp/ubuntu/neuroncc_compile_workdir/a58081d2-90b6-4e74-9030-77537d6195b8/model.MODULE_93028970144824437+e30acd3a.hlo_module.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/a58081d2-90b6-4e74-9030-77537d6195b8/model.MODULE_93028970144824437+e30acd3a.neff', '--target=trn1', '--verbose=35']' returned non-zero exit status 70.
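
To make the log above easier to triage, here is a small sketch that pulls the error code and the name in parentheses out of the compiler diagnostic. Reading the name in parentheses as the failing HLO instruction is my assumption. If it holds, '_batch-norm-training' would point at a batch normalization layer running in training mode. The pattern and the function name parse_diag are hypothetical, and the sample line is shortened from the log.

```python
import re

# Matches "[CODE] (name) message:" as it appears in the diagnostic above.
DIAG = re.compile(r"\[(?P<code>[A-Z]+\d+)\]\s+\((?P<op>[^)]+)\)\s+(?P<msg>[^:]+)")


def parse_diag(line):
    """Return the code, op name, and short message, or None if absent."""
    match = DIAG.search(line)
    return match.groupdict() if match else None


# Shortened copy of the diagnostic from the log.
sample = (
    "2025-09-03T12:21:32Z [TEN404] (_batch-norm-training.1122) "
    "Internal tensorizer error: FlattenMacroLoop:Pelican exception: "
    "Value is finalized before all edges are gone"
)

info = parse_diag(sample)
print(info["code"], info["op"], info["msg"])
```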
