# Tensorizer error for custom model (OpenAI VPT) #1230

## Hardware Used

Trainium

## Training/Inference

Training

## Instance type

trn1.32xlarge

## Release Artifacts

NeuronX Compiler version 2.19.8089.0+8ab9f450
Python version 3.10.12
HWM version 2.19.0.8089+8ab9f450
NumPy version 1.26.4
Running on AMI ami-0034e664023874efb

## Model Type

Custom model (OpenAI's VPT model with minor changes)

## Description

I'm trying to get OpenAI's VPT model to run with dummy data on a Trainium machine. Compilation fails with an error that I couldn't resolve on my own; please see the attached log for the exact message. I have run the same code on CPU, where it seems to work. I followed the PyTorch NeuronX single-worker training/evaluation quick-start, but it's possible I have missed something.

## Steps to Reproduce

1. Set up a Trainium instance (trn1.32xlarge) in AWS EC2. I used an Ubuntu machine with the Deep Learning AMI Neuron (Ubuntu 22.04) 20250718.
2. Clone this repository, which contains the few changes I made: https://github.com/tlovas/VPT-XLA-Experiment/tree/tlovas/XLA-test
3. Activate the environment with `source /opt/aws_neuronx_venv_pytorch_2_7/bin/activate`, then install the missing packages.
4. Run the test file I created: test_policy.py.
5. The model compilation should fail at this point.
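
The compiler message in the log names the `XLA_IR_DEBUG` and `XLA_HLO_DEBUG` environment variables as a way to get more information. Below is a minimal sketch of rerunning step 4 with both set. It assumes that the value `1` enables them (the message only names the variables) and that test_policy.py is in the current directory. The helper names `debug_env` and `run_with_debug` are hypothetical and not part of the repository.

```python
import os
import subprocess
import sys


def debug_env(base=None):
    """Return a copy of the environment with the XLA debug variables set."""
    env = dict(os.environ if base is None else base)
    # Assumption: "1" switches these on; the error message only names the variables.
    env["XLA_IR_DEBUG"] = "1"
    env["XLA_HLO_DEBUG"] = "1"
    return env


def run_with_debug(script="test_policy.py"):
    """Run the script in a child process with the debug variables set.

    Returns the child's exit code.
    """
    return subprocess.run([sys.executable, script], env=debug_env()).returncode
```

Calling `run_with_debug()` from the repository root inside the activated virtual environment should reproduce the failure with the extra debug metadata attached to the graph.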

## Attach logs if any

2025-09-03 12:21:32.000893:  8538  ERROR ||NEURON_CC_WRAPPER||: Failed compilation with ['neuronx-cc', 'compile', '--framework=XLA', '/tmp/ubuntu/neuroncc_compile_workdir/a58081d2-90b6-4e74-9030-77537d6195b8/model.MODULE_93028970144824437+e30acd3a.hlo_module.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/a58081d2-90b6-4e74-9030-77537d6195b8/model.MODULE_93028970144824437+e30acd3a.neff', '--target=trn1', '--verbose=35']: 2025-09-03T12:21:32Z [TEN404] (_batch-norm-training.1122) Internal tensorizer error: FlattenMacroLoop:Pelican exception: Value is finalized before all edges are gone at /local/p4clients/pkgbuild-const/workspace/src/KaenaCompiler/neuronxcc/pelican/include/pelican/IR/Value.h:__LINE__: _users.empty() - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new. You may also be able to obtain more information using the 'XLA_IR_DEBUG' and 'XLA_HLO_DEBUG' environment variables.

2025-09-03 12:21:32.000893:  8538  ERROR ||NEURON_CC_WRAPPER||: Compilation failed for /tmp/ubuntu/neuroncc_compile_workdir/a58081d2-90b6-4e74-9030-77537d6195b8/model.MODULE_93028970144824437+e30acd3a.hlo_module.pb after 0 retries.
Traceback (most recent call last):
  File "/home/ubuntu/vpt/test_policy.py", line 145, in <module>
    main()
  File "/home/ubuntu/vpt/test_policy.py", line 140, in main
    test_vpt_with_xla(0)
  File "/home/ubuntu/vpt/test_policy.py", line 88, in test_vpt_with_xla
    train_vpt(agent, model, optimizer, train_loader, device)
  File "/home/ubuntu/vpt/test_policy.py", line 135, in train_vpt
    xm.mark_step()
  File "/opt/aws_neuronx_venv_pytorch_2_7/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 1051, in mark_step
    torch_xla._XLAC._xla_step_marker(
RuntimeError: Bad StatusOr access: INTERNAL: RunNeuronCCImpl: error condition error != 0: <class 'subprocess.CalledProcessError'>: Command '['neuronx-cc', 'compile', '--framework=XLA', '/tmp/ubuntu/neuroncc_compile_workdir/a58081d2-90b6-4e74-9030-77537d6195b8/model.MODULE_93028970144824437+e30acd3a.hlo_module.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/a58081d2-90b6-4e74-9030-77537d6195b8/model.MODULE_93028970144824437+e30acd3a.neff', '--target=trn1', '--verbose=35']' returned non-zero exit status 70.
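
For triage, the error code and the tag in parentheses can be pulled out of the compiler message. The sketch below assumes the `[CODE] (tag)` layout seen in this one log line; it is not based on a documented format, and `parse_neuron_error` is a hypothetical helper. Here the tag is `_batch-norm-training.1122`, which suggests that the failure is tied to a batch-norm operation in training mode.

```python
import re

# Matches e.g. "[TEN404] (_batch-norm-training.1122) Internal tensorizer error".
# Assumption: the layout is inferred from the single log line in this issue.
_ERROR_RE = re.compile(r"\[(?P<code>[A-Z]+\d+)\]\s+\((?P<op>[^)]+)\)")


def parse_neuron_error(log_text):
    """Return (error_code, tag) for the first compiler error found, or None."""
    match = _ERROR_RE.search(log_text)
    if match is None:
        return None
    return match.group("code"), match.group("op")


sample = (
    "2025-09-03T12:21:32Z [TEN404] (_batch-norm-training.1122) "
    "Internal tensorizer error: FlattenMacroLoop:Pelican exception"
)
print(parse_neuron_error(sample))  # ('TEN404', '_batch-norm-training.1122')
```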
