Within the PyTorch 2.7 NxD Training virtual environment, we have included a setup script that installs the required dependencies for the package. To run this script, activate the virtual environment and run ``setup_nxdt.sh``, which performs :ref:`the setup steps here <nxdt_installation_guide>`.
You can easily get started with the multi-framework DLAMI through the AWS console by following this :ref:`setup guide <setup-ubuntu22-multi-framework-dlami>`. If you are looking to use a single framework instead, the supported single framework DLAMIs are listed below.
Single Framework DLAMIs supported
---------------------------------

.. list-table::
   :header-rows: 1

   * - Framework
     - Operating System
     - Neuron Instances Supported
     - DLAMI Name
   * - PyTorch 2.7
     - Ubuntu 22.04
     - Inf2, Trn1, Trn1n, Trn2
     - Deep Learning AMI Neuron PyTorch 2.7 (Ubuntu 22.04)
   * - PyTorch 2.7
     - Amazon Linux 2023
     - Inf2, Trn1, Trn1n, Trn2
     - Deep Learning AMI Neuron PyTorch 2.7 (Amazon Linux 2023)
   * - JAX 0.6
     - Ubuntu 22.04
     - Inf2, Trn1, Trn1n, Trn2
     - Deep Learning AMI Neuron JAX 0.6 (Ubuntu 22.04)
   * - JAX 0.6
     - Amazon Linux 2023
     - Inf2, Trn1, Trn1n, Trn2
     - Deep Learning AMI Neuron JAX 0.6 (Amazon Linux 2023)
If precompilation was not done, the first execution of ``./run.sh`` will be slower due to serial compilations. Rerunning the same script a second time will be faster because the compiled graphs will already be stored in the persistent cache.
Paste the following script into your terminal to create a ``run_2w.sh`` file.
During the run, you will notice that the "Total train batch size" is now 16 and the "Total optimization steps" is now half the number for single-worker training.
Paste the following script into your terminal to create a ``run_converted.sh`` file.
If this is the first time running with the ``bert-large-uncased`` model, or if hyperparameters have changed, the optional one-time precompilation step can save compilation time.
The following are currently known issues:
- Variable input sizes: When fine-tuning models such as dslim/bert-base-NER using the `token-classification example <https://github.com/huggingface/transformers/tree/main/examples/pytorch/token-classification>`__, you may encounter timeouts (many "socket.h:524 CCOM WARN Timeout waiting for RX" messages) and execution hangs. This occurs because the NER dataset has samples of different sizes, which causes many recompilations and compiled-graph (NEFF) reloads. Furthermore, different data-parallel workers can execute different compiled graphs. This multiple-program multiple-data behavior is currently unsupported. To work around this issue, please pad to the maximum length using the Trainer API option ``--pad_to_max_length``.
- When running HuggingFace GPT fine-tuning with transformers version >= 4.21.0 and using ``XLA_USE_BF16=1`` or ``XLA_DOWNCAST_BF16=1``, you might see NaNs in the loss immediately at the first step. This issue occurs due to large negative constants used to implement attention masking (https://github.com/huggingface/transformers/pull/17306). To work around this issue, please use transformers version <= 4.20.0.
- When using the Trainer API option ``--bf16``, you will see "RuntimeError: No CUDA GPUs are available". To work around this error, add ``import torch; torch.cuda.is_bf16_supported = lambda: True`` to the Python script (for example, run_glue.py), as shown in the sketch after this list. (The Trainer API option ``--fp16`` is not yet supported.)
- When using the latest HuggingFace transformers version, you may see "ValueError: Your setup doesn't support bf16/gpu." To fix this, pass ``--use_cpu True`` in your scripts.
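
Below is a minimal sketch of the ``--bf16`` workaround above; the placement (near the top of the example driver script, e.g. ``run_glue.py``, before the ``Trainer`` is created) is an assumption.

.. code:: python

   import torch

   # Work around "RuntimeError: No CUDA GPUs are available" when passing --bf16:
   # the HuggingFace Trainer probes CUDA BF16 support, which fails on Neuron (XLA)
   # devices. Overriding the probe lets the run proceed; XLA handles the BF16 math.
   torch.cuda.is_bf16_supported = lambda: True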
See :ref:`migrate_to_pytorch_2.6` for changes needed to use PyTorch NeuronX 2.6.
How can I install PyTorch NeuronX 2.6?
--------------------------------------------
To install PyTorch NeuronX 2.6, please follow the :ref:`setup-torch-neuronx` guides for Amazon Linux 2023 and Ubuntu 22 AMI. Please also refer to the Neuron multi-framework DLAMI :ref:`setup guide <setup-ubuntu22-multi-framework-dlami>` for Ubuntu 22, which ships with a pre-installed virtual environment for PyTorch NeuronX 2.6 that you can use to get started. PyTorch NeuronX 2.6 can be installed using the following:
.. code::
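
   # Sketch of the install command; exact package pins may differ, see the setup guides above.
   python -m pip install --upgrade neuronx-cc==2.* torch-neuronx==2.6.* --extra-index-url https://pip.repos.neuron.amazonaws.com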
To migrate the training scripts from PyTorch NeuronX 2.5 to PyTorch NeuronX 2.6, make the following changes:
.. note::

   ``xm`` below refers to ``torch_xla.core.xla_model``, ``xr`` refers to ``torch_xla.runtime``, and ``xmp`` refers to ``torch_xla.distributed.xla_multiprocessing``.
* The environment variables ``XLA_DOWNCAST_BF16`` and ``XLA_USE_BF16`` are deprecated (a warning is issued when they are used) and will be removed in an upcoming release. Please switch to automatic mixed precision or use ``model.to(torch.bfloat16)`` to convert the model to BF16 format (see :ref:`migration_from_xla_downcast_bf16`).
* The functions ``xm.xrt_world_size()``, ``xm.get_ordinal()``, and ``xm.get_local_ordinal()`` are deprecated (a warning is issued when they are used). Please switch to ``xr.world_size()``, ``xr.global_ordinal()``, and ``xr.local_ordinal()`` respectively as replacements; see the sketch below.
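
Below is a minimal sketch of these replacements (it assumes only a working ``torch-xla`` installation; the model is a placeholder):

.. code:: python

   import torch
   import torch_xla.runtime as xr

   # Replacements for the deprecated torch_xla.core.xla_model helpers:
   world_size = xr.world_size()     # was: xm.xrt_world_size()
   rank = xr.global_ordinal()       # was: xm.get_ordinal()
   local_rank = xr.local_ordinal()  # was: xm.get_local_ordinal()

   # Instead of XLA_USE_BF16/XLA_DOWNCAST_BF16, cast the model explicitly:
   model = torch.nn.Linear(16, 16).to(torch.bfloat16)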
Warning "XLA_DOWNCAST_BF16 will be deprecated after the 2.6 release, please downcast your model directly"
    Environment variables ``XLA_DOWNCAST_BF16`` and ``XLA_USE_BF16`` are deprecated (a warning is issued when they are used). Please switch to automatic mixed precision or use ``model.to(torch.bfloat16)`` to cast the model to BF16 (see :ref:`migration_from_xla_downcast_bf16`).

AttributeError: <module 'torch_xla.core.xla_model' ... does not have the attribute 'xrt_world_size'
    ``torch_xla.core.xla_model.xrt_world_size()`` is deprecated in torch-xla 2.6 (a warning that it will be removed in a future release) and is removed in torch-xla version 2.7, where calling it raises this error. Please switch to using ``torch_xla.runtime.world_size()`` instead.

AttributeError: <module 'torch_xla.core.xla_model' ... does not have the attribute 'get_ordinal'
    ``torch_xla.core.xla_model.get_ordinal()`` is deprecated in torch-xla 2.6 (a warning that it will be removed in a future release) and is removed in torch-xla version 2.7, where calling it raises this error. Please switch to using ``torch_xla.runtime.global_ordinal()`` instead.

AttributeError: <module 'torch_xla.core.xla_model' ... does not have the attribute 'get_local_ordinal'
    ``torch_xla.core.xla_model.get_local_ordinal()`` is deprecated in torch-xla 2.6 (you will see the warning "get_local_ordinal() is deprecated. Use torch_xla.runtime.local_ordinal instead.") and is removed in torch-xla version 2.7, where calling it raises this error. Please switch to using ``torch_xla.runtime.local_ordinal()`` instead.