
Commit 800d00f

2.19.1 miscellaneous fixes (#929)
Co-authored-by: Kavish Gandhi <[email protected]>
1 parent c6feb18 commit 800d00f

File tree

5 files changed: +6, -7 lines changed

frameworks/torch/torch-neuronx/training-troubleshooting.rst

Lines changed: 1 addition & 1 deletion
@@ -171,7 +171,7 @@ Currently, NeuronCache default root directory is /var/tmp which is local to the

 .. code:: bash

-    KeyError: 'neff_cache2/neuron-compile-cache/USER_neuroncc-1.0.48875.0+7437fbf18/MODULE_7223055628515330524/MODULE_0_SyncTensorsGraph.14_7223055628515330524_compute1-dy-kaena-training-2-1-e859998e-3035-5df63dab5ce63'
+    KeyError: 'neff_cache2/neuron-compile-cache/USER_neuroncc-1.0.48875.0+7437fbf18/MODULE_7223055628515330524/MODULE_0_SyncTensorsGraph.14_7223055628515330524_compute1-dy-training-2-1-e859998e-3035-5df63dab5ce63'

 This is a result of limitations to file locking on NFS; EFS and FSx exhibit a similar limitation. The workaround is to set up a separate NeuronCache root directory for each worker instance, such as ``NEURON_CC_FLAGS="--cache_dir=$HOME/neuron_cache/bert/`hostname`"``, where the home directory is shared among worker instances, as in ParallelCluster.
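The per-instance cache workaround above can be sketched as follows (a minimal sketch; the ``bert`` path component is illustrative, and any directory shared across workers but unique per host works):

```shell
# Give each worker instance its own NeuronCache root so that
# NFS/EFS/FSx file-locking limitations are avoided.
CACHE_ROOT="$HOME/neuron_cache/bert/$(hostname)"   # unique per host
mkdir -p "$CACHE_ROOT"
export NEURON_CC_FLAGS="--cache_dir=$CACHE_ROOT"
echo "$NEURON_CC_FLAGS"
```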

libraries/neuronx-distributed/setup/index.rst

Lines changed: 0 additions & 1 deletion
@@ -12,7 +12,6 @@ You can install the ``neuronx-distributed`` package using the following command:

     python -m pip install neuronx_distributed --extra-index-url https://pip.repos.neuron.amazonaws.com

-Make sure the transformers version is set to ``4.26.0``


libraries/neuronx-distributed/tutorials/finetuning_llama2_7b_ptl.rst

Lines changed: 1 addition & 1 deletion
@@ -73,7 +73,7 @@ Download the Llama2-7B pre-trained checkpoint from HuggingFace.

 .. code:: ipython3

-    ssh compute1-dy-kaena-training-0-1
+    ssh compute1-dy-training-0-1
     source ~/aws_neuron_venv_pytorch/bin/activate
     cd ~/examples/tp_zero1_llama2_7b_hf_finetune_ptl
     python3 get_model.py

libraries/neuronx-distributed/tutorials/training_llama_tp_zero1.rst

Lines changed: 3 additions & 3 deletions
@@ -74,7 +74,7 @@ If you want to pre-train Llama2 7B, run the following steps -
 .. code:: ipython3

     python3 -m pip install -r requirements.txt
-    chmod +x tp_zero1_llama2_7b_hf_pretrain.sh
+    chmod +x tp_zero1_llama2_7B_hf_pretrain.sh

 To tokenize the data, we must request the tokenizer from Hugging Face and Meta by following the instructions at the following link: `HuggingFace Llama 3 8B Model <https://huggingface.co/meta-llama/Meta-Llama-3-8B>`__ .
@@ -105,10 +105,10 @@ Next let’s download and pre-process the dataset:

 .. code:: ipython3

-    cd ~/examples/tp_zero1_llama2_7b_hf_pretrain
+    cd ~/examples/tp_zero1_llama_hf_pretrain
     python3 get_dataset.py --llama-version 3 # change the version number to 2 for Llama-2 models

-`Note:` In case you see an error of the following form when downloading data: ``huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/ubuntu/examples/tp_zero1_llama2_7b_hf_pretrain'. Use `repo_type` argument if needed.``
+`Note:` In case you see an error of the following form when downloading data: ``huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/ubuntu/examples/tp_zero1_llama_hf_pretrain'. Use `repo_type` argument if needed.``

 This could be because of a stale cache. Try deleting the cache using:

 .. code:: ipython3
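The deletion command itself falls outside this hunk; as a hedged sketch only, assuming the stale cache is the HuggingFace datasets cache in its default location (the tutorial's actual command may differ):

```shell
# Assumption: the stale cache is HuggingFace's datasets cache at its
# default location; the tutorial's actual command is not shown in this hunk.
HF_CACHE="${HF_DATASETS_CACHE:-$HOME/.cache/huggingface/datasets}"
echo "removing $HF_CACHE"
rm -rf "$HF_CACHE"
```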

neuron-runtime/nrt-troubleshoot.rst

Lines changed: 1 addition & 1 deletion
@@ -597,7 +597,7 @@ Name resolution failure

 .. code:: bash

-    WARN Invalid NCCL_COMM_ID [compute1-st-kaena-training-0-1.pcluster-trn1-24-pdx80-2n.pcluster:41211], please use format: <ipv4>:<port> or [<ipv6>]:<port>
+    WARN Invalid NCCL_COMM_ID [compute1-dy-training-0-1.pcluster-trn1-24-pdx80-2n.pcluster:41211], please use format: <ipv4>:<port> or [<ipv6>]:<port>

 .. _solution-11:
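The warning above fires when ``NCCL_COMM_ID`` is given a hostname rather than a literal IP address. As an illustrative helper (an assumption, not part of the Neuron tooling), one way to build a value in the required ``<ipv4>:<port>`` form is to resolve the hostname first:

```python
import ipaddress
import socket

def nccl_comm_id(host: str, port: int) -> str:
    """Resolve a hostname to IPv4 and format it as <ipv4>:<port>.

    Illustrative helper only; raises if the resolved address is not a
    valid IPv4 literal, which is what the runtime warning complains about.
    """
    ip = socket.gethostbyname(host)   # returns IP literals unchanged
    ipaddress.IPv4Address(ip)         # validate the <ipv4> part
    return f"{ip}:{port}"

print(nccl_comm_id("127.0.0.1", 41211))  # 127.0.0.1:41211
```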
