
Commit 800d00f

2.19.1 miscellaneous fixes (#929)
Co-authored-by: Kavish Gandhi <[email protected]>
1 parent c6feb18 commit 800d00f

File tree

5 files changed: +6, -7 lines changed

frameworks/torch/torch-neuronx/training-troubleshooting.rst

Lines changed: 1 addition & 1 deletion
@@ -171,7 +171,7 @@ Currently, NeuronCache default root directory is /var/tmp which is local to the

 .. code:: bash

-    KeyError: 'neff_cache2/neuron-compile-cache/USER_neuroncc-1.0.48875.0+7437fbf18/MODULE_7223055628515330524/MODULE_0_SyncTensorsGraph.14_7223055628515330524_compute1-dy-kaena-training-2-1-e859998e-3035-5df63dab5ce63'
+    KeyError: 'neff_cache2/neuron-compile-cache/USER_neuroncc-1.0.48875.0+7437fbf18/MODULE_7223055628515330524/MODULE_0_SyncTensorsGraph.14_7223055628515330524_compute1-dy-training-2-1-e859998e-3035-5df63dab5ce63'

 This is a result of limitations to file locking on NFS; EFS and FSx exhibit a similar limitation. The workaround is to set up a separate NeuronCache root directory for each worker instance, such as ``NEURON_CC_FLAGS="--cache_dir=$HOME/neuron_cache/bert/`hostname`"``, where the home directory is shared among worker instances, as in ParallelCluster.
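The per-instance cache workaround above can be sketched as follows (a minimal sketch; the ``bert`` path component is illustrative, and any directory shared across workers but unique per host works):

```shell
# Give each worker instance its own NeuronCache root so that
# NFS/EFS/FSx file-locking limitations are avoided.
CACHE_ROOT="$HOME/neuron_cache/bert/$(hostname)"   # unique per host
mkdir -p "$CACHE_ROOT"
export NEURON_CC_FLAGS="--cache_dir=$CACHE_ROOT"
echo "$NEURON_CC_FLAGS"
```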

libraries/neuronx-distributed/setup/index.rst

Lines changed: 0 additions & 1 deletion
@@ -12,7 +12,6 @@ You can install the ``neuronx-distributed`` package using the following command:

     python -m pip install neuronx_distributed --extra-index-url https://pip.repos.neuron.amazonaws.com

-Make sure the transformers version is set to ``4.26.0``


libraries/neuronx-distributed/tutorials/finetuning_llama2_7b_ptl.rst

Lines changed: 1 addition & 1 deletion
@@ -73,7 +73,7 @@ Download the Llama2-7B pre-trained checkpoint from HuggingFace.

 .. code:: ipython3

-    ssh compute1-dy-kaena-training-0-1
+    ssh compute1-dy-training-0-1
     source ~/aws_neuron_venv_pytorch/bin/activate
     cd ~/examples/tp_zero1_llama2_7b_hf_finetune_ptl
     python3 get_model.py

libraries/neuronx-distributed/tutorials/training_llama_tp_zero1.rst

Lines changed: 3 additions & 3 deletions
@@ -74,7 +74,7 @@ If you want to pre-train Llama2 7B, run the following steps -
 .. code:: ipython3

     python3 -m pip install -r requirements.txt
-    chmod +x tp_zero1_llama2_7b_hf_pretrain.sh
+    chmod +x tp_zero1_llama2_7B_hf_pretrain.sh

 To tokenize the data, we must request the tokenizer from Hugging Face and Meta by following the instructions at the following link: `HuggingFace Llama 3 8B Model <https://huggingface.co/meta-llama/Meta-Llama-3-8B>`__ .
@@ -105,10 +105,10 @@ Next let’s download and pre-process the dataset:

 .. code:: ipython3

-    cd ~/examples/tp_zero1_llama2_7b_hf_pretrain
+    cd ~/examples/tp_zero1_llama_hf_pretrain
     python3 get_dataset.py --llama-version 3 # change the version number to 2 for Llama-2 models

-`Note:` In case you see an error of the following form when downloading data: ``huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/ubuntu/examples/tp_zero1_llama2_7b_hf_pretrain'. Use `repo_type` argument if needed.``
+`Note:` In case you see an error of the following form when downloading data: ``huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/ubuntu/examples/tp_zero1_llama_hf_pretrain'. Use `repo_type` argument if needed.``

 This could be because of a stale cache. Try deleting the cache using:

 .. code:: ipython3
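The deletion command itself falls outside this hunk; as a hedged sketch only, assuming the stale cache is the HuggingFace datasets cache in its default location (the tutorial's actual command may differ):

```shell
# Assumption: the stale cache is HuggingFace's datasets cache at its
# default location; the tutorial's actual command is not shown in this hunk.
HF_CACHE="${HF_DATASETS_CACHE:-$HOME/.cache/huggingface/datasets}"
echo "removing $HF_CACHE"
rm -rf "$HF_CACHE"
```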

neuron-runtime/nrt-troubleshoot.rst

Lines changed: 1 addition & 1 deletion
@@ -597,7 +597,7 @@ Name resolution failure

 .. code:: bash

-    WARN Invalid NCCL_COMM_ID [compute1-st-kaena-training-0-1.pcluster-trn1-24-pdx80-2n.pcluster:41211], please use format: <ipv4>:<port> or [<ipv6>]:<port>
+    WARN Invalid NCCL_COMM_ID [compute1-dy-training-0-1.pcluster-trn1-24-pdx80-2n.pcluster:41211], please use format: <ipv4>:<port> or [<ipv6>]:<port>

 .. _solution-11:
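The warning above fires when ``NCCL_COMM_ID`` is given a hostname rather than a literal IP address. As an illustrative helper (an assumption, not part of the Neuron tooling), one way to build a value in the required ``<ipv4>:<port>`` form is to resolve the hostname first:

```python
import ipaddress
import socket

def nccl_comm_id(host: str, port: int) -> str:
    """Resolve a hostname to IPv4 and format it as <ipv4>:<port>.

    Illustrative helper only; raises if the resolved address is not a
    valid IPv4 literal, which is what the runtime warning complains about.
    """
    ip = socket.gethostbyname(host)   # returns IP literals unchanged
    ipaddress.IPv4Address(ip)         # validate the <ipv4> part
    return f"{ip}:{port}"

print(nccl_comm_id("127.0.0.1", 41211))  # 127.0.0.1:41211
```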
