
Commit 6724749

Merge pull request #2734 from AI-Hypercomputer:sft_doc_fix
PiperOrigin-RevId: 834995598
2 parents 1453a16 + 8f86172 commit 6724749

File tree

2 files changed: +9 -14 lines changed

docs/tutorials/sft.md

Lines changed: 2 additions & 2 deletions
@@ -84,13 +84,13 @@ export PRE_TRAINED_MODEL_CKPT_PATH=${BASE_OUTPUT_DIRECTORY}/${RUN_NAME}/maxtext-
 
 2. **Run the Conversion Script:** Execute the following command that downloads the specified Hugging Face model and converts its weights into the MaxText format. The conversion script only supports official versions of models from Hugging Face. To see the specific models and versions currently supported for conversion, please refer to the `HF_IDS` dictionary in the MaxText utility file [here](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/utils/ckpt_conversion/utils/utils.py).
 
 ```sh
-pip install torch # Ensure torch is installed for the conversion script
+python3 -m pip install torch --index-url https://download.pytorch.org/whl/cpu # Ensure torch is installed for the conversion script
 
 python3 -m MaxText.utils.ckpt_conversion.to_maxtext src/MaxText/configs/base.yml \
   model_name=${PRE_TRAINED_MODEL} \
   hf_access_token=${HF_TOKEN} \
   base_output_directory=${BASE_OUTPUT_DIRECTORY}/${RUN_NAME}/maxtext-checkpoint \
-  scan_layers=True
+  scan_layers=True skip_jax_distributed_system=True
 ```
 
 ## Run SFT on Hugging Face Dataset
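The change in `sft.md` swaps in a CPU-only torch wheel and appends `skip_jax_distributed_system=True` to the conversion invocation. A minimal sketch that assembles the updated command and prints it for review before running; all variable values here are hypothetical placeholders, not values from the tutorial:

```shell
# Hypothetical placeholder values; the tutorial exports these in earlier steps.
PRE_TRAINED_MODEL="llama3.1-8b"
HF_TOKEN="hf_xxx"
BASE_OUTPUT_DIRECTORY="gs://my-bucket"
RUN_NAME="sft-run"

# Assemble the conversion command as updated by this commit:
# scan_layers=True plus the new skip_jax_distributed_system=True flag.
CMD="python3 -m MaxText.utils.ckpt_conversion.to_maxtext src/MaxText/configs/base.yml \
model_name=${PRE_TRAINED_MODEL} \
hf_access_token=${HF_TOKEN} \
base_output_directory=${BASE_OUTPUT_DIRECTORY}/${RUN_NAME}/maxtext-checkpoint \
scan_layers=True skip_jax_distributed_system=True"

# Print the command for review instead of executing it here.
echo "$CMD"
```

Printing first is a dry-run convenience; replacing `echo "$CMD"` with `eval "$CMD"` would run the actual conversion, which additionally requires a MaxText checkout, the torch install above, and valid GCS and Hugging Face credentials.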

docs/tutorials/sft_on_multi_host.md

Lines changed: 7 additions & 12 deletions
@@ -50,7 +50,7 @@ The `docker_upload_runner.sh` script uploads your Docker image to Artifact Regis
 Install XPK by following the instructions in the [official documentation](https://github.com/AI-Hypercomputer/xpk?tab=readme-ov-file#installation-via-pip).
 
 ## 3. Create GKE cluster
-Use a pathways ready GKE cluster as described [here](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/create-gke-cluster)
+Use a Pathways-ready GKE cluster as described [here](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/create-gke-cluster).
 
 ## 4. Environment configuration
 ```bash
@@ -60,7 +60,7 @@ export CLUSTER_NAME=<Name of GKE Cluster>
 export ZONE=<GKE Cluster Zone>
 
 # -- Workload Configuration --
-export WORKLOAD_NAME=<Name of Workload> # e.g., $(date +%Y-%m-%d-%H-%M-%S)
+export WORKLOAD_NAME=<Name of Workload> # e.g., sft-$(date +%s)
 export TPU_TYPE=<TPU Type> # e.g., v6e-256
 export TPU_SLICE=1
 export DOCKER_IMAGE="gcr.io/${PROJECT}/${DOCKER_IMAGE_NAME}"
@@ -102,21 +102,16 @@ If your model checkpoint is from Hugging Face, you need to run a conversion scri
 export MODEL_CHECKPOINT_PATH=${OUTPUT_PATH}/${WORKLOAD_NAME}/maxtext-checkpoint/0/items
 ```
 
-2. **Run the Conversion Script:** Execute the following command that downloads the specified Hugging Face model and converts its weights into the MaxText format. The conversion script only supports official versions of models from Hugging Face. To see the specific models and versions currently supported for conversion, please refer to the `HF_IDS` dictionary in the MaxText utility file [here](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/utils/ckpt_conversion/utils/utils.py).
+2. **Run the Conversion Script:** Execute the following commands on a CPU machine to download the specified Hugging Face model, convert its weights into the MaxText format, and save them to the specified GCS bucket. The conversion script only supports official versions of models from Hugging Face. To see the specific models and versions currently supported for conversion, please refer to the `HF_IDS` dictionary in the MaxText utility file [here](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/utils/ckpt_conversion/utils/utils.py).
 
 ```bash
 USE_ZARR3=<Flag to use zarr3> # True to run SFT with McJAX, False to run SFT with Pathways
 USE_OCDBT=<Flag to use ocdbt> # True to run SFT with McJAX, False to run SFT with Pathways
 
-xpk workload create \
-  --cluster=${CLUSTER_NAME} \
-  --project=${PROJECT} \
-  --zone=${ZONE} \
-  --docker-image=${DOCKER_IMAGE} \
-  --workload=ckpt-${WORKLOAD_NAME} \
-  --tpu-type=${TPU_TYPE} \
-  --num-slices=${TPU_SLICE} \
-  --command "python3 -m MaxText.utils.ckpt_conversion.to_maxtext src/MaxText/configs/base.yml model_name=$MODEL_NAME hf_access_token=$HF_TOKEN base_output_directory=$OUTPUT_PATH/$WORKLOAD_NAME/maxtext-checkpoint scan_layers=True checkpoint_storage_use_zarr3=$USE_ZARR3 checkpoint_storage_use_ocdbt=$USE_OCDBT"
+python3 -m pip install torch --index-url https://download.pytorch.org/whl/cpu
+
+# For large models, it is recommended to set the `--lazy_load_tensors` flag to reduce memory usage during conversion
+python3 -m MaxText.utils.ckpt_conversion.to_maxtext src/MaxText/configs/base.yml model_name=$MODEL_NAME hf_access_token=$HF_TOKEN base_output_directory=$OUTPUT_PATH/$WORKLOAD_NAME/maxtext-checkpoint scan_layers=True checkpoint_storage_use_zarr3=$USE_ZARR3 checkpoint_storage_use_ocdbt=$USE_OCDBT skip_jax_distributed_system=True --lazy_load_tensors=True
 ```
 
 ## 6. Submit workload on GKE cluster
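Among the changes above, the `WORKLOAD_NAME` example moves from a bare timestamp to `sft-$(date +%s)`: the epoch-seconds suffix keeps names unique across runs, while the fixed `sft-` prefix makes the name start with a letter, which some cluster tooling requires of resource names. A minimal sketch of the new pattern:

```shell
# Generate a workload name using the pattern suggested by the diff:
# a fixed "sft-" prefix plus seconds since the Unix epoch for uniqueness.
WORKLOAD_NAME="sft-$(date +%s)"
echo "$WORKLOAD_NAME"
```

The resulting name is a short lowercase alphanumeric-and-hyphen string, e.g. of the form `sft-<digits>`.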

0 commit comments