Open
Description
According to the multirun user guide (https://docs.nvidia.com/nemo-framework/user-guide/latest/launcherguide/launchertutorial/multirun.html), a sweep over several tensor-parallel sizes can be launched with a single command:
python3 main.py -m \
stages=[training] \
training.trainer.num_nodes=6 \
training.run.name="5b_6nodes_tp_\${training.model.tensor_model_parallel_size}" \
training.model.tensor_model_parallel_size=1,2,4,8
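However, only the first Helm chart deploys successfully. Every run appears to render a ConfigMap with the same name, "training-config", so each later release conflicts with the ConfigMap already owned by the first one. The name is set here: https://github.com/NVIDIA/NeMo-Megatron-Launcher/blob/f336f483bd9af73c4c665d91654100fa3b0bf0a1/launcher_scripts/nemo_launcher/core/k8s_templates/training/training-config.yaml#L4
The second install fails with: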
Error: INSTALLATION FAILED: rendered manifests contain a resource that already exists. Unable to continue with install: ConfigMap "training-config" in namespace "default" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-name" must equal "5b-6nodes-tp-2": current value is "5b-6nodes-tp-1"
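One possible direction for a fix is to give each run its own ConfigMap name derived from the run name, so that multirun releases no longer collide. The sketch below is only an illustration of that idea: `configmap_name` is a hypothetical helper, not part of the launcher, and it assumes the template could accept the name as a value. It mirrors the underscore-to-hyphen conversion visible in the release names in the error above and keeps the result within the Kubernetes limit for ConfigMap names (lowercase DNS subdomain, at most 253 characters).

```python
import re


def configmap_name(run_name: str, suffix: str = "training-config") -> str:
    """Hypothetical helper: build a per-run ConfigMap name from the run name."""
    # Replace every run of characters that is not valid in a DNS subdomain
    # name (for example "_") with a single hyphen.
    slug = re.sub(r"[^a-z0-9.-]+", "-", run_name.lower()).strip("-.")
    name = f"{slug}-{suffix}"
    # ConfigMap names are limited to 253 characters and must end with an
    # alphanumeric character.
    return name[:253].rstrip("-.")


# The four runs from the multirun command above now get distinct names.
for tp in (1, 2, 4, 8):
    print(configmap_name(f"5b_6nodes_tp_{tp}"))
```

If something like this were adopted, every place in the chart that mounts or references the ConfigMap would need to use the same per-run name.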