Failed to execute a multirun with different configurations in K8S #238

@Syulin7

Following the multirun user guide (https://docs.nvidia.com/nemo-framework/user-guide/latest/launcherguide/launchertutorial/multirun.html), I launched a multirun on K8S with the following command:

python3 main.py -m \
    stages=[training] \
    training.trainer.num_nodes=6 \
    training.run.name="5b_6nodes_tp_\${training.model.tensor_model_parallel_size}" \
    training.model.tensor_model_parallel_size=1,2,4,8

However, only the first Helm chart is deployed successfully. The subsequent installs fail because every release tries to create a ConfigMap with the same name: https://github.com/NVIDIA/NeMo-Megatron-Launcher/blob/f336f483bd9af73c4c665d91654100fa3b0bf0a1/launcher_scripts/nemo_launcher/core/k8s_templates/training/training-config.yaml#L4

Error: INSTALLATION FAILED: rendered manifests contain a resource that already exists. Unable to continue with install: ConfigMap "training-config" in namespace "default" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-name" must equal "5b-6nodes-tp-2": current value is "5b-6nodes-tp-1"
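
One possible fix, sketched here and not tested against the launcher, is to derive the ConfigMap name from the Helm release name so that each multirun job owns its own ConfigMap. The fragment below assumes the template currently sets `metadata.name` to the fixed string `training-config` (as the error message suggests); the `data` section and the rest of the chart are not shown and are assumed unchanged. `.Release.Name` is Helm's built-in release name object.

```yaml
# Hypothetical revision of training-config.yaml; only metadata.name changes.
apiVersion: v1
kind: ConfigMap
metadata:
  # Prefix with the release name (e.g. "5b-6nodes-tp-2") so releases no longer collide.
  name: {{ .Release.Name }}-training-config
data:
  # ... existing keys unchanged ...
```

Any other template in the chart that mounts or references this ConfigMap by name would need the same `{{ .Release.Name }}-training-config` expression, otherwise the pods would still look for `training-config`.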
