Description
What happened?
Following the docs, a TrainingRuntime can be configured to create a Volcano PodGroup for each TrainJob instance.
This works correctly, except when a user specifies spec.trainer.resourcePerNode on the TrainJob: in that case the PodGroup's spec.minResources reflects the base TrainingRuntime resources rather than the resources of the specific TrainJob.
(Worth noting that the PodGroup does scale correctly when spec.trainer.num_nodes is specified.)
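For illustration, a minimal sketch of the kind of TrainJob that triggers this (all names and resource values are hypothetical; the runtime reference assumes a TrainingRuntime whose pod group policy creates a Volcano PodGroup):

```yaml
# Hypothetical TrainJob; names and values are illustrative only.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: example-trainjob
spec:
  runtimeRef:
    name: volcano-enabled-runtime  # assumed TrainingRuntime configured for Volcano
  trainer:
    resourcePerNode:               # field name as written in this report
      requests:
        cpu: "4"
        memory: 8Gi
```

With a TrainJob like this, the generated PodGroup's spec.minResources is computed from the TrainingRuntime's base resources instead of the 4 CPU / 8Gi requested here.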
What did you expect to happen?
When a TrainingRuntime is configured to use Volcano and a TrainJob is submitted with spec.trainer.resourcePerNode specified, the corresponding PodGroup should scale its spec.minResources according to spec.trainer.resourcePerNode, rather than according to the base TrainingRuntime resources.
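A sketch of the PodGroup one would expect from the scenario above (values hypothetical, assuming 2 nodes at 4 CPU / 8Gi per node as requested on the TrainJob):

```yaml
# Hypothetical expected PodGroup; values are illustrative only.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: example-trainjob
spec:
  minMember: 2
  minResources:     # expected: derived from the TrainJob's resourcePerNode x number of nodes,
    cpu: "8"        # not from the base TrainingRuntime resources
    memory: 16Gi
```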
Environment
Kubernetes version:
$ kubectl version
v1.34.1+k3s1
Kubeflow Trainer version:
$ kubectl get pods -n kubeflow-system -l app.kubernetes.io/name=kubeflow-trainer -o jsonpath="{.items[*].spec.containers[*].image}"
ghcr.io/kubeflow/trainer/trainer-controller-manager:v2.1.0
Kubeflow Python SDK version:
$ pip show kubeflow
Name: kubeflow
Version: 0.1.0
Impacted by this bug?
Give it a 👍 We prioritize the issues with the most 👍