Volcano integration doesn't account for resourcesPerNode override in the trainjob #2980

@AndEsterson

What happened?

Following the docs, we can configure a TrainingRuntime to create a Volcano PodGroup for each TrainJob instance.

This works correctly except when a user specifies spec.trainer.resourcesPerNode on the TrainJob. In that case the PodGroup's spec.minResources corresponds to the resources defined in the base TrainingRuntime rather than to the override on that specific TrainJob.

(It may be worth noting that the PodGroup does scale correctly when spec.trainer.numNodes is specified.)

What did you expect to happen?

When a TrainingRuntime is configured to use Volcano and a TrainJob is submitted with spec.trainer.resourcesPerNode specified, the corresponding PodGroup's resources should scale with spec.trainer.resourcesPerNode rather than with the resources of the base TrainingRuntime.
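
To make the expected numbers concrete, here is a rough Python sketch of the arithmetic I have in mind. It is not the controller's code: the function names, the sample quantities, and the assumption that the override replaces the runtime's requests for the same resource names are all mine, and it assumes one trainer container per node and nothing else in the PodGroup.

```python
from decimal import Decimal

# Only the suffixes needed for this illustration; real Kubernetes
# quantities support more forms than these.
_SUFFIXES = {
    "m": Decimal("0.001"),
    "Ki": Decimal(2**10),
    "Mi": Decimal(2**20),
    "Gi": Decimal(2**30),
}


def parse_quantity(q):
    """Convert a simple quantity string such as '500m' or '2Gi' to a Decimal."""
    for suffix, factor in _SUFFIXES.items():
        if q.endswith(suffix):
            return Decimal(q[: -len(suffix)]) * factor
    return Decimal(q)


def expected_min_resources(runtime_requests, num_nodes, override_requests=None):
    """Hypothetical helper: per-node requests multiplied by the node count.

    Assumption: a resourcesPerNode override replaces the runtime's
    requests for the resource names it mentions.
    """
    per_node = dict(runtime_requests)
    if override_requests:
        per_node.update(override_requests)
    return {name: parse_quantity(q) * num_nodes for name, q in per_node.items()}


# Made-up values: the base TrainingRuntime requests 1 CPU / 2Gi per node,
# while the TrainJob overrides that to 4 CPU / 8Gi on 2 nodes.
runtime = {"cpu": "1", "memory": "2Gi"}
override = {"cpu": "4", "memory": "8Gi"}

# What the PodGroup currently reflects (override ignored).
observed = expected_min_resources(runtime, num_nodes=2)
# What I would expect minResources to reflect (override applied).
expected = expected_min_resources(runtime, num_nodes=2, override_requests=override)
```

With these made-up values, the behaviour described above corresponds to `observed` (2 CPU, 4Gi in total), whereas I would expect the PodGroup to reflect `expected` (8 CPU, 16Gi in total).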

Environment

Kubernetes version:

$ kubectl version
v1.34.1+k3s1

Kubeflow Trainer version:

$ kubectl get pods -n kubeflow-system -l app.kubernetes.io/name=kubeflow-trainer -o jsonpath="{.items[*].spec.containers[*].image}"
ghcr.io/kubeflow/trainer/trainer-controller-manager:v2.1.0

Kubeflow Python SDK version:

$ pip show kubeflow
Name: kubeflow
Version: 0.1.0

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍
