
Conversation

@eric-higgins-ai (Contributor) commented Sep 11, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

It seems like Kueue assumes that when a workload becomes suspended it will shut down and then become inactive. For RayJobs, this means .spec.suspend is set to true, the KubeRay operator deletes the cluster and sets .status.jobDeploymentStatus to Suspended, and then Kueue removes the reservation for the workload. However, when using MultiKueue, the KubeRay operator doesn't run in the manager cluster. This means the workload never becomes inactive after being preempted (given the definition of IsActive here) and so continues to reserve quota in the cluster.
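
For illustration, an IsActive-style check along these lines would explain the behaviour. This is a simplified sketch against the KubeRay v1 API types, not Kueue's actual implementation:

```go
package main

import (
	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// isActive treats a RayJob as active until the KubeRay operator has
// reacted to .spec.suspend and moved .status.jobDeploymentStatus to
// Suspended. On a MultiKueue manager cluster the operator never runs,
// so a preempted RayJob never passes this check and keeps its quota
// reservation.
func isActive(job *rayv1.RayJob) bool {
	return job.Status.JobDeploymentStatus != rayv1.JobDeploymentStatusSuspended
}
```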

The same thing happens anywhere the workload is evicted in the manager cluster; the cases I'm aware of are preemption and waitForPodsReady, but there may be others.

Special notes for your reviewer:

I didn't add unit tests yet because I'm not very confident in this approach. I wanted to check whether this is the right approach first; if it is, I'll add unit tests.

Does this PR introduce a user-facing change?

Fixed preemption and waitForPodsReady for RayJobs when using MultiKueue.

@k8s-ci-robot added the release-note, kind/bug, and cncf-cla: yes labels on Sep 11, 2025
@k8s-ci-robot commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: eric-higgins-ai
Once this PR has been reviewed and has the lgtm label, please assign gabesaba for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the needs-ok-to-test label on Sep 11, 2025
@k8s-ci-robot commented:

Hi @eric-higgins-ai. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/S label on Sep 11, 2025
netlify bot commented Sep 11, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: 4409273
🔍 Latest deploy log: https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/68c2573ad4d7240008044400

@mbobrovskyi (Contributor) commented:

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label on Sep 11, 2025
@mbobrovskyi (Contributor) commented:

/cc @mszadkow

@eric-higgins-ai force-pushed the fix-multikueue-manager-preemption branch from a6f433a to c5b53ea on September 11, 2025 02:21
@eric-higgins-ai (Contributor, Author) commented:

/retest

@eric-higgins-ai (Contributor, Author) commented:

/retest

@mimowo (Contributor) commented Sep 11, 2025

Eric, so if I understand it correctly, the issue is with waitForPodsReady enablement on the management cluster? I think this is something we didn't take into consideration, so it's likely a bug.

Before we dive into the fix, what is the current behaviour: is waitForPodsReady ignored, or does it evict, with the workload ending up in a bad state?

Also, is it a problem only when the workload was already admitted to one of the worker clusters, or also when it stays pending on all clusters?

However, this seems to go beyond just RayJob, so I would prefer to look for a more generic solution.

@eric-higgins-ai (Contributor, Author) commented:

@mimowo the problem arises when the manager cluster evicts a workload. When I wrote the PR I thought this could happen with either preemption or waitForPodsReady (there are probably other cases, but these are the ones we hit in our usage of Kueue), but I just realized it's not actually correct to specify waitForPodsReady on the manager cluster. It's possible that the manager cluster admits a workload and dispatches it to a worker cluster, but the worker doesn't have sufficient quota, so the workload remains in the queue there. If waitForPodsReady is set on the manager, the workload will be returned to the queue after the timeout, which is undesirable for us.

However, the problem this PR is intended to fix also arises when the manager preempts a workload, so I think it's still useful. In this case, the preempted workload gets the Preempted and Evicted conditions, and .spec.suspend gets set on the RayJob. It gets stuck at this point, though: QuotaReserved stays true, the workload is never deleted on the worker cluster, and the "preemptor" workload never gets admitted.
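
To make the stuck state concrete, a check like the sketch below could detect it. This is a hypothetical illustration using the generic apimachinery condition helpers, with the condition type strings named above written as literals (in Kueue they are constants in the API package):

```go
package main

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isStuckAfterManagerEviction reports whether a workload is in the state
// described above: the manager has marked it Evicted (here, via preemption),
// but QuotaReserved is still true because nothing on the manager cluster
// ever transitions the suspended RayJob to an inactive state.
func isStuckAfterManagerEviction(conds []metav1.Condition) bool {
	return meta.IsStatusConditionTrue(conds, "Evicted") &&
		meta.IsStatusConditionTrue(conds, "QuotaReserved")
}
```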
