
Conversation

amy
Contributor

@amy amy commented Sep 15, 2025

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Related: #6621
Adds CQ weight as a tiebreaker to determine the winner of the fair-sharing tournament when DRS values are the same.
TL;DR: a workload that we're seeing cycle probably has the same DRS value as its preemptor. The cycling workload is also able to consistently preempt a workload that never shows up in a tournament (i.e. is never an admission candidate).

Here's a preemption cycle we're encountering...

Let's call the preemption-cycling workload w-cycle,
the workload that preempts w-cycle w-preemptor,
and the workload that w-cycle preempts w-target. -> w-target is in a very low-weight CQ; it never shows up as an admission candidate in any tournament.

Here's the cyclical preemption we're getting:

  1. w-cycle preempts w-target
  • Cluster state:
    • w-cycle admitted
    • w-target preempted
  2. w-preemptor preempts w-cycle -> both have DRS value 1
  • Cluster state:
    • w-cycle preempted
    • w-preemptor and w-target admitted

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

FairSharing: In case of equal DRS values for a pair of workloads, use the weight of the ClusterQueue as a tiebreaker
to determine which workload to admit. The workload from the ClusterQueue with the higher weight has priority.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Sep 15, 2025
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 15, 2025
@k8s-ci-robot k8s-ci-robot requested a review from mimowo September 15, 2025 20:07
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Sep 15, 2025

netlify bot commented Sep 15, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: b0e4576
🔍 Latest deploy log: https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/68ccdb82d863530008e4ff51

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Sep 15, 2025
@amy amy changed the title Add cq weight to tiebreakers when deciding tournament winner [WIP] Add cq weight to tiebreakers when deciding tournament winner Sep 15, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 15, 2025
@amy
Contributor Author

amy commented Sep 15, 2025

cc/ @PBundyra

@mimowo
Contributor

mimowo commented Sep 16, 2025

I would prefer to consider a more generic solution, as described in #6774 (comment)

So, DominantResourceShare() would return, instead of an int, some comparable struct, like drsValue, with functions "AsInt() int" and "Compare()". The Compare function would use high-resolution multiplication as described in the comment. The AsInt function would be used for API surfacing.
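For concreteness, a rough Go sketch of that shape (hypothetical names, fields, and scaling — not Kueue's actual types):

```go
package fairsharing

// drsValue is a hypothetical comparable form of the dominant resource
// share, kept as a share/weight pair to avoid lossy integer division.
type drsValue struct {
	share  int64 // high-resolution dominant-share numerator
	weight int64 // scaled ClusterQueue weight (assumed > 0)
}

// AsInt surfaces the rounded-down integer value for the API.
func (d drsValue) AsInt() int {
	return int(d.share / d.weight)
}

// Compare orders two values at full resolution by cross-multiplying:
// d.share/d.weight < o.share/o.weight <=> d.share*o.weight < o.share*d.weight.
func (d drsValue) Compare(o drsValue) int {
	l, r := d.share*o.weight, o.share*d.weight
	switch {
	case l < r:
		return -1
	case l > r:
		return 1
	default:
		return 0
	}
}
```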

@PBundyra
Contributor

I would prefer to consider a more generic solution, as described in #6774 (comment)

So, DominantResourceShare() would return, instead of an int, some comparable struct, like drsValue, with functions "AsInt() int" and "Compare()". The Compare function would use high-resolution multiplication as described in the comment. The AsInt function would be used for API surfacing.

This is an orthogonal thing. Here we only focus on what should happen when there is a tie wrt DRS (no matter what resolution/type we use)

@PBundyra
Contributor

This was discussed here: #6621. In case of a DRS tie, it's a better fallback than CreationTimestamp.

@mimowo
Contributor

mimowo commented Sep 16, 2025

I see, but the change in the DRF may mean that this is almost never reached, because https://github.com/kubernetes-sigs/kueue/pull/6846/files#diff-8716acbdd5bd256fcf6e05caf9b5573b819e7480aca9b6154dc91c8942ae73efR170-R173 would be more picky.

It would be good to have a test case which is solved by this, but would not be fixed by increasing DRF resolution itself.

@PBundyra
Contributor

PBundyra commented Sep 16, 2025

This code would only be reached if two CQs had different weights yet the DRF value (no matter the resolution) would still be the same, e.g.:
CQ1: weight: 1, borrowed resources: 1
CQ2: weight: 2, borrowed resources: 2

So if there are two Workloads submitted at the same time, one coming from CQ1 and one from CQ2, I think it's fair to say that the Kueue scheduler should process the one coming from CQ2 first.

I think this case could be rather common, as this is a perfectly balanced state of a Cohort.

I agree we should have a unit test for that
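
A minimal sketch of what such a unit test could look like (pickWinner is a hypothetical stand-in for the tiebreaker under discussion, not an existing Kueue function):

```go
package fairsharing_test

import "testing"

// pickWinner is a hypothetical stand-in for the tiebreaker: lower DRS wins,
// and with equal DRS the workload from the higher-weight ClusterQueue wins.
func pickWinner(drs1, drs2 int, weight1, weight2 float64) int {
	if drs1 != drs2 {
		if drs1 < drs2 {
			return 1
		}
		return 2
	}
	if weight1 >= weight2 {
		return 1
	}
	return 2
}

func TestCQWeightBreaksDRSTie(t *testing.T) {
	// CQ1: weight 1, borrows 1 -> DRS 1/1 = 1.
	// CQ2: weight 2, borrows 2 -> DRS 2/2 = 1 -> a tie on DRS.
	// With equal DRS, the higher-weight CQ2 should win.
	if got := pickWinner(1, 1, 1.0, 2.0); got != 2 {
		t.Errorf("expected the workload from CQ2 to win, got CQ%d", got)
	}
}
```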

@mimowo
Contributor

mimowo commented Sep 16, 2025

sgtm, thanks for clarifying

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 17, 2025
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 17, 2025
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 17, 2025
@amy amy force-pushed the cqWeightOrdering branch 2 times, most recently from d89a5fa to 2dc0524 on September 17, 2025 05:33
@amy
Contributor Author

amy commented Sep 17, 2025

/retest

Contributor

Can we simplify it?

@PBundyra
Contributor

Thanks!
/lgtm
cc @gabesaba @mimowo

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 19, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: fd6c4f778bb67178944374d50bb52561387c14ef

@mimowo
Contributor

mimowo commented Sep 19, 2025

The release note says "Add cq weight to tiebreakers when deciding FS tournament winner", which is a statement about the code change. The release note should focus on the user-facing impact instead.

Let me propose, but feel free to adjust if I'm wrong about something:

/release-note-edit

Fix the infinite preemption cycle between a pair of workloads in the scenario of exactly equal DRS
but differing ClusterQueue weights. Now we use the ClusterQueue weight as an additional tiebreaker.

/kind bug
/cherry-pick release-0.13
/cherry-pick release-0.12
/approve

@k8s-infra-cherrypick-robot
Contributor

@mimowo: once the present PR merges, I will cherry-pick it on top of release-0.12, release-0.13 in new PRs and assign them to you.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 19, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: amy, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 19, 2025
@mimowo
Contributor

mimowo commented Sep 19, 2025

/release-note-edit

Fix the FairSharing bug where an infinite preemption cycle between a pair of workloads was possible in the scenario
of exactly equal DRS but differing ClusterQueue weights. Now we use the ClusterQueue weight as an
additional tiebreaker.

@mimowo
Contributor

mimowo commented Sep 19, 2025

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 19, 2025
@mimowo
Contributor

mimowo commented Sep 19, 2025

Is it really solving the issue with the infinite preemption cycle, @amy @PBundyra?

I'm still puzzled about the user-facing impact, because of @gabesaba's proof that the cycle should never happen if we increase the resolution.

@mimowo
Contributor

mimowo commented Sep 19, 2025

OK, I have consulted with @PBundyra; indeed, the PR description may not accurately reflect the change.

I will update the release note according to my current understanding after the consultation.

@mimowo
Contributor

mimowo commented Sep 19, 2025

So, after consulting, I see some disconnect between the issue description and the code. First, the code does not solve the "infinite preemption loop"; it only affects the comparison function for scheduling, rather than preemption.

So, IIUC the more accurate release note would be:
/release-note-edit

FairSharing: In case of equal DRS values for a pair of workloads, use the weight of the ClusterQueue as a tiebreaker
to determine which workload to admit. The workload from the ClusterQueue with the higher weight has priority.

I think we could say this is a small tweak to an underspecified part of the API, so we are free to adjust it.

However, given the disconnect between description and code I'm not totally sure this change is needed or intended.

Let me know @amy or @gabesaba if you have some opinions.

@amy
Contributor Author

amy commented Sep 19, 2025

Yeah, I think I wrote the description in the earlier phases of understanding this problem, and it doesn't reflect the developments upstream.

I would still like to get this merged.

Assume DRS and priority are equivalent for A, B, D -> value 1.
C has a low priority:

A -> B : A & C admitted, B preempted
B -> C : B admitted, C preempted

D -> B : D & C admitted, B preempted
B -> C : B admitted, C preempted

E -> B : E & C admitted, B preempted
B -> C : B admitted, C preempted
Repeat


B here is the one that gets preempted/readmitted a lot. So while this isn't about preemption per se, the admission ordering is affecting B's readmission, and that's where I would like CQ weight to come in.

Precision will definitely improve things. But in the case where we have the same DRS and priority despite CQ weight differences, I would still like CQ weight to affect admission decisions. WDYT?

@mimowo
Contributor

mimowo commented Sep 22, 2025

Precision will definitely improve things. But in the case where we have the same DRS and priority despite CQ weight differences, I would still like CQ weight to affect admission decisions. WDYT?

Right, with the new increase in precision, hitting the new code on a large cluster will be very rare, I think. Sure, you can have "exactly equal DRS" in unit tests, as here, but it seems unlikely to be common on large clusters, so I'm a bit concerned about adding extra code which will be seldom exercised.

Also, the CQ weight is already part of the DRS computations, so adding this extra oomph to the CQ outside of the DRS calculation is ad hoc. I think we could/should update CompareDRS after Gabe's PR to say DRS1 < DRS2 when weight1 > weight2, given all else is equal. Then all uses of Compare would give the oomph consistently.

Still, even with the check moved to CompareDRS, I'm not totally sure this new code would make any practical difference on a large, non-unit-test setup.

Maybe it makes sense to test the cluster after Gabe's PR to assess the need for the extra checks, which increase code complexity, and are a bit ad-hoc (IMO).
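
For illustration, a hedged sketch of how the weight tiebreaker might fold into a single comparison so all callers apply it consistently (names and signature are assumptions, not the actual CompareDRS):

```go
package fairsharing

// compareDRS is a hypothetical consolidated comparison: DRS is the primary
// key, then ClusterQueue weight breaks ties, with the higher weight comparing
// as "smaller" (less dominant) so its workload is preferred.
func compareDRS(drs1, drs2 int, weight1, weight2 float64) int {
	switch {
	case drs1 < drs2:
		return -1
	case drs1 > drs2:
		return 1
	}
	switch {
	case weight1 > weight2:
		return -1
	case weight1 < weight2:
		return 1
	}
	return 0
}
```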

@mimowo
Contributor

mimowo commented Sep 22, 2025

cc @gabesaba @PBundyra in case you have some opinions here.

@gabesaba
Contributor

Maybe it makes sense to test the cluster after Gabe's PR to assess the need for the extra checks, which increase code complexity, and are a bit ad-hoc (IMO).

+1. The original goal of this PR was to prevent cycles, right? I would also favor keeping the logic simpler if the main problem is solved.

@gabesaba
Contributor

OTOH, @PBundyra made a good case for it here: #6846 (comment). I don't feel so strongly either way.

@amy
Contributor Author

amy commented Sep 22, 2025

Looking at ClusterQueue weights in one sample large cluster:
kubectl get cq --context=<CLUSTER> -o yaml | grep "weight:" | sort --unique
I get 15 unique weight values (which isn't a lot). Our workload sizes are also standardized, so I could still see the potential for DRS collisions.

As an example, here are 2 CQ weights:

weight: "0.0005"  -> borrows 4
weight: "0.001"     -> borrows 8 -> collision

While this is a bit of defensive programming, it doesn't seem like --too-- much complexity.
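
Assuming DRS scales roughly as borrowed value divided by CQ weight (an assumption for illustration only), the two queues above land on exactly the same share:

```go
package main

import "fmt"

func main() {
	// Hypothetical illustration: borrowed/weight for the two CQs above.
	drs1 := 4.0 / 0.0005 // weight 0.0005, borrows 4 -> 8000
	drs2 := 8.0 / 0.001  // weight 0.001, borrows 8 -> 8000
	fmt.Println(drs1 == drs2) // true: a DRS collision despite different weights
}
```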

I think we could/should update CompareDRS after Gabe's PR to say DRS1 < DRS2 when weight1 > weight2, given all else is equal. Then all uses of Compare would give the oomph consistently.

No, doing this in CompareDRS is probably not the right place: then CQ weight would precede workload priority in the ordering for admission-candidate selection in the tournament.

The logic for admission and preemption seems to work differently:

  • admission: you compare the workloads themselves, and the first thing you compare is DRS.
  • preemption: you first order the CQs, then compare the DRS of the workloads plucked off the ordered CQs. So the last thing you compare is DRS.
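
Loosely sketched in Go (purely illustrative pseudo-structure under those assumptions, not Kueue's actual scheduler code):

```go
package fairsharing

// Illustrative only: hypothetical types, not Kueue's actual scheduler code.
type candidate struct {
	drs      int
	priority int32
}

// Admission flow: candidate workloads are compared directly; DRS is the
// first key, followed by further tiebreakers such as priority.
func admissionLess(a, b candidate) bool {
	if a.drs != b.drs {
		return a.drs < b.drs
	}
	return a.priority > b.priority
}

// Preemption flow: the ClusterQueues are ordered first (cqOrder stands in
// for the result of that ordering); the workloads plucked off those queues
// are compared by DRS only at the end.
func preemptionLess(cqOrder int, a, b candidate) bool {
	if cqOrder != 0 {
		return cqOrder < 0
	}
	return a.drs < b.drs
}
```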

@PBundyra
Contributor

I don't fully get how this would prevent cycles from happening. The description above is not clear to me.

However, given the example I provided earlier, and the one Amy has just provided, I support this change regardless of whether it prevents cycles or not. I don't perceive it as too much complexity, just a small, nice improvement to the FS scheduling logic.
