-
Notifications
You must be signed in to change notification settings - Fork 237
MGMT-20239: Add GPU weight parameter to prioritize GPU hosts for worker role assignment #7957
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
MGMT-20239: Add GPU weight parameter to prioritize GPU hosts for worker role assignment #7957
Conversation
@linoyaslan: This pull request references MGMT-20239 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.20.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: linoyaslan The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
3287f82
to
140864f
Compare
@linoyaslan: This pull request references MGMT-20239 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.20.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
@linoyaslan: This pull request references MGMT-20239 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.20.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
ff7e248
to
f423be4
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #7957 +/- ##
=======================================
Coverage 73.68% 73.68%
=======================================
Files 400 400
Lines 68565 68574 +9
=======================================
+ Hits 50520 50531 +11
+ Misses 15336 15335 -1
+ Partials 2709 2708 -1
🚀 New features to boost your workflow:
|
ebef2f9
to
ce1c3a0
Compare
…er role assignment Adds a configurable GPU weight parameter to the host sorting algorithm to influence automatic role assignment. Hosts with GPUs are now deprioritized for master role assignment, making them more likely to be assigned as workers for specialized GPU workloads.
ce1c3a0
to
277eb65
Compare
@@ -66,7 +70,7 @@ func (m *Manager) initMonitoringQueryGenerator() { | |||
} | |||
} | |||
|
|||
func SortHosts(hosts []*models.Host) ([]*models.Host, bool) { | |||
func SortHosts(hosts []*models.Host, GPUWeight float64) ([]*models.Host, bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about creating 2 lists of hosts, those with GPU, those without GPU ; sort those 2 lists like we use to do, and then concatenate them ?
You would remove the need of a GPU weight
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I like your approach. I’ll try it out!
@@ -66,7 +70,7 @@ func (m *Manager) initMonitoringQueryGenerator() { | |||
} | |||
} | |||
|
|||
func SortHosts(hosts []*models.Host) ([]*models.Host, bool) { | |||
func SortHosts(hosts []*models.Host, GPUWeight float64) ([]*models.Host, bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to admit I find the assign role mechanism a bit weird, especially the implicit dependency between SortHosts and selectRole (IIUC selectRole implicitly relies on the fact that SortHosts is sorting host from the more likely to be a master to the less likely to be a master).
I understand that you are dealing with what exists but I feel like the role assignment logic is split into 2 methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the reason is that we first need to assign masters to achieve the number of expected masters, and they don’t have to be the “strongest", they just need to have sufficient CPU and memory, and that’s enough. Why do you find that weird? I think the approach makes sense, but if you believe it should be changed, I’d suggest handling that separately rather than as part of this PR.
@linoyaslan: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
For your konflux failures, if you rebase, it should be good |
Adds a configurable GPU weight parameter to the host sorting algorithm to influence automatic role assignment. Hosts with GPUs are now deprioritized for master role assignment, making them more likely to be assigned as workers for specialized GPU workloads.
Assisted-by: Cursor
/cc @danmanor @pastequo
List all the issues related to this PR
What environments does this code impact?
How was this code tested?
Checklist
docs
, README, etc)Reviewers Checklist