MGMT-20239: Add GPU weight parameter to prioritize GPU hosts for worker role assignment #7957

linoyaslan · 2025-08-26T11:31:41Z

Adds a configurable GPU weight parameter to the host sorting algorithm to influence automatic role assignment. Hosts with GPUs are now deprioritized for master role assignment, making them more likely to be assigned as workers for specialized GPU workloads.
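
To illustrate how a GPU weight could feed into the sort, here is a minimal sketch. It is not the code from this PR: the `Host` struct, the `weight` formula, its CPU and memory coefficients, and the ascending order (lowest score is the most likely master) are all assumptions made for illustration. Only the idea of a GPU weight passed into the sort comes from the change itself.

```go
package main

import "sort"

// Host is a hypothetical, simplified stand-in for models.Host.
type Host struct {
	Name     string
	CPUCores int
	MemGiB   int
	GPUs     int
}

// weight returns a score where a lower value means "better master candidate".
// The CPU and memory coefficients are illustrative assumptions.
// gpuWeight is added once per GPU, so GPU hosts drift toward the end.
func weight(h Host, gpuWeight float64) float64 {
	return float64(h.CPUCores) + 0.1*float64(h.MemGiB) + gpuWeight*float64(h.GPUs)
}

// sortByWeight returns a sorted copy of hosts, ordered from the lowest
// score (most likely master) to the highest score (most likely worker).
// The input slice is left untouched.
func sortByWeight(hosts []Host, gpuWeight float64) []Host {
	sorted := append([]Host(nil), hosts...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return weight(sorted[i], gpuWeight) < weight(sorted[j], gpuWeight)
	})
	return sorted
}
```

In this sketch, a weight of 0 leaves a GPU host in its natural position, while a large weight pushes it to the end of the list, where it is more likely to end up as a worker.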

Assisted-by: Cursor

/cc @danmanor @pastequo

List all the issues related to this PR

What environments does this code impact?

Automation (CI, tools, etc)
Cloud
Operator Managed Deployments
None

How was this code tested?

assisted-test-infra environment
dev-scripts environment
Reviewer's test appreciated
Waiting for CI to do a full test run
Manual (Elaborate on how it was tested)
No tests needed

Checklist

Title and description added to both, commit and PR.
Relevant issues have been associated (see CONTRIBUTING guide)
This change does not require a documentation update (docstring, docs, README, etc)
Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

Are the title and description (in both PR and commit) meaningful and clear?
Is there a bug required (and linked) for this change?
Should this PR be backported?

openshift-ci-robot · 2025-08-26T11:31:45Z

openshift-ci · 2025-08-26T11:32:46Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: linoyaslan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [linoyaslan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2025-08-27T12:31:41Z

openshift-ci-robot · 2025-08-27T12:32:20Z

linoyaslan · 2025-08-27T13:36:29Z

/cc @danmanor @pastequo

codecov · 2025-08-27T14:55:36Z

Codecov Report

❌ Patch coverage is 88.23529% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.68%. Comparing base (5ad96ae) to head (277eb65).
⚠️ Report is 1 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| cmd/main.go | 0.00% | 2 Missing ⚠️ |

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #7957   +/-   ##
=======================================
  Coverage   73.68%   73.68%           
=======================================
  Files         400      400           
  Lines       68565    68574    +9     
=======================================
+ Hits        50520    50531   +11     
+ Misses      15336    15335    -1     
+ Partials     2709     2708    -1

| Files with missing lines | Coverage Δ | |
|---|---|---|
| internal/bminventory/inventory.go | `71.50% <100.00%> (+<0.01%)` | ⬆️ |
| internal/common/test_configuration.go | `0.00% <ø> (ø)` | |
| internal/host/host.go | `73.25% <100.00%> (+0.02%)` | ⬆️ |
| internal/host/monitor.go | `83.42% <100.00%> (+0.66%)` | ⬆️ |
| cmd/main.go | `0.00% <0.00%> (ø)` | |

... and 1 file with indirect coverage changes

pastequo · 2025-08-28T09:42:25Z

internal/host/monitor.go

@@ -66,7 +70,7 @@ func (m *Manager) initMonitoringQueryGenerator() {
 	}
 }

-func SortHosts(hosts []*models.Host) ([]*models.Host, bool) {
+func SortHosts(hosts []*models.Host, GPUWeight float64) ([]*models.Host, bool) {


What about creating two lists of hosts, those with a GPU and those without, sorting each list the way we do today, and then concatenating them?
That would remove the need for a GPU weight.

linoyaslan (reply, Sep 3, 2025): Thanks, I like your approach. I’ll try it out!
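
The partition-and-concatenate idea suggested above can be sketched as follows. This is an illustration under stated assumptions, not the reviewer's or the author's implementation: the `Host` struct, the `HasGPU` flag, the CPU-only ordering in `sortGroup`, and the function names are hypothetical. Hosts without a GPU are placed first on the assumption that the sorted list runs from most likely master to least likely master.

```go
package main

import "sort"

// Host is a hypothetical, simplified stand-in for models.Host.
type Host struct {
	Name     string
	CPUCores int
	HasGPU   bool
}

// sortGroup stands in for the existing ordering logic. Sorting by CPU
// cores in ascending order is an assumption made for this sketch.
func sortGroup(hosts []Host) {
	sort.SliceStable(hosts, func(i, j int) bool {
		return hosts[i].CPUCores < hosts[j].CPUCores
	})
}

// sortHostsGPULast splits hosts into two groups, sorts each group with
// the same ordering, and returns the non-GPU hosts followed by the GPU
// hosts. No GPU weight is needed.
func sortHostsGPULast(hosts []Host) []Host {
	var withGPU, withoutGPU []Host
	for _, h := range hosts {
		if h.HasGPU {
			withGPU = append(withGPU, h)
		} else {
			withoutGPU = append(withoutGPU, h)
		}
	}
	sortGroup(withoutGPU)
	sortGroup(withGPU)
	return append(withoutGPU, withGPU...)
}
```

Compared with a numeric weight, this makes the GPU preference strict: a GPU host can only precede a non-GPU host if there are no non-GPU hosts at all.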

pastequo · 2025-08-28T09:48:48Z

internal/host/monitor.go

@@ -66,7 +70,7 @@ func (m *Manager) initMonitoringQueryGenerator() {
 	}
 }

-func SortHosts(hosts []*models.Host) ([]*models.Host, bool) {
+func SortHosts(hosts []*models.Host, GPUWeight float64) ([]*models.Host, bool) {


I have to admit I find the role assignment mechanism a bit weird, especially the implicit dependency between `SortHosts` and `selectRole` (IIUC, `selectRole` implicitly relies on `SortHosts` ordering hosts from the most likely to be a master to the least likely).

I understand that you are working with what already exists, but I feel like the role assignment logic is split across two methods.

linoyaslan (reply, Sep 3, 2025): Yes. The reason is that we first need to assign masters until we reach the expected number of masters, and they don’t have to be the “strongest” hosts; they only need sufficient CPU and memory. Why do you find that weird? I think the approach makes sense, but if you believe it should be changed, I’d suggest handling that separately rather than as part of this PR.
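
The dependency discussed in this thread can be sketched as a second pass that consumes the sorted list. This is a hypothetical illustration, not the actual `selectRole` code: the `Host` struct, the minimum CPU and memory values, and the `assignRoles` and `canBeMaster` names are assumptions. It only shows why the order produced by the sort decides which sufficient hosts become masters.

```go
package main

// Role is a hypothetical role type for this sketch.
type Role string

const (
	RoleMaster Role = "master"
	RoleWorker Role = "worker"
)

// Host is a hypothetical, simplified stand-in for models.Host.
type Host struct {
	Name     string
	CPUCores int
	MemGiB   int
}

// Assumed minimum requirements for a master, for illustration only.
const (
	minMasterCPUCores = 4
	minMasterMemGiB   = 16
)

// canBeMaster reports whether a host has sufficient CPU and memory.
func canBeMaster(h Host) bool {
	return h.CPUCores >= minMasterCPUCores && h.MemGiB >= minMasterMemGiB
}

// assignRoles walks the hosts in the order produced by the sort. The
// first expectedMasters hosts that meet the minimums become masters and
// every other host becomes a worker, so the result depends on the order.
func assignRoles(sorted []Host, expectedMasters int) map[string]Role {
	roles := make(map[string]Role, len(sorted))
	masters := 0
	for _, h := range sorted {
		if masters < expectedMasters && canBeMaster(h) {
			roles[h.Name] = RoleMaster
			masters++
		} else {
			roles[h.Name] = RoleWorker
		}
	}
	return roles
}
```

Under these assumptions, a host placed late in the sorted list (for example a GPU host) becomes a master only when fewer than `expectedMasters` sufficient hosts precede it.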

openshift-ci · 2025-08-28T14:59:01Z

@linoyaslan: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/edge-e2e-metal-assisted-lvm-4-20 | `277eb65` | link | true | `/test edge-e2e-metal-assisted-lvm-4-20` |
| ci/prow/edge-images | `277eb65` | link | true | `/test edge-images` |
| ci/prow/edge-e2e-metal-assisted-virtualization-4-19 | `277eb65` | link | true | `/test edge-e2e-metal-assisted-virtualization-4-19` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

pastequo · 2025-09-01T09:00:08Z

As for your Konflux failures, they should go away if you rebase.

openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) Aug 26, 2025

linoyaslan marked this pull request as draft August 26, 2025 11:31

openshift-ci bot added the do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) and size/L (denotes a PR that changes 100-499 lines, ignoring generated files) labels Aug 26, 2025

openshift-ci bot requested review from CrystalChun and gamli75 August 26, 2025 11:32

openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Aug 26, 2025

linoyaslan force-pushed the auto-assign-worker-role-to-host-with-gpu branch 2 times, most recently from 3287f82 to 140864f August 27, 2025 12:28

linoyaslan changed the title ~~MGMT-20239: Auto-assign logic should assign worker role to host with GPU~~ MGMT-20239: Add GPU weight parameter to prioritize GPU hosts for worker role assignment Aug 27, 2025

linoyaslan marked this pull request as ready for review August 27, 2025 12:31

openshift-ci bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Aug 27, 2025

openshift-ci bot requested review from danmanor and pastequo August 27, 2025 13:36

linoyaslan force-pushed the auto-assign-worker-role-to-host-with-gpu branch 2 times, most recently from ff7e248 to f423be4 August 27, 2025 13:43

linoyaslan force-pushed the auto-assign-worker-role-to-host-with-gpu branch 2 times, most recently from ebef2f9 to ce1c3a0 August 28, 2025 08:32

pastequo reviewed Aug 28, 2025

linoyaslan force-pushed the auto-assign-worker-role-to-host-with-gpu branch from ce1c3a0 to 277eb65 August 28, 2025 09:30

pastequo reviewed Aug 28, 2025
