@linoyaslan linoyaslan commented Aug 26, 2025

Adds a configurable GPU weight parameter to the host sorting algorithm to influence automatic role assignment. Hosts with GPUs are now deprioritized for master role assignment, making them more likely to be assigned as workers for specialized GPU workloads.
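One way to picture the weighting is the minimal sketch below. It is not the code from this PR: the `host` type, the `score` formula, and the lowercase `gpuWeight` name are assumptions made for illustration, and it assumes that hosts sorted earlier are considered for the master role first. A host with a GPU gets `gpuWeight` added to its score, which pushes it toward the end of the list.

```go
package main

import (
	"fmt"
	"sort"
)

// host is a hypothetical, simplified stand-in for models.Host.
type host struct {
	Name     string
	CPUCores int
	MemGiB   int
	HasGPU   bool
}

// score is an illustrative weight, not the real formula: hardware capacity
// plus gpuWeight when the host has a GPU.
func score(h host, gpuWeight float64) float64 {
	s := float64(h.CPUCores) + float64(h.MemGiB)
	if h.HasGPU {
		s += gpuWeight
	}
	return s
}

// sortByWeight returns a sorted copy, lowest score first. Under the
// assumption stated above, earlier hosts are tried for master first.
func sortByWeight(hosts []host, gpuWeight float64) []host {
	out := append([]host(nil), hosts...)
	sort.SliceStable(out, func(i, j int) bool {
		return score(out[i], gpuWeight) < score(out[j], gpuWeight)
	})
	return out
}

func main() {
	hosts := []host{
		{Name: "gpu-small", CPUCores: 8, MemGiB: 32, HasGPU: true},
		{Name: "cpu-large", CPUCores: 16, MemGiB: 64},
		{Name: "cpu-small", CPUCores: 8, MemGiB: 32},
	}
	// With a large enough weight the GPU host sorts last.
	for _, h := range sortByWeight(hosts, 100) {
		fmt.Println(h.Name)
	}
}
```

With a weight of 0 the GPU host is ordered purely by its hardware score, so the weight acts as a tunable penalty rather than a hard rule.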

Assisted-by: Cursor

/cc @danmanor @pastequo

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both, commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

openshift-ci-robot commented Aug 26, 2025

@linoyaslan: This pull request references MGMT-20239 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.20.0" version, but no target version was set.

In response to this:

WIP

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type.) on Aug 26, 2025
linoyaslan marked this pull request as draft on August 26, 2025 11:31
openshift-ci bot added the do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.) and size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) labels on Aug 26, 2025
openshift-ci bot requested review from CrystalChun and gamli75 on August 26, 2025 11:32

openshift-ci bot commented Aug 26, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: linoyaslan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Aug 26, 2025
linoyaslan force-pushed the auto-assign-worker-role-to-host-with-gpu branch 2 times, most recently from 3287f82 to 140864f, on August 27, 2025 12:28
linoyaslan changed the title from "MGMT-20239: Auto-assign logic should assign worker role to host with GPU" to "MGMT-20239: Add GPU weight parameter to prioritize GPU hosts for worker role assignment" on Aug 27, 2025

openshift-ci-robot commented Aug 27, 2025

@linoyaslan: This pull request references MGMT-20239 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.20.0" version, but no target version was set.

linoyaslan marked this pull request as ready for review on August 27, 2025 12:31
openshift-ci bot removed the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) on Aug 27, 2025

openshift-ci-robot commented Aug 27, 2025

@linoyaslan: This pull request references MGMT-20239 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.20.0" version, but no target version was set.

@linoyaslan
Contributor Author

/cc @danmanor @pastequo

openshift-ci bot requested review from danmanor and pastequo on August 27, 2025 13:36
linoyaslan force-pushed the auto-assign-worker-role-to-host-with-gpu branch 2 times, most recently from ff7e248 to f423be4, on August 27, 2025 13:43

codecov bot commented Aug 27, 2025

Codecov Report

❌ Patch coverage is 88.23529% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.68%. Comparing base (5ad96ae) to head (277eb65).
⚠️ Report is 1 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| cmd/main.go | 0.00% | 2 Missing ⚠️ |

@@           Coverage Diff           @@
##           master    #7957   +/-   ##
=======================================
  Coverage   73.68%   73.68%           
=======================================
  Files         400      400           
  Lines       68565    68574    +9     
=======================================
+ Hits        50520    50531   +11     
+ Misses      15336    15335    -1     
+ Partials     2709     2708    -1     

| Files with missing lines | Coverage Δ |
|---|---|
| internal/bminventory/inventory.go | 71.50% <100.00%> (+<0.01%) ⬆️ |
| internal/common/test_configuration.go | 0.00% <ø> (ø) |
| internal/host/host.go | 73.25% <100.00%> (+0.02%) ⬆️ |
| internal/host/monitor.go | 83.42% <100.00%> (+0.66%) ⬆️ |
| cmd/main.go | 0.00% <0.00%> (ø) |

... and 1 file with indirect coverage changes

linoyaslan force-pushed the auto-assign-worker-role-to-host-with-gpu branch 2 times, most recently from ebef2f9 to ce1c3a0, on August 28, 2025 08:32
linoyaslan force-pushed the auto-assign-worker-role-to-host-with-gpu branch from ce1c3a0 to 277eb65 on August 28, 2025 09:30
@@ -66,7 +70,7 @@ func (m *Manager) initMonitoringQueryGenerator() {
}
}

-func SortHosts(hosts []*models.Host) ([]*models.Host, bool) {
+func SortHosts(hosts []*models.Host, GPUWeight float64) ([]*models.Host, bool) {
Contributor

What about creating two lists of hosts, those with a GPU and those without, sorting each list the way we already do, and then concatenating them?
That would remove the need for a GPU weight.
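The suggestion above could be sketched roughly as follows. This is an illustration only: the `host` type, the `sortGPULast` name, and the CPU-only comparator are hypothetical stand-ins for the existing sorting logic, which is not shown in this thread.

```go
package main

import (
	"fmt"
	"sort"
)

// host is a hypothetical, simplified stand-in for models.Host.
type host struct {
	Name     string
	CPUCores int
	HasGPU   bool
}

// sortGPULast splits hosts into non-GPU and GPU groups, sorts each group
// with the same comparator, and returns the non-GPU group followed by the
// GPU group.
func sortGPULast(hosts []host) []host {
	var plain, gpu []host
	for _, h := range hosts {
		if h.HasGPU {
			gpu = append(gpu, h)
		} else {
			plain = append(plain, h)
		}
	}
	// Placeholder for the existing ordering: fewest CPU cores first.
	byCPU := func(s []host) {
		sort.SliceStable(s, func(i, j int) bool {
			return s[i].CPUCores < s[j].CPUCores
		})
	}
	byCPU(plain)
	byCPU(gpu)
	return append(plain, gpu...)
}

func main() {
	hosts := []host{
		{Name: "gpu-a", CPUCores: 4, HasGPU: true},
		{Name: "cpu-b", CPUCores: 16},
		{Name: "gpu-c", CPUCores: 2, HasGPU: true},
		{Name: "cpu-d", CPUCores: 8},
	}
	for _, h := range sortGPULast(hosts) {
		fmt.Println(h.Name)
	}
}
```

Compared with a numeric weight, this makes GPU presence a strict tiebreaker ahead of every other criterion, so there is no value to tune and no configuration parameter to pass through.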

Contributor Author

Thanks, I like your approach. I’ll try it out!

@@ -66,7 +70,7 @@ func (m *Manager) initMonitoringQueryGenerator() {
}
}

-func SortHosts(hosts []*models.Host) ([]*models.Host, bool) {
+func SortHosts(hosts []*models.Host, GPUWeight float64) ([]*models.Host, bool) {
Contributor

I have to admit I find the role assignment mechanism a bit odd, especially the implicit dependency between SortHosts and selectRole (IIUC, selectRole implicitly relies on SortHosts ordering hosts from most likely to least likely to be a master).

I understand that you are working with what already exists, but I feel the role assignment logic is split across two methods.

Contributor Author

Yes. The reason is that we first need to assign masters until we reach the expected number of masters, and they don't have to be the "strongest" hosts; they just need sufficient CPU and memory. Why do you find that weird? I think the approach makes sense, but if you believe it should be changed, I'd suggest handling that separately rather than as part of this PR.
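The ordering dependency discussed in this thread can be illustrated with a small sketch. It is not the actual selectRole implementation, which is not shown here: the `host` type, the `assignRoles` function, and the minimum CPU and memory thresholds are all hypothetical. The point is that the selection step simply walks the list in the order it receives, so whichever hosts the sort places first are the ones that become masters.

```go
package main

import "fmt"

// host is a hypothetical, simplified stand-in for models.Host.
type host struct {
	Name     string
	CPUCores int
	MemGiB   int
}

// Hypothetical minimum requirements for a master.
const (
	minMasterCPU = 4
	minMasterMem = 16
)

// assignRoles walks hosts in the order given. The first `masters` hosts
// that meet the minimums become masters; every other host becomes a worker.
// The result therefore depends entirely on the input order.
func assignRoles(sorted []host, masters int) map[string]string {
	roles := make(map[string]string, len(sorted))
	assigned := 0
	for _, h := range sorted {
		sufficient := h.CPUCores >= minMasterCPU && h.MemGiB >= minMasterMem
		if assigned < masters && sufficient {
			roles[h.Name] = "master"
			assigned++
		} else {
			roles[h.Name] = "worker"
		}
	}
	return roles
}

func main() {
	sorted := []host{
		{Name: "tiny", CPUCores: 2, MemGiB: 8},
		{Name: "small", CPUCores: 4, MemGiB: 16},
		{Name: "medium", CPUCores: 8, MemGiB: 32},
		{Name: "gpu-large", CPUCores: 32, MemGiB: 128},
	}
	roles := assignRoles(sorted, 2)
	for _, h := range sorted {
		fmt.Printf("%s: %s\n", h.Name, roles[h.Name])
	}
}
```

In this sketch the two weakest sufficient hosts become masters and the strongest host stays a worker, which matches the "sufficient, not strongest" reasoning above, but only because the caller passed the hosts in that order.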

openshift-ci bot commented Aug 28, 2025

@linoyaslan: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/edge-e2e-metal-assisted-lvm-4-20 | 277eb65 | link | true | /test edge-e2e-metal-assisted-lvm-4-20 |
| ci/prow/edge-images | 277eb65 | link | true | /test edge-images |
| ci/prow/edge-e2e-metal-assisted-virtualization-4-19 | 277eb65 | link | true | /test edge-e2e-metal-assisted-virtualization-4-19 |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@pastequo
Contributor

pastequo commented Sep 1, 2025

As for your Konflux failures, a rebase should fix them.

Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.