
Conversation

@klueska
Collaborator

@klueska klueska commented Mar 11, 2025

Previously we relied on the nvidia.com/clique label being applied to a node before a ComputeDomain could successfully launch an IMEX daemon on that node. However, it's possible that a ComputeDomain might be created before GFD comes online to apply the label. Independently, it's a bad idea to rely on a label to get the cliqueID of a remote node, when the information is readily available on each node where an IMEX daemon is being started.

This PR removes our reliance on the nvidia.com/clique label to set the Status.Nodes field of a ComputeDomain. We now update the Status.Nodes field in a distributed fashion, directly from each node being added to the ComputeDomain. This logic is actually simpler, and removes the need to track and react to modifications on each of the DaemonSet pods from the controller.
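The distributed update described above can be sketched as follows. The type and function names here are hypothetical stand-ins (the real ComputeDomainNode type lives in the project's nvapi package, and the real update goes through the Kubernetes API); the sketch only illustrates the idempotent add-or-update of a node's own entry in Status.Nodes:

```go
package main

import "fmt"

// Hypothetical mirror of the nvapi.ComputeDomainNode type; field names
// follow the snippets shown in this review.
type ComputeDomainNode struct {
	Name      string
	IPAddress string
	CliqueID  string
}

// upsertNode adds this node's entry to the status node list, or updates
// it in place if an entry with the same Name already exists, so that
// repeated calls from the same kubelet plugin are idempotent.
func upsertNode(nodes []ComputeDomainNode, n ComputeDomainNode) []ComputeDomainNode {
	for i, existing := range nodes {
		if existing.Name == n.Name {
			nodes[i] = n
			return nodes
		}
	}
	return append(nodes, n)
}

func main() {
	var nodes []ComputeDomainNode
	nodes = upsertNode(nodes, ComputeDomainNode{Name: "node-a", IPAddress: "10.0.0.1", CliqueID: "0"})
	nodes = upsertNode(nodes, ComputeDomainNode{Name: "node-a", IPAddress: "10.0.0.2", CliqueID: "0"})
	fmt.Println(len(nodes), nodes[0].IPAddress) // 1 10.0.0.2
}
```

In the real code each plugin would then write this list back via an update to the ComputeDomain's status subresource, which is what removes the controller-side pod tracking.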

This change was verified on both a 4-node GH200 cluster as well as a 2-node GB200 cluster, following the procedure outlined here: #249

@copy-pr-bot

copy-pr-bot bot commented Mar 11, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@klueska klueska self-assigned this Mar 11, 2025
@klueska klueska added this to the v25.3.0 milestone Mar 11, 2025
@klueska
Collaborator Author

klueska commented Mar 11, 2025

/ok to test

@ArangoGutierrez ArangoGutierrez requested a review from Copilot March 11, 2025 09:26

Copilot AI left a comment


PR Overview

This pull request removes the dependency on the "nvidia.com/gpu.clique" label for updating the ComputeDomain node status and simplifies the related control logic.

  • Removed the constant and references to the clique label from the kubelet plugin code.
  • Eliminated the pod manager logic from the controller and cleaned up the daemonset-related lifecycle management.
  • Updated ComputeDomain status updates to be performed locally on each node via new helper functions.

Reviewed Changes

File Description
cmd/compute-domain-kubelet-plugin/device_state.go Removed clique label constant and added direct node status updates.
cmd/compute-domain-controller/daemonset.go Removed pod manager variables and related functions.
cmd/compute-domain-kubelet-plugin/computedomain.go Added new functions to update ComputeDomain node status and retrieve node info using m.cliqueID.
cmd/compute-domain-controller/daemonsetpods.go Removed the entire DaemonSetPodManager implementation.

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Collaborator

@ArangoGutierrez ArangoGutierrez left a comment


tested on a GH200 system

@jgehrcke
Collaborator

might be created before GFD comes online to apply the label

GFD? I searched the web. GPU Feature Discovery.


var ipAddress string
for _, addr := range node.Status.Addresses {
    if addr.Type == corev1.NodeInternalIP {
Collaborator

@jgehrcke jgehrcke Mar 11, 2025


Is this the common/established choice to make to filter for the "canonical" IP address?

Is there always one address of type corev1.NodeInternalIP?

Collaborator


yes, each node object has one address of type InternalIP:

❯ k describe no
Name:               ip-10-0-0-xxx
Roles:              control-plane,worker
...
Addresses:
  InternalIP:  10.0.0.xxx
  Hostname:    ip-10-0-0-xxx
...
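The selection being discussed can be sketched as follows, with local stand-ins for the k8s.io/api/core/v1 types (in the real code, node.Status.Addresses comes from the Node object reported by the kubelet):

```go
package main

import "fmt"

// Minimal stand-ins for the corev1 types used in the diff; in the real
// code these come from k8s.io/api/core/v1.
type NodeAddressType string

const NodeInternalIP NodeAddressType = "InternalIP"

type NodeAddress struct {
	Type    NodeAddressType
	Address string
}

// internalIP returns the first address of type InternalIP, matching the
// filtering shown in the diff. Kubelets normally report exactly one
// such address, but the ok flag guards against a node that has none.
func internalIP(addrs []NodeAddress) (string, bool) {
	for _, addr := range addrs {
		if addr.Type == NodeInternalIP {
			return addr.Address, true
		}
	}
	return "", false
}

func main() {
	addrs := []NodeAddress{
		{Type: NodeInternalIP, Address: "10.0.0.42"},
		{Type: "Hostname", Address: "ip-10-0-0-42"},
	}
	ip, ok := internalIP(addrs)
	fmt.Println(ip, ok) // 10.0.0.42 true
}
```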

Collaborator Author


I would actually like to update this to use the non-host IP at some point, but that is untested and will need to wait for a future release.

n := &nvapi.ComputeDomainNode{
    Name:      nodeName,
    IPAddress: ipAddress,
    CliqueID:  m.cliqueID,
Collaborator


Note to self: we set m.cliqueID (string) upon calling NewComputeDomainManager(). That is probably populated straight from the source of truth (code not shown in diff).

Collaborator Author


yes, this was already available for other reasons

Collaborator

@jgehrcke jgehrcke left a comment


Thank you!

when the information is readily available on each node where an IMEX daemon is being started

Thanks for that argument. It's really the single source of truth concept. The label is an indirection, adding complexity and new failure modes.

This logic is actually simpler

More robust by way of simplicity. Lovely.


var ips []string
for _, node := range cd.Status.Nodes {
    if m.cliqueID == node.CliqueID {
Member


Under which conditions would we expect a different CliqueID here? It seems that we're only setting this value once for the nodes.

Collaborator Author


When the compute domain spans nodes that have GPUs from different cliques. When this happens, we only want to include the node IPs in the node_config.cfg from nodes that have GPUs with the same cliqueID as the current node.
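That filtering can be sketched as follows; ComputeDomainNode here is a hypothetical stand-in mirroring the fields shown in the diff, and ipsForClique is an illustrative helper, not the project's actual function:

```go
package main

import "fmt"

// Hypothetical mirror of the nvapi.ComputeDomainNode type used in the PR.
type ComputeDomainNode struct {
	Name      string
	IPAddress string
	CliqueID  string
}

// ipsForClique collects the IPs of only those nodes whose CliqueID
// matches the local node's, mirroring the loop in the diff that builds
// the IMEX daemon's node config.
func ipsForClique(nodes []ComputeDomainNode, cliqueID string) []string {
	var ips []string
	for _, node := range nodes {
		if node.CliqueID == cliqueID {
			ips = append(ips, node.IPAddress)
		}
	}
	return ips
}

func main() {
	nodes := []ComputeDomainNode{
		{Name: "a", IPAddress: "10.0.0.1", CliqueID: "0"},
		{Name: "b", IPAddress: "10.0.0.2", CliqueID: "1"},
		{Name: "c", IPAddress: "10.0.0.3", CliqueID: "0"},
	}
	fmt.Println(ipsForClique(nodes, "0")) // [10.0.0.1 10.0.0.3]
}
```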

return nil, fmt.Errorf("only expected 1 device for requests '%v' in claim '%v'", requests, claim.UID)
}

// Add info about this node to the ComputeDomain status.
Member


nit: The comment and the method name don't seem to quite align. I would expect something like AddNodeStatusToComputeDomain to better indicate intent.

Collaborator Author


I don't feel strongly -- made the change

@klueska klueska force-pushed the remove-reliance-on-label branch from 667bdbe to 03708e7 Compare March 11, 2025 14:32
@klueska klueska merged commit 8d34f97 into NVIDIA:main Mar 11, 2025
7 checks passed
@klueska klueska deleted the remove-reliance-on-label branch August 20, 2025 21:27