
Conversation

@klueska
Collaborator

@klueska klueska commented Mar 11, 2025

Previously we relied on the nvidia.com/clique label being applied to a node before a ComputeDomain could successfully launch an IMEX daemon on that node. However, it's possible that a ComputeDomain might be created before GFD comes online to apply the label. Independently, it's a bad idea to rely on a label to get the cliqueID of a remote node, when the information is readily available on each node where an IMEX daemon is being started.

This PR removes our reliance on the nvidia.com/clique label to set the Status.Nodes field of a ComputeDomain. We now update the Status.Nodes field in a distributed fashion, directly from each node being added to the ComputeDomain. This logic is actually simpler, and removes the need to track and react to modifications on each of the DaemonSet pods from the controller.
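The distributed update described above can be sketched as follows. The type and function names here are hypothetical stand-ins (the real ComputeDomainNode type lives in the project's nvapi package, and the real update goes through the Kubernetes API); the sketch only illustrates the idempotent add-or-update of a node's own entry in Status.Nodes:

```go
package main

import "fmt"

// Hypothetical mirror of the nvapi.ComputeDomainNode type; field names
// follow the snippets shown in this review.
type ComputeDomainNode struct {
	Name      string
	IPAddress string
	CliqueID  string
}

// upsertNode adds this node's entry to the status node list, or updates
// it in place if an entry with the same Name already exists, so that
// repeated calls from the same kubelet plugin are idempotent.
func upsertNode(nodes []ComputeDomainNode, n ComputeDomainNode) []ComputeDomainNode {
	for i, existing := range nodes {
		if existing.Name == n.Name {
			nodes[i] = n
			return nodes
		}
	}
	return append(nodes, n)
}

func main() {
	var nodes []ComputeDomainNode
	nodes = upsertNode(nodes, ComputeDomainNode{Name: "node-a", IPAddress: "10.0.0.1", CliqueID: "0"})
	nodes = upsertNode(nodes, ComputeDomainNode{Name: "node-a", IPAddress: "10.0.0.2", CliqueID: "0"})
	fmt.Println(len(nodes), nodes[0].IPAddress) // 1 10.0.0.2
}
```

In the real code each plugin would then write this list back via an update to the ComputeDomain's status subresource, which is what removes the controller-side pod tracking.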

This change was verified on both a 4-node GH200 cluster as well as a 2-node GB200 cluster, following the procedure outlined here: #249

@copy-pr-bot

copy-pr-bot bot commented Mar 11, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@klueska klueska self-assigned this Mar 11, 2025
@klueska klueska added this to the v25.3.0 milestone Mar 11, 2025
@klueska
Collaborator Author

klueska commented Mar 11, 2025

/ok to test

@ArangoGutierrez ArangoGutierrez requested a review from Copilot March 11, 2025 09:26

Copilot AI left a comment


PR Overview

This pull request removes the dependency on the "nvidia.com/gpu.clique" label for updating the ComputeDomain node status and simplifies the related control logic.

  • Removed the constant and references to the clique label from the kubelet plugin code.
  • Eliminated the pod manager logic from the controller and cleaned up the daemonset-related lifecycle management.
  • Updated ComputeDomain status updates to be performed locally on each node via new helper functions.

Reviewed Changes

File Description
cmd/compute-domain-kubelet-plugin/device_state.go Removed clique label constant and added direct node status updates.
cmd/compute-domain-controller/daemonset.go Removed pod manager variables and related functions.
cmd/compute-domain-kubelet-plugin/computedomain.go Added new functions to update ComputeDomain node status and retrieve node info using m.cliqueID.
cmd/compute-domain-controller/daemonsetpods.go Removed the entire DaemonSetPodManager implementation.

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Collaborator

@ArangoGutierrez ArangoGutierrez left a comment


tested on a GH200 system

@jgehrcke
Collaborator

might be created before GFD comes online to apply the label

GFD? I searched the web. GPU Feature Discovery.


var ipAddress string
for _, addr := range node.Status.Addresses {
    if addr.Type == corev1.NodeInternalIP {
Collaborator

@jgehrcke jgehrcke Mar 11, 2025


Is this the common/established choice to make to filter for the "canonical" IP address?

Is there always one address of type corev1.NodeInternalIP?

Collaborator


yes, each node object has one address of type InternalIP:

❯ k describe no
Name:               ip-10-0-0-xxx
Roles:              control-plane,worker
...
Addresses:
  InternalIP:  10.0.0.xxx
  Hostname:    ip-10-0-0-xxx
...
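The selection being discussed can be sketched as follows, with local stand-ins for the k8s.io/api/core/v1 types (in the real code, node.Status.Addresses comes from the Node object reported by the kubelet):

```go
package main

import "fmt"

// Minimal stand-ins for the corev1 types used in the diff; in the real
// code these come from k8s.io/api/core/v1.
type NodeAddressType string

const NodeInternalIP NodeAddressType = "InternalIP"

type NodeAddress struct {
	Type    NodeAddressType
	Address string
}

// internalIP returns the first address of type InternalIP, matching the
// filtering shown in the diff. Kubelets normally report exactly one
// such address, but the ok flag guards against a node that has none.
func internalIP(addrs []NodeAddress) (string, bool) {
	for _, addr := range addrs {
		if addr.Type == NodeInternalIP {
			return addr.Address, true
		}
	}
	return "", false
}

func main() {
	addrs := []NodeAddress{
		{Type: NodeInternalIP, Address: "10.0.0.42"},
		{Type: "Hostname", Address: "ip-10-0-0-42"},
	}
	ip, ok := internalIP(addrs)
	fmt.Println(ip, ok) // 10.0.0.42 true
}
```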

Collaborator Author


I would actually like to update this to use the non-host IP at some point, but that is untested and will need to wait for a future release.

n := &nvapi.ComputeDomainNode{
    Name:      nodeName,
    IPAddress: ipAddress,
    CliqueID:  m.cliqueID,
Collaborator


Note to self: we set m.cliqueID (string) upon calling NewComputeDomainManager(). That is probably populated straight from the source of truth (code not shown in diff).

Collaborator Author


yes, this was already available for other reasons

Collaborator

@jgehrcke jgehrcke left a comment


Thank you!

when the information is readily available on each node where an IMEX daemon is being started

Thanks for that argument. It's really the single source of truth concept. The label is an indirection, adding complexity and new failure modes.

This logic is actually simpler

More robust by way of simplicity. Lovely.


var ips []string
for _, node := range cd.Status.Nodes {
    if m.cliqueID == node.CliqueID {
Member


Under which conditions would we expect a different CliqueID here? It seems that we're only setting this value once for the nodes.

Collaborator Author


When the compute domain spans nodes that have GPUs from different cliques. When this happens, we only want to include the node IPs in the node_config.cfg from nodes that have GPUs with the same cliqueID as the current node.
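That filtering can be sketched as follows; ComputeDomainNode here is a hypothetical stand-in mirroring the fields shown in the diff, and ipsForClique is an illustrative helper, not the project's actual function:

```go
package main

import "fmt"

// Hypothetical mirror of the nvapi.ComputeDomainNode type used in the PR.
type ComputeDomainNode struct {
	Name      string
	IPAddress string
	CliqueID  string
}

// ipsForClique collects the IPs of only those nodes whose CliqueID
// matches the local node's, mirroring the loop in the diff that builds
// the IMEX daemon's node config.
func ipsForClique(nodes []ComputeDomainNode, cliqueID string) []string {
	var ips []string
	for _, node := range nodes {
		if node.CliqueID == cliqueID {
			ips = append(ips, node.IPAddress)
		}
	}
	return ips
}

func main() {
	nodes := []ComputeDomainNode{
		{Name: "a", IPAddress: "10.0.0.1", CliqueID: "0"},
		{Name: "b", IPAddress: "10.0.0.2", CliqueID: "1"},
		{Name: "c", IPAddress: "10.0.0.3", CliqueID: "0"},
	}
	fmt.Println(ipsForClique(nodes, "0")) // [10.0.0.1 10.0.0.3]
}
```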

return nil, fmt.Errorf("only expected 1 device for requests '%v' in claim '%v'", requests, claim.UID)
}

// Add info about this node to the ComputeDomain status.
Member


nit: The comment and the method name don't seem to quite align. I would expect something like AddNodeStatusToComputeDomain to better indicate intent.

Collaborator Author


I don't feel strongly -- made the change

@klueska klueska force-pushed the remove-reliance-on-label branch from 667bdbe to 03708e7 Compare March 11, 2025 14:32
@klueska klueska merged commit 8d34f97 into NVIDIA:main Mar 11, 2025
7 checks passed
@klueska klueska deleted the remove-reliance-on-label branch August 20, 2025 21:27