Remove reliance on nvidia.com/gpu.clique label #276
Conversation
/ok to test
PR Overview
This pull request removes the dependency on the "nvidia.com/gpu.clique" label for updating the ComputeDomain node status and simplifies the related control logic.
- Removed the constant and references to the clique label from the kubelet plugin code.
- Eliminated the pod manager logic from the controller and cleaned up the daemonset-related lifecycle management.
- Updated ComputeDomain status updates to be performed locally on each node via new helper functions.
Reviewed Changes
| File | Description |
|---|---|
| cmd/compute-domain-kubelet-plugin/device_state.go | Removed clique label constant and added direct node status updates. |
| cmd/compute-domain-controller/daemonset.go | Removed pod manager variables and related functions. |
| cmd/compute-domain-kubelet-plugin/computedomain.go | Added new functions to update ComputeDomain node status and retrieve node info using m.cliqueID. |
| cmd/compute-domain-controller/daemonsetpods.go | Removed the entire DaemonSetPodManager implementation. |
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
ArangoGutierrez left a comment:
tested on a GH200 system
GFD? I searched the web: GPU Feature Discovery.
```go
var ipAddress string
for _, addr := range node.Status.Addresses {
	if addr.Type == corev1.NodeInternalIP {
```
Is this the common/established way to filter for the "canonical" IP address?
Is there always exactly one address of type `corev1.NodeInternalIP`?
yes, each node object has one address:

```
❯ k describe no
Name:    ip-10-0-0-xxx
Roles:   control-plane,worker
...
Addresses:
  InternalIP:  10.0.0.xxx
  Hostname:    ip-10-0-0-xxx
...
```
I would like to actually update this to use the non-host IP at some point, but this is untested and will need to wait for a future release.
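As a side note, the filtering pattern discussed here can be written as a small standalone helper. This is a sketch under the assumption (discussed above) that one InternalIP entry exists per node; the function name and error handling are mine, not from the PR:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// internalIPOf returns the first InternalIP listed in the node's status.
// Nodes normally carry exactly one address of this type, so "first match"
// and "only match" coincide in practice.
func internalIPOf(node *corev1.Node) (string, error) {
	for _, addr := range node.Status.Addresses {
		if addr.Type == corev1.NodeInternalIP {
			return addr.Address, nil
		}
	}
	return "", fmt.Errorf("node %q has no address of type %s", node.Name, corev1.NodeInternalIP)
}
```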
```go
n := &nvapi.ComputeDomainNode{
	Name:      nodeName,
	IPAddress: ipAddress,
	CliqueID:  m.cliqueID,
```
Note to self: we set `m.cliqueID` (string) upon calling `NewComputeDomainManager()`. That is probably populated straight from the source of truth (code not shown in this diff).
yes, this was already available for other reasons
Thank you!
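For readers reconstructing the flow: a minimal sketch of a manager that records the locally discovered clique ID at construction time. The field set and constructor signature here are assumptions for illustration; the real `NewComputeDomainManager` is not shown in this diff.

```go
// ComputeDomainManager holds per-node state for the kubelet plugin.
// Only the cliqueID field is referenced in this diff; everything else
// about the real type is omitted here.
type ComputeDomainManager struct {
	cliqueID string // clique ID of the local node, discovered at plugin startup
}

// NewComputeDomainManager records the locally discovered clique ID once,
// so later status updates never need to consult a node label.
func NewComputeDomainManager(cliqueID string) *ComputeDomainManager {
	return &ComputeDomainManager{cliqueID: cliqueID}
}
```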
> when the information is readily available on each node where an IMEX daemon is being started

Thanks for that argument. It's really the single-source-of-truth concept. The label is an indirection, adding complexity and new failure modes.

> This logic is actually simpler

More robust as a result of its simplicity. Lovely.
```go
var ips []string
for _, node := range cd.Status.Nodes {
	if m.cliqueID == node.CliqueID {
```
Under which conditions would we expect a different CliqueID here? It seems that we're only setting this value once for the nodes.
When the compute domain spans nodes that have GPUs from different cliques. When this happens, we only want to include the node IPs in the node_config.cfg from nodes that have GPUs with the same cliqueID as the current node.
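That selection can be read as partitioning the ComputeDomain's nodes by clique. A hedged sketch of the same logic as a standalone function (the struct mirrors the `nvapi.ComputeDomainNode` fields shown in the diff; the function name is mine):

```go
// ComputeDomainNode mirrors the fields of nvapi.ComputeDomainNode that
// appear in this diff.
type ComputeDomainNode struct {
	Name      string
	IPAddress string
	CliqueID  string
}

// ipsInLocalClique collects the IPs of all nodes in the ComputeDomain
// status that share the local node's clique ID; only these peers belong
// in this node's IMEX node_config.cfg.
func ipsInLocalClique(nodes []ComputeDomainNode, localCliqueID string) []string {
	var ips []string
	for _, node := range nodes {
		if node.CliqueID == localCliqueID {
			ips = append(ips, node.IPAddress)
		}
	}
	return ips
}
```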
```go
	return nil, fmt.Errorf("only expected 1 device for requests '%v' in claim '%v'", requests, claim.UID)
}

// Add info about this node to the ComputeDomain status.
```
nit: The comment and the method name don't seem to quite align. I would expect something like `AddNodeStatusToComputeDomain` to better indicate intent.
I don't feel strongly -- made the change
Signed-off-by: Kevin Klues <[email protected]>
667bdbe to 03708e7
Previously we were relying on the existence of the `nvidia.com/gpu.clique` label to be applied to a node before a ComputeDomain could successfully launch an IMEX daemon on that node. However, it's possible that a ComputeDomain might be created before GFD comes online to apply the label. Independently, it's a bad idea to rely on a label to get the `cliqueID` of a remote node when the information is readily available on each node where an IMEX daemon is being started.

This PR removes our reliance on the `nvidia.com/gpu.clique` label to set the `Status.Nodes` field of a `ComputeDomain`. We now update the `Status.Nodes` field in a distributed fashion, directly from each node being added to the ComputeDomain. This logic is actually simpler, and it removes the need to track and react to modifications on each of the `DaemonSet` pods from the controller.

This change was verified on both a 4-node GH200 cluster and a 2-node GB200 cluster, following the procedure outlined here: #249
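To make the "distributed fashion" concrete: each kubelet plugin only needs to upsert its own entry into `Status.Nodes`. A rough sketch of that core step, reusing the `ComputeDomainNode` struct from the sketch above; the function name is mine, and the surrounding client plumbing (get, status update, retry on conflict) is omitted:

```go
// upsertNodeStatus returns the ComputeDomain node list with this node's
// entry added or refreshed in place. The caller is expected to write the
// result back through the ComputeDomain status subresource, retrying on
// update conflicts since every node edits the same object.
func upsertNodeStatus(nodes []ComputeDomainNode, name, ip, cliqueID string) []ComputeDomainNode {
	for i := range nodes {
		if nodes[i].Name == name {
			nodes[i].IPAddress = ip
			nodes[i].CliqueID = cliqueID
			return nodes
		}
	}
	return append(nodes, ComputeDomainNode{Name: name, IPAddress: ip, CliqueID: cliqueID})
}
```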