Conversation

@klueska
Collaborator

@klueska klueska commented Nov 18, 2025

No description provided.

@klueska klueska added this to the v25.12.0 milestone Nov 18, 2025
@klueska klueska self-assigned this Nov 18, 2025
@klueska klueska added bug Issue/PR to expose/discuss/fix a bug backport-25.8 labels Nov 18, 2025
@copy-pr-bot

copy-pr-bot bot commented Nov 18, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@klueska
Collaborator Author

klueska commented Nov 18, 2025

/ok to test ecd1393

  newCD.Status.Nodes = updatedNodes
- if _, err := m.config.clientsets.Nvidia.ResourceV1beta1().ComputeDomains(newCD.Namespace).UpdateStatus(ctx, newCD, metav1.UpdateOptions{}); err != nil {
+ newCD, err = m.config.clientsets.Nvidia.ResourceV1beta1().ComputeDomains(newCD.Namespace).UpdateStatus(ctx, newCD, metav1.UpdateOptions{})
+ if err != nil {
Collaborator

Connecting dots: we noticed this before; that's the only unresolved comment on #651:

#651 (comment)

@jgehrcke jgehrcke added the robustness issue/pr: edge cases & fault tolerance label Nov 18, 2025
Collaborator

@jgehrcke jgehrcke left a comment

Thank you! Of course this is what we want, for various reasons.

However, I still need to convince myself that this patch alone fixes a well-understood bug.

I am slowly catching up with the Slack thread that sparked reviewing this code section (and making the change).

Here, we make a change in removeNodeFromComputeDomain(). I wonder how that part of the business logic relates to the scenario in the user's environment. It appears as if that scenario is about cleanly growing a CD over many nodes (and hence node removal from that CD should not be part of that scenario).

In any case, let's get this in. And then worry about classification of this patch later when we have learned/understood more.

@jgehrcke jgehrcke removed the bug Issue/PR to expose/discuss/fix a bug label Nov 20, 2025
@jgehrcke
Collaborator

Not sure about the bug this fixes, but let's certainly land this.

@jgehrcke jgehrcke merged commit 65c921d into NVIDIA:main Nov 20, 2025
14 checks passed
@klueska
Collaborator Author

klueska commented Nov 24, 2025

/cherry-pick release-25.8

@github-actions

🤖 Backport PR created for release-25.8: #740

Labels

backport-25.8 cherry-pick/release-25.8 robustness issue/pr: edge cases & fault tolerance
