Skip to content

Conversation

@jgehrcke
Copy link
Collaborator

A patch for #321.

To emit

error getting status of nodes in ComputeDomain: no nodes set yet

instead of

error getting status of nodes in ComputeDomain: %!w(<nil>)

If I am not mistaken, the condition of cd.Status.Nodes == nil is not erroneous, but just a transient state where no node status has been set yet via AddNodeStatusToComputeDomain().

@copy-pr-bot
Copy link

copy-pr-bot bot commented Apr 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


if cd.Status.Nodes == nil {
return nil, fmt.Errorf("error getting status of nodes in ComputeDomain: %w", err)
return nil, fmt.Errorf("error getting status of nodes in ComputeDomain: no nodes set yet")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not critical, but should this just be:

Suggested change
return nil, fmt.Errorf("error getting status of nodes in ComputeDomain: no nodes set yet")
return nil, fmt.Errorf("no nodes set")

(I would remove the yet personally)

Going from Nodes == nil to error getting status of nodes implies intimate knowledge of GetComputDomain. Would it be better to just return an error from there instead?

We could also remove this check entirely sincelen(nil) == 0 it should fail the check bellow as well (assuming that NumNodes == 0 is not a valid configuration).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Evan! Thanks for looking :)

... implies intimate knowledge of GetComputDomain. Would it be better to just return an error from there instead?

Good question. I'd say "no" because GetComputeDomain() is expected to return the CD object regardless of where the CD is in its lifecycle.

Hence, it can at times return a CD that does not have .Status.Nodes populated yet. That's an expected outcome IIUC.

Node status is added out-of-band via (potentially many calls from different nodes to) AddNodeStatusToComputeDomain() at an unpredictable point in time.

That is, via multiple calls to GetNodeIPs() we detect and respond to changes in a CD's lifecycle. Only within GetNodeIPs() it is an error when no node status has been set yet.

should this just be: ... no nodes set

I like to shorten this further, yes! In terms of concatenation of error messages, let's optimize for readability in case where this is prefixed with "error getting node IPs: ..."

I like "error getting node IPs: no nodes set yet for ComputeDomain"

I would remove the yet personally

The "yet" does not convey critical information, agree. It does however vaguely communicate the notion of an expected, temporary error (as opposed to unexpected, permanent error -- a distinction that is best done not just with words like "yet" :-)).

We could also remove this check entirely sincelen(nil) == 0

Ack. But I think it's good to have the fine-grained output so that we can see state transitioning explicitly: no nodes seen -> not all nodes seen -> all nodes seen

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I think having different error messages for the two cases is fine. On the use of "yet". Looking at the logic, it is not within the scope of this function to determine whether the error is treated as transient or not. I would then remove the "yet" and have intent to retry conveyed in the caller or through some other mechanism.

"error getting node IPs: no nodes set for ComputeDomain"

sounds good. (I won't fight hard on the "yet" :) )

Copy link
Member

@elezar elezar Apr 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to note is that the error getting node IPs would beis prepended in the caller, so here we would only have:

Suggested change
return nil, fmt.Errorf("error getting status of nodes in ComputeDomain: no nodes set yet")
return nil, fmt.Errorf("no nodes set for ComputeDomain")

Update: The caller is:

	nodeIPs, err := s.manager.GetNodeIPs(ctx, s.domain)
	if err != nil {
		return fmt.Errorf("error getting node IPs: %w", err)
	}

Copy link
Collaborator Author

@jgehrcke jgehrcke Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate your arguments!

the "error getting node IPs" is prepended in the caller

:) yesss. That is what I meant when I wrote

where this is prefixed with "error getting node IPs: ..."

About "yet":

have intent to retry conveyed in the caller or through some other mechanism

it is not within the scope of this function to determine whether the error is treated as transient or not.

I appreciate this perspective. I share this perspective, had also shared it before.

I did weigh these above arguments against other arguments, because what's also true:

  • no entity can distinguish those two cases -- only time does tell, in combination with a timeout criterion
  • we know that typically this is a transient error (in the vast majority of the cases: all happy path cases)
  • the fact that things "will (probably?) be retried again soon" is not easy to log in this case

Considering that big picture, adding the "yet" was my pragmatic way to indicate to the not-so-informed reader that this is generally not an error to worry about, a little bit of value added with just three letters.

I will remove the "yet": because we're all informed log readers :).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done yet? Not yet, we haven't yet discussed:

not all nodes populated in ComputeDomain status yet

That's the current error message right below the one we're discussing here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed commit, without the "yet" :).

Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jgehrcke. Looks great now.

@jgehrcke jgehrcke merged commit b26e244 into NVIDIA:main Apr 25, 2025
7 checks passed
@klueska klueska added this to the v25.3.0 milestone Aug 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

3 participants