Conversation

@klueska
Collaborator

@klueska klueska commented May 27, 2025

This PR is actually only the top-commit. The other commit is from #376 and will go away once this is merged.

The motivation for this is as follows:

  1. We want to simplify the code in the helm chart to not rely on bash scripts
  2. We want to add extra logic into this binary to eventually handle IMEX node IPs as hostname
  3. We want to use this binary to start surfacing error conditions / metrics from running the IMEX daemon to the ComputeDomain object.

These new features will be added as follow-up PRs.

@copy-pr-bot

copy-pr-bot bot commented May 27, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@klueska klueska requested a review from jgehrcke May 27, 2025 21:56
@klueska klueska force-pushed the compute-domain-daemon-cmd branch 4 times, most recently from 5309fbf to 4b5bc74 Compare May 27, 2025 23:21
signal.Notify(sigChan, syscall.SIGTERM)
go func() {
<-sigChan
cancel()
Collaborator

I like to emit a log message like "got SIGTERM, initiating shutdown" or something like that, to see how snappy and coordinated the shutdown procedure is.

Collaborator Author

I'm fine with that. Will add
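
A minimal sketch of the suggestion in this thread: log before cancelling the context so the log timestamps show how long the shutdown takes. The helper names `cancelOnSignal` and `demo` are hypothetical and not taken from the PR; the standard `log` package stands in for whatever logger the binary actually uses.

```go
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
)

// cancelOnSignal returns a context that is cancelled as soon as a signal
// arrives on sigChan. It logs before cancelling so the log timestamps show
// how long the rest of the shutdown takes.
func cancelOnSignal(parent context.Context, sigChan <-chan os.Signal) (context.Context, context.CancelFunc) {
	ctx, cancel := context.WithCancel(parent)
	go func() {
		select {
		case sig := <-sigChan:
			log.Printf("got %v, initiating shutdown", sig)
			cancel()
		case <-ctx.Done():
			// Cancelled for another reason; nothing to log.
		}
	}()
	return ctx, cancel
}

// demo injects a SIGTERM value by hand so the example terminates on its own.
// It reports whether the context ended up cancelled.
func demo() bool {
	sigChan := make(chan os.Signal, 1)
	signal.Notify(sigChan, syscall.SIGTERM)
	defer signal.Stop(sigChan)

	ctx, cancel := cancelOnSignal(context.Background(), sigChan)
	defer cancel()

	sigChan <- syscall.SIGTERM // stand-in for a real `kill -TERM <pid>`
	<-ctx.Done()
	return ctx.Err() == context.Canceled
}

func main() {
	if demo() {
		log.Print("context cancelled, shutdown complete")
	}
}
```

In a real daemon the hand-injected signal would be dropped and the main routine would simply block on `ctx.Done()` or pass `ctx` to the child process.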

return f(ctx)
}

// Create the app
Collaborator

Is that an AI code comment? :-) haha (sorry, not judging)

Collaborator Author

No, it's the logical progression of comments from what's before and what's after (if all you did was read the comments to understand at a high level what this function is doing as you walk through it).

},
{
Name: "check",
Usage: "Check if the node is IMEX capable and if the IMEX daemon is ready",
Collaborator

opinion on "IMEX capable" vs. "IMEX-capable"?

if !capable {
fmt.Println("ClusterUUID and CliqueId are NOT set for GPUs on this node.")
fmt.Println("The IMEX daemon will not be started.")
fmt.Println("Sleeping forever...")
Collaborator

should we emit these details within checkIMEXCapable(), closer to the code where we check for these things?

here, we could then log a neutral "not IMEX-capable"

Collaborator Author

To me it makes more sense to do it at the top level rather than buried in a helper function since this is info for the end user.

Collaborator

I was thinking about maintainability because the relationship between !capable and "ClusterUUID and CliqueId are NOT set" is defined elsewhere and might change (and we might forget changing the user output then).

(this is nit-level and I am OK with whatever you choose)
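
One way to reconcile both positions in this thread, sketched under assumptions: the helper returns the reason alongside the verdict, and the top level decides how to print it. Everything here is hypothetical, including the `gpuInfo` and `imexCapability` types, the name `checkIMEXCapableWithReason`, and the exact condition; the real `checkIMEXCapable()` is not shown in this conversation.

```go
package main

import "fmt"

// gpuInfo is a hypothetical stand-in for whatever the real check inspects.
type gpuInfo struct {
	ClusterUUID string
	CliqueID    string
}

// imexCapability pairs the verdict with a human-readable reason, so the text
// is defined next to the condition it describes.
type imexCapability struct {
	Capable bool
	Reason  string
}

// checkIMEXCapableWithReason is an illustrative variant of the check. The
// condition used here (both fields non-empty on every GPU) is an assumption.
func checkIMEXCapableWithReason(gpus []gpuInfo) imexCapability {
	if len(gpus) == 0 {
		return imexCapability{Reason: "no GPUs found on this node"}
	}
	for i, g := range gpus {
		if g.ClusterUUID == "" || g.CliqueID == "" {
			return imexCapability{
				Reason: fmt.Sprintf("ClusterUUID or CliqueId is not set for GPU %d on this node", i),
			}
		}
	}
	return imexCapability{Capable: true}
}

func main() {
	result := checkIMEXCapableWithReason([]gpuInfo{{ClusterUUID: "", CliqueID: ""}})
	if !result.Capable {
		// User-facing output stays at the top level; only the reason
		// string comes from the helper.
		fmt.Println("Not IMEX-capable:", result.Reason)
		fmt.Println("The IMEX daemon will not be started.")
	}
}
```

With this shape, a change to the condition and a change to the message happen in the same function, which addresses the maintainability concern without moving the printing into the helper.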

// It logs any errors that occur during shutdown but does not return them,
// as this is typically called in a defer statement.
func (l deviceLib) alwaysShutdown() {
ret := l.nvmllib.Shutdown()
Collaborator

I wonder what's in this library that needs to be shut down. Does it implement some kind of server or event handler / queue?

Collaborator Author

I don't know what all the underlying NVML library does, but (at a minimum) it drops the handle it holds on the GPU driver (allowing e.g. a driver kernel module to be removed).
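
The pattern being discussed (initialize, defer a shutdown that logs instead of returning errors) can be sketched without the real NVML binding. The `lib` interface and `fakeLib` type below are hypothetical stand-ins that use plain Go errors; the `ret := l.nvmllib.Shutdown()` line in the diff suggests the real call returns a status value instead.

```go
package main

import (
	"errors"
	"log"
)

// lib is a hypothetical stand-in for the NVML binding.
type lib interface {
	Init() error
	Shutdown() error
}

// fakeLib tracks whether a "driver handle" is currently held.
type fakeLib struct {
	handleHeld bool
}

func (f *fakeLib) Init() error {
	f.handleHeld = true
	return nil
}

func (f *fakeLib) Shutdown() error {
	if !f.handleHeld {
		return errors.New("shutdown called without a held handle")
	}
	f.handleHeld = false
	return nil
}

// alwaysShutdown logs a shutdown failure instead of returning it, which is
// what makes it convenient to call from a defer statement.
func alwaysShutdown(l lib) {
	if err := l.Shutdown(); err != nil {
		log.Printf("error shutting down library: %v", err)
	}
}

// withLib initializes the library, runs f, and releases the handle on every
// return path, including when f fails.
func withLib(l lib, f func() error) error {
	if err := l.Init(); err != nil {
		return err
	}
	defer alwaysShutdown(l)
	return f()
}

func main() {
	l := &fakeLib{}
	err := withLib(l, func() error {
		log.Printf("handle held during work: %v", l.handleHeld)
		return nil
	})
	log.Printf("err=%v, handle held after work: %v", err, l.handleHeld)
}
```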

cmd := exec.CommandContext(ctx, imexBinary, "-c", config)
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
return cmd.Run()
Collaborator

@jgehrcke jgehrcke May 29, 2025

I forgot some details of what we talked about. But now when we wrap the IMEX daemon process like this, let's review some config settings. Maybe:

  • LOG_FILE_NAME="" -- according to docs, if empty, this emits to stderr
  • DAEMONIZE=0 -- maybe that's even required now to properly shut it down?

We currently also still have IMEX_WAIT_FOR_QUORUM=RECOVERY

(https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/9732ac32c1e77dc9b08cf741830b5fb3529787d4/templates/compute-domain-daemon-config.tmpl.cfg#L106C1-L106C30)

Whereas I think in our conversations we explored setting this to

NONE: Do not wait for any quorum with other nodes.

Admittedly, I have not fully understood RECOVERY yet -- the docs start with "In case of unsafe IMEX termination", so this is probably about a previous crash. RECOVERY is probably the same as NONE when there was no previous crash ('unsafe IMEX termination').

Collaborator Author

This is important stuff to consider, but I don't want to do that in this PR -- this is just about migrating what we had previously to a Go binary instead of a bash script.

Collaborator

Agree!

@klueska klueska force-pushed the compute-domain-daemon-cmd branch from 4b5bc74 to 134e75e Compare May 29, 2025 12:55
Comment on lines +239 to +184
// tail continuously reads and prints new lines from the specified file using the system's tail command.
// It starts from the beginning of the file (-n +1) and follows new lines (-f).
// It blocks until the context is cancelled or an error occurs.
func tail(ctx context.Context, path string) error {
	cmd := exec.CommandContext(ctx, "tail", "-n", "+1", "-f", path)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}
Collaborator Author

Instead of direct stdout / stderr, maybe I should intercept each line of output so that we can log it with klog and keep it consistent with other log messages being emitted by this program.

Collaborator

@jgehrcke jgehrcke May 29, 2025

I kind of like having a tight process wrapper where we make the child process cleanly inherit parent's stdout/stderr, and make the child (well, the imex daemon) emit its log to stderr.

But I also like what you propose, we then have consistent timestamps and given the right needle one can isolate the daemon's output with grep.

We probably will iterate on this later anyway.
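
A rough sketch of the "intercept each line" alternative discussed in this thread, assuming the child's output is read through pipes and re-emitted through a Printf-style logger. The names `logLines`, `runLogged`, `demoLines`, and the `[imex] ` prefix are hypothetical. The standard `log.Printf` stands in for klog here; `klog.Infof` has the same `(format, args...)` shape.

```go
package main

import (
	"bufio"
	"context"
	"fmt"
	"io"
	"log"
	"os/exec"
	"strings"
	"sync"
)

// logFunc matches the shape of log.Printf. It must be safe to call from
// several goroutines at once.
type logFunc func(format string, args ...any)

// logLines reads r line by line and hands each line, with a prefix, to logf.
// bufio.Scanner rejects lines longer than 64 KiB unless its buffer is raised.
func logLines(r io.Reader, prefix string, logf logFunc) error {
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		logf("%s%s", prefix, scanner.Text())
	}
	return scanner.Err()
}

// runLogged starts the command and routes its stdout and stderr through logf.
func runLogged(ctx context.Context, logf logFunc, name string, args ...string) error {
	cmd := exec.CommandContext(ctx, name, args...)
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return err
	}
	stderr, err := cmd.StderrPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return err
	}
	var wg sync.WaitGroup
	for _, r := range []io.Reader{stdout, stderr} {
		wg.Add(1)
		go func(r io.Reader) {
			defer wg.Done()
			if err := logLines(r, "[imex] ", logf); err != nil {
				logf("reading child output: %v", err)
			}
		}(r)
	}
	// All pipe reads must finish before Wait, which closes the pipes.
	wg.Wait()
	return cmd.Wait()
}

// demoLines runs logLines over an in-memory string and collects the output.
func demoLines(input string) []string {
	var out []string
	collect := func(format string, args ...any) {
		out = append(out, fmt.Sprintf(format, args...))
	}
	_ = logLines(strings.NewReader(input), "[imex] ", collect)
	return out
}

func main() {
	for _, line := range demoLines("first\nsecond\n") {
		log.Print(line)
	}
	// Assumes an `echo` binary on PATH; any command would do.
	if err := runLogged(context.Background(), log.Printf, "echo", "hello from child"); err != nil {
		log.Printf("child failed: %v", err)
	}
}
```

The fixed prefix is the "needle" mentioned above: it lets the daemon's lines be isolated with grep while they share timestamps with the wrapper's own messages. The trade-off against plain stdout/stderr inheritance is the extra goroutines and the line-length limit.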

Collaborator

@jgehrcke jgehrcke left a comment

As I'm going to be away for a bit: feel free to land this when it feels good to you.

@klueska klueska force-pushed the compute-domain-daemon-cmd branch from 134e75e to 5382099 Compare May 29, 2025 21:39
@klueska klueska force-pushed the compute-domain-daemon-cmd branch from 5382099 to eb75a1c Compare May 29, 2025 22:05
@klueska
Collaborator Author

klueska commented Jun 1, 2025

I'm going to land this as-is, as I have a number of follow-up PRs that build on this.

@klueska klueska merged commit 7ded33d into NVIDIA:main Jun 1, 2025
7 checks passed
@klueska klueska added this to the v25.3.0 milestone Aug 13, 2025
@klueska klueska deleted the compute-domain-daemon-cmd branch August 20, 2025 21:26