-
Notifications
You must be signed in to change notification settings - Fork 99
Move logic to run / check status of IMEX daemons into go binary #379
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
5309fbf to
4b5bc74
Compare
| signal.Notify(sigChan, syscall.SIGTERM) | ||
| go func() { | ||
| <-sigChan | ||
| cancel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like to emit a log message like "got SIGTERM, initiating shutdown" or something like that, to see how snappy and coordinated the shutdown procedure is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with that. Will add
| return f(ctx) | ||
| } | ||
|
|
||
| // Create the app |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that an AI code comment? :-) haha (sorry, not judging)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, its the logical progression of comments from whats before and whats after (if all you did was read the comments to understand at a high level what this function is doing as you walk through it.
| }, | ||
| { | ||
| Name: "check", | ||
| Usage: "Check if the node is IMEX capable and if the IMEX daemon is ready", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
opinion on "IMEX capable" vs. "IMEX-capable"?
| if !capable { | ||
| fmt.Println("ClusterUUID and CliqueId are NOT set for GPUs on this node.") | ||
| fmt.Println("The IMEX daemon will not be started.") | ||
| fmt.Println("Sleeping forever...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we emit these details within checkIMEXCapable(), closer to the code where we check for these things?
here, we could then log a neutral "not IMEX-capable"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me it makes more sense to do it at the top level rather than buried in a helper function since this is info for the end user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about maintainability because the relationship between !capable and "ClusterUUID and CliqueId are NOT set" is defined elsewhere and might change (and we might forget changing the user output then).
(this is nit-level and I am OK with whatever you choose)
cmd/compute-domain-daemon/nvml.go
Outdated
| // It logs any errors that occur during shutdown but does not return them, | ||
| // as this is typically called in a defer statement. | ||
| func (l deviceLib) alwaysShutdown() { | ||
| ret := l.nvmllib.Shutdown() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder what's in this library that needs to be shut down. Does it implement some kind of server or event handler / queue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know what all the underlying NVML library does, but (at a minimum) it drops the handle tit holds o the GPU driver (allowing e.g. a driver kernel module to be removed).
| cmd := exec.CommandContext(ctx, imexBinary, "-c", config) | ||
| cmd.Stdout = os.Stdout | ||
| cmd.Stderr = os.Stderr | ||
| return cmd.Run() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot some details of what we talked about. But now when we wrap the IMEX daemon process like this, let's review some config settings. Maybe:
LOG_FILE_NAME=""-- according to docs, if empty, this emits to stderrDAEMONIZE=0-- maybe that's even required now to properly shut it down?
We currently also still have IMEX_WAIT_FOR_QUORUM=RECOVERY
Whereas I think in our conversations we explored setting this to
NONE: Do not wait for any quorum with other nodes.
Arguably, I have not fully understood RECOVERY yet -- docs start with "In case of unsafe IMEX termination", so this is probably about a previous crash. Probably RECOVERY is the same as NONE when there was no previous crash ('unsafe IMEX termination').
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is important stuff to consider, but I don't want to do that in this PR -- this is just about migrating what we had previously to a go binary instead of a bash script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree!
4b5bc74 to
134e75e
Compare
| // tail continuously reads and prints new lines from the specified file using the system's tail command. | ||
| // It starts from the beginning of the file (-n +1) and follows new lines (-f). | ||
| // It blocks until the context is cancelled or an error occurs. | ||
| func tail(ctx context.Context, path string) error { | ||
| cmd := exec.CommandContext(ctx, "tail", "-n", "+1", "-f", path) | ||
| cmd.Stdout = os.Stdout | ||
| cmd.Stderr = os.Stderr | ||
| return cmd.Run() | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of direct stdout / stderr, maybe I should intercept each line of output so that we can log it with klog and keep it consistent with other log messages being emitted by this program.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of like having a tight process wrapper where we make the child process cleanly inherit parent's stdout/stderr, and make the child (well, the imex daemon) emit its log to stderr.
But I also like what you propose, we then have consistent timestamps and given the right needle one can isolate the daemon's output with grep.
We probably will iterate on this later anyway.
jgehrcke
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I'm going to be away for a bit: feel free to land this when it feels good to you.
Signed-off-by: Kevin Klues <[email protected]>
134e75e to
5382099
Compare
Signed-off-by: Kevin Klues <[email protected]>
5382099 to
eb75a1c
Compare
|
I'm going to land this as-is, as I have a number of follow-up PRs that build on this. |
This PR is actually only the top-commit. The other commit is from #376 and will go away once this is merged.
The motivation for this is as follows:
These new features will be added as follow-up PRs.