Add liveness probe to kubelet-plugin #453

Conversation
@guptaNswati, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@guptaNswati, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

/ok to test 569c69a

Quick test: ToDo: simulate a failure test

cc @nojnhuh
Just looked at this kubernetes-sigs/dra-example-driver#113 make kubelet path configurable:

Let's first backport this so that the diff for the changes in this PR is even more minimal:

That last commit should be its own PR linked to the issue describing it, as well as the equivalent PR in the example driver repo.

Oh.. To keep parity with the example-driver? Or will this make it easier to review? Updated: #464
    },
    &cli.StringFlag{
        Name: "kubelet-registrar-directory-path",
        Usage: "Absolute path to the directory where kubelet stores plugin registrations.",
IIUC, this is the directory that the kubelet monitors for unix domain sockets (files) placed by plugins.
So, we should change the text to something like
"Absolute path to the directory that the kubelet monitors for unix domain sockets for plugin registration".
(The kubelet does not store plugin registrations there; that, I believe, is a wrong concept.)
    resources: {}
    # Port running a gRPC health service checked by a livenessProbe.
    # Set to a negative value to disable the service and the probe.
    healthcheckPort: 51518
What can we say about this port number, and it potentially ever being in conflict with something else?
The kubelet plugin does not use hostNetwork: true -- in that sense, it probably has all ports to itself. Is that right? (I am speculating here, one of us should understand this deeply! :))
If there is no conflict, ever, we can pick a different number. One that looks less magic.
Honestly, I didn't think much about it. Based on some reading... there are 65k+ ports available, and this number is so arbitrary that it's rare to conflict with usual port numbers like the 5000s or 8080s (standard health and monitoring ports). And like you said, without a shared host network, each pod has all 65k+ ports to itself, and I cannot imagine anything consuming all of them, or this specific one.
Upstream sets it to healthcheckPort: 51515. When I didn't know better, deviation from upstream is not wanted :)
Why not just leave it as 51515 and then make the other one 51516 to keep the diffs as small as possible?
    resources:
    {{- toYaml .Values.kubeletPlugin.containers.computeDomains.resources | nindent 10 }}
    {{/*
    A literal "0" will allocate a random port. Don't configure the probe
Hm. When instantiating a TCP listener, a literal 0 will lead to a system-provided port -- yes.
Here, we specify where to connect to (if I am not mistaken), so specifying zero probably never makes sense. Happy to learn where I am off!
"Port to start a gRPC healthcheck service. When positive, a literal port number. When zero, a random port is allocated. When negative, the healthcheck service is disabled."
That's how you can configure them in Helm. This is the default: 51515.
The code below has this check:
{{- if (gt (int .Values.kubeletPlugin.containers.plugin.healthcheckPort) 0) }}
So if one were to specify 0 in the values.yaml file, then the section specifying the port below would be omitted.
Essentially not running a liveness probe at all.
A better implementation would probably have been to put some code in validation.yaml to error out if healthcheckPort == 0 rather than silently omitting the liveness probe altogether.
Let's leave that for a follow-up so that this code continues to follow the original PR from the example driver as closely as possible.
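The follow-up suggested above could look something like this Helm template fragment. This is a hypothetical sketch (the values path is taken from the snippet quoted earlier; the file name validation.yaml is the reviewer's suggestion), using Helm's built-in fail function to reject a zero port explicitly instead of silently dropping the probe:

```yaml
{{- /* Hypothetical sketch for templates/validation.yaml: reject an
       ambiguous zero port instead of silently omitting the probe. */}}
{{- if eq (int .Values.kubeletPlugin.containers.plugin.healthcheckPort) 0 }}
{{- fail "kubeletPlugin.containers.plugin.healthcheckPort must not be 0; use a positive port, or a negative value to disable the probe" }}
{{- end }}
```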
cmd/gpu-kubelet-plugin/main.go (Outdated)

    },
    &cli.IntFlag{
        Name: "healthcheck-port",
        Usage: "Port to start a gRPC healthcheck service. When positive, a literal port number. When zero, a random port is allocated. When negative, the healthcheck service is disabled.",
a random port
It's not quite random. The operating system yields the next free one.
(for your/our understanding, I don't ask to change the wording -- if you want to change the wording however, please go ahead :)).
you don't always know without looking what's the next available port, so for a general user, it is random :-D
I think the technically correct word here would be "arbitrary", not "random", but that's a bit pedantic. We can revisit this as a follow-up since this is an exact copy of the original PR and I'd like to keep the diffs as small as possible to ensure correctness.
    }
    klog.V(6).Infof("Successfully invoked GetInfo: %v", info)

    _, err = h.draClient.NodePrepareResources(ctx, &drapb.NodePrepareResourcesRequest{})
What is this after all doing? And what is this signaling when it succeeds?
CC @klueska
Without digging into the code too much, my understanding is that here we basically care about the error: whether a call to NodePrepareResources() is successful or not. A nil error would mean the drivers are ready, the nodes are ready, and everything is in an expected state.
This is consistent with the code in the original PR. If we want to follow-up later we can, but I think this is OK for now.
        klog.ErrorS(err, "failed to call NodePrepareResources")
        return status, nil
    }
    klog.V(6).Info("Successfully invoked NodePrepareResources")
I believe that maybe we don't need to log this every time this succeeds -- I understand that this is what we did upstream, and maybe we want to keep changes minimal. But still, curious about your opinion.
This is at log level v6, so I think it's fine.
We should make sure we track this as part of this PR as well:

(force-pushed 88c290d to 03d7649)

yup, already linked as a follow-up #453 (comment)
    DriverName = "compute-domain.nvidia.com"
    DriverPluginPath = "/var/lib/kubelet/plugins/" + DriverName
    DriverPluginCheckpointFileBasename = "checkpoint.json"
    DriverRegistrarPath = "/var/lib/kubelet/plugins_registry"
This is not needed
cmd/gpu-kubelet-plugin/main.go (Outdated)

    DriverName = "gpu.nvidia.com"
    DriverPluginPath = "/var/lib/kubelet/plugins/" + DriverName
    DriverPluginCheckpointFileBasename = "checkpoint.json"
    DriverRegistrarPath = "/var/lib/kubelet/plugins_registry"
This is not needed
Another quick test after all the changes.
Signed-off-by: Swati Gupta <[email protected]>
(force-pushed 6e8c56d to 9548a36)
klueska left a comment:
I think this is fine for now. It is a minimal diff to the original PR with the only real changes in logging code. We should do a larger refactor / update to our logging strategy to fix these inconsistencies.
With this PR merged I am not able to install the helm chart out of the box:

Backport liveness probe from example-dra-driver. Code taken from kubernetes-sigs/dra-example-driver#104