
Conversation

@guptaNswati
Contributor

Backport liveness probe from example-dra-driver. Code taken from kubernetes-sigs/dra-example-driver#104

@copy-pr-bot

copy-pr-bot bot commented Aug 4, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@copy-pr-bot

copy-pr-bot bot commented Aug 4, 2025

/ok to test

@guptaNswati, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@copy-pr-bot

copy-pr-bot bot commented Aug 4, 2025

/ok to test e6ab222

@guptaNswati, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@guptaNswati
Contributor Author

/ok to test 569c69a

@guptaNswati
Contributor Author

Quick test:

$ describe pod
Restart Count:  0
    Liveness:       grpc <pod>:51518 liveness delay=0s timeout=1s period=10s #success=1 #failure=3
  
$ exec pod
[root@nvidia-dra-driver-gpu-kubelet-plugin-d7h5r /]# grpc_health_probe -addr=:51518
status: SERVING

ToDo: simulate a failure test
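
For reference, a minimal sketch (assuming the standard grpc-go health service; not necessarily the driver's actual wiring) of serving a gRPC health endpoint that a probe like the one above can query. Flipping the status to NOT_SERVING would be one way to simulate a failure:

package main

import (
    "net"

    "google.golang.org/grpc"
    "google.golang.org/grpc/health"
    healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
    // Listen on the healthcheck port (51518 here, matching the probe above).
    lis, err := net.Listen("tcp", ":51518")
    if err != nil {
        panic(err)
    }

    srv := grpc.NewServer()
    hs := health.NewServer() // overall status defaults to SERVING
    healthpb.RegisterHealthServer(srv, hs)

    // To simulate a failing liveness probe, flip the status:
    // hs.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)

    if err := srv.Serve(lis); err != nil {
        panic(err)
    }
}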

@guptaNswati guptaNswati requested a review from klueska August 4, 2025 23:26
@guptaNswati
Contributor Author

cc @nojnhuh

@jgehrcke jgehrcke added this to the v25.8.0 milestone Aug 11, 2025
@klueska
Collaborator

klueska commented Aug 13, 2025

Let's first backport this so that the diff for the changes in this PR is even more minimal:
kubernetes-sigs/dra-example-driver#98

@klueska klueska added the feature issue/PR that proposes a new feature or functionality label Aug 13, 2025
@klueska
Collaborator

klueska commented Aug 13, 2025

That last commit should be its own PR linked to the issue describing it as well as the equivalent PR in the example driver repo.

@guptaNswati
Contributor Author

guptaNswati commented Aug 13, 2025

That last commit should be its own PR linked to the issue describing it as well as the equivalent PR in the example driver repo.

Oh... to keep parity with the example driver? Or will this make it easier to review?

Updated: #464

},
&cli.StringFlag{
Name: "kubelet-registrar-directory-path",
Usage: "Absolute path to the directory where kubelet stores plugin registrations.",
Collaborator

IIUC, this is the directory that the kubelet monitors for unix domain sockets (files) placed by plugins.

So, we should change the text to something like

"Absolute path to the directory that the kubelet monitors for unix domain sockets for plugin registration".

Collaborator

(The kubelet does not store plugin registrations there; I believe that's a wrong concept.)
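
For illustration, a minimal sketch (not the driver's actual code; the socket path mirrors this driver's registrar socket) of a plugin announcing itself by creating a unix domain socket in that directory, which the kubelet then talks to via the pluginregistration.Registration service:

package main

import (
    "fmt"
    "net"
    "path/filepath"
)

func main() {
    // The kubelet watches this directory for plugin registration sockets.
    sock := filepath.Join("/var/lib/kubelet/plugins_registry", "compute-domain.nvidia.com-reg.sock")

    lis, err := net.Listen("unix", sock)
    if err != nil {
        panic(err)
    }
    defer lis.Close()

    fmt.Println("registration socket created at", sock)
    // A gRPC server implementing pluginregistration.Registration (GetInfo,
    // NotifyRegistrationStatus) would be served on lis here; omitted in this sketch.
}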

resources: {}
# Port running a gRPC health service checked by a livenessProbe.
# Set to a negative value to disable the service and the probe.
healthcheckPort: 51518
Collaborator

What can we say about this port number and whether it could ever conflict with something else?

The kubelet plugin does not use hostNetwork: true -- in that sense, it probably has all ports to itself. Is that right? (I am speculating here; one of us should understand this deeply! :)

If there is no conflict, ever, we can pick a different number. One that looks less magic.

Contributor Author

Honestly, I didn't think much about it. Based on some reading: there are 65k+ ports available, and this number is arbitrary enough that it's rare to conflict with the usual port numbers like the 5000s or 8080s (standard health and monitoring ports). And like you said, without a shared host network, each pod has all 65k+ ports to itself, so I cannot imagine anything consuming them all, or this specific one.

Upstream sets it to healthcheckPort: 51515. When I don't know better, deviating from upstream is not wanted :)

Collaborator

Why not just leave it as 51515 and then make the other one 51516 to keep the diffs as small as possible?

resources:
{{- toYaml .Values.kubeletPlugin.containers.computeDomains.resources | nindent 10 }}
{{/*
A literal "0" will allocate a random port. Don't configure the probe
Collaborator

Hm. When instantiating a TCP listener, a literal 0 will lead to a system-provided port -- yes.

Here, we specify where to connect to (if I am not mistaken), so specifying zero probably never quite makes sense. Happy to learn where I am off!

Contributor Author

Port to start a gRPC healthcheck service. When positive, a literal port number. When zero, a random port is allocated. When negative, the healthcheck service is disabled

That's how you can configure it in Helm. This is the default: 51515.

Collaborator

The code below has this check:

{{- if (gt (int .Values.kubeletPlugin.containers.plugin.healthcheckPort) 0) }}

So if one was to specify 0 in the values.yaml file, then the section to specify the port below will be omitted.
Essentially not running a liveness probe at all.

A better implementation would probably have been to put some code in validation.yaml to error out if healthcheckPort == 0 rather than silently omitting the liveness probe altogether.

Let's leave that for a follow-up so that this code continues to follow the original PR from the example driver as closely as possible.

},
&cli.IntFlag{
Name: "healthcheck-port",
Usage: "Port to start a gRPC healthcheck service. When positive, a literal port number. When zero, a random port is allocated. When negative, the healthcheck service is disabled.",
Collaborator

a random port

It's not quite random. The operating system yields the next free one.

(for your/our understanding, I don't ask to change the wording -- if you want to change the wording however, please go ahead :)).

Contributor Author

You don't always know what the next available port is without looking, so for a general user it is random :-D

Collaborator

I think the technically correct word here would be "arbitrary", not "random", but that's a bit pedantic. We can revisit this as a follow-up since this is an exact copy of the original PR and I'd like to keep the diffs as small as possible to ensure correctness.
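
To illustrate the semantics being discussed (this is a sketch, not the driver's code): a negative value disables the service, zero lets the operating system hand out the next free port (arbitrary rather than random), and a positive value is used literally.

package main

import (
    "fmt"
    "net"
)

// healthListener maps the healthcheck-port semantics onto a TCP listener.
func healthListener(port int) (net.Listener, error) {
    if port < 0 {
        return nil, nil // healthcheck service disabled
    }
    // ":0" asks the kernel for the next free ephemeral port.
    return net.Listen("tcp", fmt.Sprintf(":%d", port))
}

func main() {
    lis, err := healthListener(0)
    if err != nil {
        panic(err)
    }
    defer lis.Close()
    fmt.Println("OS-assigned port:", lis.Addr().(*net.TCPAddr).Port)
}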

}
klog.V(6).Infof("Successfully invoked GetInfo: %v", info)

_, err = h.draClient.NodePrepareResources(ctx, &drapb.NodePrepareResourcesRequest{})
Collaborator

What is this doing, after all? And what does it signal when it succeeds?
CC @klueska

Contributor Author

Without digging into the code too much, my understanding is that here we basically care about the error: whether a call to NodePrepareResources() is successful or not. A nil error would mean the driver is ready, the node is ready, and everything is in the expected state.

Collaborator

This is consistent with the code in the original PR. If we want to follow up later we can, but I think this is OK for now.
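
For context, a hedged sketch of the kind of check being discussed: dial the plugin's unix socket and issue an empty NodePrepareResources call, caring only about whether it returns an error. The drapb import path and NewDRAPluginClient constructor are assumptions (based on the kubelet DRA v1beta1 API), and the socket path is illustrative.

package main

import (
    "context"
    "fmt"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    drapb "k8s.io/kubelet/pkg/apis/dra/v1beta1" // assumed import path
)

func checkPluginResponsive(ctx context.Context, sock string) error {
    conn, err := grpc.Dial("unix://"+sock,
        grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        return err
    }
    defer conn.Close()

    client := drapb.NewDRAPluginClient(conn) // assumed constructor name
    // Empty request: we only care whether the call errors out, which tells
    // us the plugin's DRA gRPC service is up and able to answer.
    _, err = client.NodePrepareResources(ctx, &drapb.NodePrepareResourcesRequest{})
    return err
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), time.Second)
    defer cancel()
    err := checkPluginResponsive(ctx, "/var/lib/kubelet/plugins/compute-domain.nvidia.com/dra.sock")
    fmt.Println("responsive:", err == nil, "err:", err)
}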

klog.ErrorS(err, "failed to call NodePrepareResources")
return status, nil
}
klog.V(6).Info("Successfully invoked NodePrepareResources")
Collaborator

I believe that maybe we don't need to log this every time it succeeds -- I understand that this is what we did upstream, and maybe we want to keep changes minimal. But still, I'm curious about your opinion.

Collaborator

This is at log level v6, so I think it's fine.

@klueska
Collaborator

klueska commented Aug 19, 2025

We should make sure we track this as part of this PR as well:
kubernetes-sigs/dra-example-driver#107

@guptaNswati
Contributor Author

We should make sure we track this as part of this PR as well: kubernetes-sigs/dra-example-driver#107

Yup, already linked as a follow-up: #453 (comment)

DriverName = "compute-domain.nvidia.com"
DriverPluginPath = "/var/lib/kubelet/plugins/" + DriverName
DriverPluginCheckpointFileBasename = "checkpoint.json"
DriverRegistrarPath = "/var/lib/kubelet/plugins_registry"
Collaborator

This is not needed

DriverName = "gpu.nvidia.com"
DriverPluginPath = "/var/lib/kubelet/plugins/" + DriverName
DriverPluginCheckpointFileBasename = "checkpoint.json"
DriverRegistrarPath = "/var/lib/kubelet/plugins_registry"
Collaborator

This is not needed

@guptaNswati
Contributor Author

Another quick test after all the changes.

$ kubectl logs nvidia-dra-driver-gpu-kubelet-plugin-jwr4m   -n nvidia-dra-driver-gpu -c compute-domains | grep sock
I0821 23:52:18.869366       1 nonblockinggrpcserver.go:90] "GRPC server started" logger="dra" endpoint="/var/lib/kubelet/plugins/compute-domain.nvidia.com/dra.sock"
I0821 23:52:18.869468       1 nonblockinggrpcserver.go:90] "GRPC server started" logger="registrar" endpoint="/var/lib/kubelet/plugins_registry/compute-domain.nvidia.com-reg.sock"
I0821 23:52:19.138501       1 nonblockinggrpcserver.go:161] "handling request succeeded" logger="registrar" requestID=1 method="/pluginregistration.Registration/GetInfo" response="type:\"DRAPlugin\" name:\"compute-domain.nvidia.com\" endpoint:\"/var/lib/kubelet/plugins/compute-domain.nvidia.com/dra.sock\" supported_versions:\"v1.DRAPlugin\" supported_versions:\"v1beta1.DRAPlugin\""

$ grpcurl -plaintext   -import-path /tmp   -proto /tmp/kubelet_pluginregistration.proto   unix:///var/lib/kubelet/plugins_registry/gpu.nvidia.com-reg.sock   pluginregistration.Registration/GetInfo
{
  "type": "DRAPlugin",
  "name": "gpu.nvidia.com",
  "endpoint": "/var/lib/kubelet/plugins/gpu.nvidia.com/dra.sock",
  "supportedVersions": [
    "v1.DRAPlugin",
    "v1beta1.DRAPlugin"
  ]
}


Collaborator

@klueska klueska left a comment

I think this is fine for now. It is a minimal diff to the original PR with the only real changes in logging code. We should do a larger refactor / update to our logging strategy to fix these inconsistencies.

@guptaNswati guptaNswati merged commit d292c81 into NVIDIA:main Aug 22, 2025
7 checks passed
@klueska
Collaborator

klueska commented Aug 25, 2025

With this PR merged I am not able to install the helm chart out of the box:

$ helm install nvidia-dra-driver-gpu deployments/helm/nvidia-dra-driver-gpu \
    --create-namespace \
    --namespace nvidia-dra-driver-gpu \
    --set nvidiaDriverRoot=/run/nvidia/driver \
    --set gpuResourcesEnabledOverride=true \
    --set controller.containers.computeDomain.env[0].name=ADDITIONAL_NAMESPACES \
    --set controller.containers.computeDomain.env[0].value=gpu-operator \
    --set webhook.replicas=2 \
    --set webhook.enabled=true
Error: INSTALLATION FAILED: template: nvidia-dra-driver-gpu/templates/kubeletplugin.yaml:110:31: executing "nvidia-dra-driver-gpu/templates/kubeletplugin.yaml" at <.Values.kubeletPlugin.containers.plugin.healthcheckPort>: nil pointer evaluating interface {}.healthcheckPort

@klueska klueska moved this from Backlog to Closed in Planning Board: k8s-dra-driver-gpu Sep 23, 2025