
Conversation

@guptaNswati
Contributor

Backport liveness probe from example-dra-driver. Code taken from kubernetes-sigs/dra-example-driver#104

@copy-pr-bot

copy-pr-bot bot commented Aug 4, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@copy-pr-bot

copy-pr-bot bot commented Aug 4, 2025

/ok to test

@guptaNswati, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@copy-pr-bot

copy-pr-bot bot commented Aug 4, 2025

/ok to test e6ab222

@guptaNswati, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@guptaNswati
Contributor Author

/ok to test 569c69a

@guptaNswati
Contributor Author

Quick test:

$ describe pod
Restart Count:  0
    Liveness:       grpc <pod>:51518 liveness delay=0s timeout=1s period=10s #success=1 #failure=3
  
$ exec pod
[root@nvidia-dra-driver-gpu-kubelet-plugin-d7h5r /]# grpc_health_probe -addr=:51518
status: SERVING

ToDo: simulate a failure test
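
For reference, a minimal sketch (assuming the standard grpc-go health service; not necessarily the driver's actual wiring) of serving a gRPC health endpoint that a probe like the one above can query. Flipping the status to NOT_SERVING would be one way to simulate a failure:

package main

import (
    "net"

    "google.golang.org/grpc"
    "google.golang.org/grpc/health"
    healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
    // Listen on the healthcheck port (51518 here, matching the probe above).
    lis, err := net.Listen("tcp", ":51518")
    if err != nil {
        panic(err)
    }

    srv := grpc.NewServer()
    hs := health.NewServer() // overall status defaults to SERVING
    healthpb.RegisterHealthServer(srv, hs)

    // To simulate a failing liveness probe, flip the status:
    // hs.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)

    if err := srv.Serve(lis); err != nil {
        panic(err)
    }
}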

@guptaNswati guptaNswati requested a review from klueska August 4, 2025 23:26
@guptaNswati
Contributor Author

cc @nojnhuh

@jgehrcke jgehrcke added this to the v25.8.0 milestone Aug 11, 2025
@klueska
Collaborator

klueska commented Aug 13, 2025

Let's first backport this so that the diff for the changes in this PR is even more minimal:
kubernetes-sigs/dra-example-driver#98

@klueska klueska added the feature issue/PR that proposes a new feature or functionality label Aug 13, 2025
@klueska
Collaborator

klueska commented Aug 13, 2025

That last commit should be its own PR linked to the issue describing it as well as the equivalent PR in the example driver repo.

@guptaNswati
Contributor Author

guptaNswati commented Aug 13, 2025

That last commit should be its own PR linked to the issue describing it as well as the equivalent PR in the example driver repo.

Oh... to keep parity with the example driver? Or will this make it easier to review?

Updated: #464

},
&cli.StringFlag{
Name: "kubelet-registrar-directory-path",
Usage: "Absolute path to the directory where kubelet stores plugin registrations.",
Collaborator

IIUC, this is the directory that the kubelet monitors for unix domain sockets (files) placed by plugins.

So, we should change the text to something like

"Absolute path to the directory that the kubelet monitors for unix domain sockets for plugin registration".

Collaborator

(The kubelet does not store plugin registrations there; I believe that's a wrong concept.)
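
For illustration, a minimal sketch (not the driver's actual code; the socket path mirrors this driver's registrar socket) of a plugin announcing itself by creating a unix domain socket in that directory, which the kubelet then talks to via the pluginregistration.Registration service:

package main

import (
    "fmt"
    "net"
    "path/filepath"
)

func main() {
    // The kubelet watches this directory for plugin registration sockets.
    sock := filepath.Join("/var/lib/kubelet/plugins_registry", "compute-domain.nvidia.com-reg.sock")

    lis, err := net.Listen("unix", sock)
    if err != nil {
        panic(err)
    }
    defer lis.Close()

    fmt.Println("registration socket created at", sock)
    // A gRPC server implementing pluginregistration.Registration (GetInfo,
    // NotifyRegistrationStatus) would be served on lis here; omitted in this sketch.
}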

resources: {}
# Port running a gRPC health service checked by a livenessProbe.
# Set to a negative value to disable the service and the probe.
healthcheckPort: 51518
Collaborator

What can we say about this port number and whether it could ever conflict with something else?

The kubelet plugin does not use hostNetwork: true -- in that sense, it probably has all ports to itself. Is that right? (I am speculating here; one of us should understand this deeply! :)

If there is no conflict, ever, we can pick a different number. One that looks less magic.

Contributor Author

Honestly, I didn't think much about it. Based on some reading: there are 65k+ ports available, and this number is arbitrary enough that it's rare to conflict with the usual port numbers like the 5000s or 8080s (standard health and monitoring ports). And like you said, without a shared host network, each pod has all 65k+ ports to itself, so I cannot imagine anything consuming them all, or this specific one.

Upstream sets it to healthcheckPort: 51515. When I don't know better, deviating from upstream is not wanted :)

Collaborator

Why not just leave it as 51515 and then make the other one 51516 to keep the diffs as small as possible?

resources:
{{- toYaml .Values.kubeletPlugin.containers.computeDomains.resources | nindent 10 }}
{{/*
A literal "0" will allocate a random port. Don't configure the probe
Collaborator

Hm. When instantiating a TCP listener, a literal 0 will lead to a system-provided port -- yes.

Here, we specify where to connect to (if I am not mistaken), so specifying zero probably never quite makes sense. Happy to learn where I am off!

Contributor Author

Port to start a gRPC healthcheck service. When positive, a literal port number. When zero, a random port is allocated. When negative, the healthcheck service is disabled

That's how you can configure it in Helm. This is the default: 51515.

Collaborator

The code below has this check:

{{- if (gt (int .Values.kubeletPlugin.containers.plugin.healthcheckPort) 0) }}

So if one was to specify 0 in the values.yaml file, then the section to specify the port below will be omitted.
Essentially not running a liveness probe at all.

A better implementation would probably have been to put some code in validation.yaml to error out if healthcheckPort == 0 rather than silently omitting the liveness probe altogether.

Let's leave that for a follow-up so that this code continues to follow the original PR from the example driver as closely as possible.

},
&cli.IntFlag{
Name: "healthcheck-port",
Usage: "Port to start a gRPC healthcheck service. When positive, a literal port number. When zero, a random port is allocated. When negative, the healthcheck service is disabled.",
Collaborator

a random port

It's not quite random. The operating system yields the next free one.

(for your/our understanding, I don't ask to change the wording -- if you want to change the wording however, please go ahead :)).

Contributor Author

You don't always know what the next available port is without looking, so for a general user it is random :-D

Collaborator

I think the technically correct word here would be "arbitrary", not "random", but that's a bit pedantic. We can revisit this as a follow-up since this is an exact copy of the original PR and I'd like to keep the diffs as small as possible to ensure correctness.
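
To illustrate the semantics being discussed (this is a sketch, not the driver's code): a negative value disables the service, zero lets the operating system hand out the next free port (arbitrary rather than random), and a positive value is used literally.

package main

import (
    "fmt"
    "net"
)

// healthListener maps the healthcheck-port semantics onto a TCP listener.
func healthListener(port int) (net.Listener, error) {
    if port < 0 {
        return nil, nil // healthcheck service disabled
    }
    // ":0" asks the kernel for the next free ephemeral port.
    return net.Listen("tcp", fmt.Sprintf(":%d", port))
}

func main() {
    lis, err := healthListener(0)
    if err != nil {
        panic(err)
    }
    defer lis.Close()
    fmt.Println("OS-assigned port:", lis.Addr().(*net.TCPAddr).Port)
}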

}
klog.V(6).Infof("Successfully invoked GetInfo: %v", info)

_, err = h.draClient.NodePrepareResources(ctx, &drapb.NodePrepareResourcesRequest{})
Collaborator

What is this doing, after all? And what does it signal when it succeeds?
CC @klueska

Contributor Author

Without digging into the code too much, my understanding is that here we basically care about the error: whether a call to NodePrepareResources() is successful or not. A nil error would mean the driver is ready, the node is ready, and everything is in the expected state.

Collaborator

This is consistent with the code in the original PR. If we want to follow up later we can, but I think this is OK for now.
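
For context, a hedged sketch of the kind of check being discussed: dial the plugin's unix socket and issue an empty NodePrepareResources call, caring only about whether it returns an error. The drapb import path and NewDRAPluginClient constructor are assumptions (based on the kubelet DRA v1beta1 API), and the socket path is illustrative.

package main

import (
    "context"
    "fmt"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    drapb "k8s.io/kubelet/pkg/apis/dra/v1beta1" // assumed import path
)

func checkPluginResponsive(ctx context.Context, sock string) error {
    conn, err := grpc.Dial("unix://"+sock,
        grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        return err
    }
    defer conn.Close()

    client := drapb.NewDRAPluginClient(conn) // assumed constructor name
    // Empty request: we only care whether the call errors out, which tells
    // us the plugin's DRA gRPC service is up and able to answer.
    _, err = client.NodePrepareResources(ctx, &drapb.NodePrepareResourcesRequest{})
    return err
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), time.Second)
    defer cancel()
    err := checkPluginResponsive(ctx, "/var/lib/kubelet/plugins/compute-domain.nvidia.com/dra.sock")
    fmt.Println("responsive:", err == nil, "err:", err)
}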

klog.ErrorS(err, "failed to call NodePrepareResources")
return status, nil
}
klog.V(6).Info("Successfully invoked NodePrepareResources")
Collaborator

I believe that maybe we don't need to log this every time it succeeds -- I understand that this is what we did upstream, and maybe we want to keep changes minimal. But still, I'm curious about your opinion.

Collaborator

This is at log level v6, so I think it's fine.

@klueska
Collaborator

klueska commented Aug 19, 2025

We should make sure we track this as part of this PR as well:
kubernetes-sigs/dra-example-driver#107

@guptaNswati
Contributor Author

We should make sure we track this as part of this PR as well: kubernetes-sigs/dra-example-driver#107

Yup, already linked as a follow-up: #453 (comment)

DriverName = "compute-domain.nvidia.com"
DriverPluginPath = "/var/lib/kubelet/plugins/" + DriverName
DriverPluginCheckpointFileBasename = "checkpoint.json"
DriverRegistrarPath = "/var/lib/kubelet/plugins_registry"
Collaborator

This is not needed

DriverName = "gpu.nvidia.com"
DriverPluginPath = "/var/lib/kubelet/plugins/" + DriverName
DriverPluginCheckpointFileBasename = "checkpoint.json"
DriverRegistrarPath = "/var/lib/kubelet/plugins_registry"
Collaborator

This is not needed

@guptaNswati
Contributor Author

Another quick test after all the changes.

$ kubectl logs nvidia-dra-driver-gpu-kubelet-plugin-jwr4m   -n nvidia-dra-driver-gpu -c compute-domains | grep sock
I0821 23:52:18.869366       1 nonblockinggrpcserver.go:90] "GRPC server started" logger="dra" endpoint="/var/lib/kubelet/plugins/compute-domain.nvidia.com/dra.sock"
I0821 23:52:18.869468       1 nonblockinggrpcserver.go:90] "GRPC server started" logger="registrar" endpoint="/var/lib/kubelet/plugins_registry/compute-domain.nvidia.com-reg.sock"
I0821 23:52:19.138501       1 nonblockinggrpcserver.go:161] "handling request succeeded" logger="registrar" requestID=1 method="/pluginregistration.Registration/GetInfo" response="type:\"DRAPlugin\" name:\"compute-domain.nvidia.com\" endpoint:\"/var/lib/kubelet/plugins/compute-domain.nvidia.com/dra.sock\" supported_versions:\"v1.DRAPlugin\" supported_versions:\"v1beta1.DRAPlugin\""

$ grpcurl -plaintext   -import-path /tmp   -proto /tmp/kubelet_pluginregistration.proto   unix:///var/lib/kubelet/plugins_registry/gpu.nvidia.com-reg.sock   pluginregistration.Registration/GetInfo
{
  "type": "DRAPlugin",
  "name": "gpu.nvidia.com",
  "endpoint": "/var/lib/kubelet/plugins/gpu.nvidia.com/dra.sock",
  "supportedVersions": [
    "v1.DRAPlugin",
    "v1beta1.DRAPlugin"
  ]
}


Collaborator

@klueska klueska left a comment

I think this is fine for now. It is a minimal diff to the original PR with the only real changes in logging code. We should do a larger refactor / update to our logging strategy to fix these inconsistencies.

@guptaNswati guptaNswati merged commit d292c81 into NVIDIA:main Aug 22, 2025
7 checks passed
@klueska
Collaborator

klueska commented Aug 25, 2025

With this PR merged I am not able to install the helm chart out of the box:

$ helm install nvidia-dra-driver-gpu deployments/helm/nvidia-dra-driver-gpu \
    --create-namespace \
    --namespace nvidia-dra-driver-gpu \
    --set nvidiaDriverRoot=/run/nvidia/driver \
    --set gpuResourcesEnabledOverride=true \
    --set controller.containers.computeDomain.env[0].name=ADDITIONAL_NAMESPACES \
    --set controller.containers.computeDomain.env[0].value=gpu-operator \
    --set webhook.replicas=2 \
    --set webhook.enabled=true
Error: INSTALLATION FAILED: template: nvidia-dra-driver-gpu/templates/kubeletplugin.yaml:110:31: executing "nvidia-dra-driver-gpu/templates/kubeletplugin.yaml" at <.Values.kubeletPlugin.containers.plugin.healthcheckPort>: nil pointer evaluating interface {}.healthcheckPort

@klueska klueska moved this from Backlog to Closed in Planning Board: k8s-dra-driver-gpu Sep 23, 2025