Use per-component matchLabels selector in pod specs (fix #355) #359

jgehrcke · 2025-05-14T13:39:59Z

A pragmatic proposal for addressing #355.

Consider a scenario with one controller and four plugin pods:

$ kubectl get pods -A | grep dra-driver
nvidia-dra-driver-gpu-6894   gpu-operator-node-feature-discovery-prune-fcxrx               0/1     Completed   0             57d
nvidia-dra-driver-gpu        nvidia-dra-driver-gpu-controller-9f5846bdb-bh2vq              1/1     Running     0             3h45m
nvidia-dra-driver-gpu        nvidia-dra-driver-gpu-kubelet-plugin-dk7hp                    1/1     Running     0             3h45m
nvidia-dra-driver-gpu        nvidia-dra-driver-gpu-kubelet-plugin-lmhhn                    1/1     Running     0             3h45m
nvidia-dra-driver-gpu        nvidia-dra-driver-gpu-kubelet-plugin-mvmc7                    1/1     Running     0             3h45m
nvidia-dra-driver-gpu        nvidia-dra-driver-gpu-kubelet-plugin-t2ddp                    1/1     Running     0             3h45m

Without the patch, the following type/name selector for the controller yields all pods, and we after all do not see controller logs:

$ kubectl logs -n nvidia-dra-driver-gpu deploy/nvidia-dra-driver-gpu-controller 
Found 5 pods, using pod/nvidia-dra-driver-gpu-kubelet-plugin-npxss
...

With the patch, the same command yields only controller logs, as expected:

$ kubectl logs kubectl logs -n nvidia-dra-driver-gpu deploy/nvidia-dra-driver-gpu-controller
I0514 09:42:42.216011       1 envvar.go:172] "Feature gate default state" feature="WatchListClient" enabled=false
I0514 09:42:42.216127       1 envvar.go:172] "Feature gate default state" feature="ClientsAllowCBOR" enabled=false
...

With the patch, the following command refers to just the plugin pods (as expected):

$ kubectl logs ds/nvidia-dra-driver-gpu-kubelet-plugin -n nvidia-dra-driver-gpu | head -n2
Found 4 pods, using pod/nvidia-dra-driver-gpu-kubelet-plugin-dk7hp
I0514 09:42:42.104011       1 envvar.go:172] "Feature gate default state" feature="InOrderInformers" enabled=true

(without the patch, this may erroneously yield controller logs as shown in #355).

Before the change, these are the labels for the controller selector:

$ kubectl describe deployment -n nvidia-dra-driver-gpu   nvidia-dra-driver-gpu-controller        
...
Selector:               app.kubernetes.io/instance=nvidia-dra-driver-gpu-1747137149,app.kubernetes.io/name=nvidia-dra-driver-gpu
...

I would love for others to chime in, but as far as I now understand the purpose of this selector the current value is wrong as it resolves to all pods deployed by the NVIDIA DRA Driver for GPUs. Instead, we want it to resolve to just the controller pods.

After the change:

$ kubectl describe deployment -n nvidia-dra-driver-gpu   nvidia-dra-driver-gpu-controller| grep Selector
Selector:               nvidia-dra-driver-gpu-component=controller

CC @nojnhuh

copy-pr-bot · 2025-05-14T13:40:03Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

elezar · 2025-05-14T14:12:25Z

deployments/helm/nvidia-dra-driver-gpu/templates/controller.yaml

        {{- toYaml . | nindent 8 }}
      {{- end }}
      labels:
        {{- include "nvidia-dra-driver-gpu.templateLabels" . | nindent 8 }}


Note that templateLabels still includes selectorLabels. Is this intended now that selectorLabels have been removed from selector.matchLabels above?

Note that templateLabels still includes selectorLabels

Yes. That's fine.

I wanted to be minimally invasive. To me, that first and foremost meant that I did not want to remove any labels.

~~In the longer run, I want us to question the purpose of "selectorLabels" defined in our helm chart. What was the premise behind it? I assume it may be ill-posed in our case here~~

edit: okay, changed to using the selectorLabels approach again (now it's a dynamic label)

another update:

Note that templateLabels still includes selectorLabels

not anymore, see commentary.

Thanks: I think making selectorLabels more specific (and a reduced set is correct).

klueska · 2025-05-14T14:21:13Z

I'd rather see something that still used the selectorLabels / templateLabels, but maybe it takes an optional input parameter to the template function where we can insert a component-type into all the labels.

jgehrcke · 2025-05-14T15:20:21Z

maybe it takes an optional input parameter to the template function where we can insert a component-type into all the labels

Can look into that. Are you ok with the rendered outcome, though? That is, with nvidia-dra-driver-gpu-component: controller and nvidia-dra-driver-gpu-component: kubelet-plugin.

jgehrcke · 2025-05-14T16:38:02Z

Changed the approach to make the selectorLabels() function consume the component name. Tested:

nvidia@SC-Starwars-MAB5-B00:k8s-dra-driver-gpu$ kubectl logs deploy/nvidia-dra-driver-gpu-controller -n nvidia-dra-driver-gpu | head -n2
I0514 16:32:56.234618       1 envvar.go:172] "Feature gate default state" feature="WatchListClient" enabled=false
I0514 16:32:56.234668       1 envvar.go:172] "Feature gate default state" feature="ClientsAllowCBOR" enabled=false
nvidia@SC-Starwars-MAB5-B00:k8s-dra-driver-gpu$ kubectl logs -n nvidia-dra-driver-gpu ds/nvidia-dra-driver-gpu-kubelet-plugin | head -n2
Found 4 pods, using pod/nvidia-dra-driver-gpu-kubelet-plugin-fjw78
I0514 16:32:56.126250       1 envvar.go:172] "Feature gate default state" feature="InformerResourceVersion" enabled=false
I0514 16:32:56.126366       1 envvar.go:172] "Feature gate default state" feature="InOrderInformers" enabled=true

jgehrcke · 2025-05-14T16:39:11Z

deployments/helm/nvidia-dra-driver-gpu/templates/_helpers.tpl

-{{- if .Values.selectorLabelsOverride -}}
-{{ toYaml .Values.selectorLabelsOverride }}
-{{- else -}}
-{{ include "nvidia-dra-driver-gpu.templateLabels" . }}


I don't think we need the template labels in the selector labels.

The current logic will write something like nvidia-dra-driver-gpu-component=controller into the matchLabels selector. That is precise, and not conflicting (unless one wants it to conflict with something).

Note that the nvidia-dra-driver-gpu.templateLabels helper above also pulls in Values.selectorLabelsOverride. Here we only include the templateLabels if the override is not set -- in which case the templateLabels will NOT include the selector labels. This does seem recursive / convoluted though.

Thanks for looking deeply. Very appreciated.

I will review this more carefully tomorrow.

I want to understand use cases that we want to support, and then remove complexity that is not currently needed.

(this nifty setup will have to be documented better so that it can survive if it must survive!)

Based on what I understand today, I don't think that template labels and selector labels should be entangled.

I think template labels, as far as I understood, are just labels added to any and all component deployed by this Helm chart. That looks good, and keeps things more or less organized.

Selector labels, on the other hand, have a deep technical (and internal, for hat matter) meaning. I am not sure if we even want to let people override those.

Will think about it more tomorrow.

I went into history. NVIDIA/k8s-device-plugin@4820ac1#r157221509.

Note that the nvidia-dra-driver-gpu.templateLabels helper above also pulls in Values.selectorLabelsOverride

I think it's better to change that then, here. Will do.

It's already a challenge for me to put into English words what a good distinction is between "common labels" and "template labels". I will work on that.

For me, it's easier to see a clear distinction between [common labels, template labels] and [selector labels]. I want to reinforce that distinction.

Added a new commit with commentary declaring intent. Not perfect, but I think it's important for us to express the system that we want to use here.

jgehrcke · 2025-05-14T16:41:37Z

deployments/helm/nvidia-dra-driver-gpu/templates/_helpers.tpl

+    nvidia-dra-driver-gpu-component: {{ .componentName }}
+    {{- end }}
+  {{- else -}}
+  fail "selectorLabels: both arguments are required: context, componentName"


I adopted this approach from some blog post: named and required function arguments, with a more or less good error message upon not providing arguments.

Note that the outer chart context in the function is now not . anymore, but .context.

Don't we love learning DSLs...and their funkiness :)

refs:
https://blog.palark.com/advanced-helm-templating/
https://jiminbyun.medium.com/understanding-the-chart-context-in-helm-templates-0524b520b120

Just to clarify: The fail here just checks that we're "invoking" this helper correctly? Meaning we need to set both the context and the component name where we include it in some other helm template.

Yes! It's like an assert statement.

jgehrcke · 2025-05-14T16:46:45Z

but maybe it takes an optional input parameter

I made it a required parameter. Because not encoding that component-specific aspect led to the previous over-selection of pods.

Working in this Helm templating logic feels rather brittle. Let's land an improvement, and then iterate later.

The public-facing aspects here of course we need to more or less settle on (the label key/value names, which people may end up using in their kubectl commands, and we don't want to break those too often :-)).

elezar · 2025-05-15T12:11:26Z

deployments/helm/nvidia-dra-driver-gpu/templates/_helpers.tpl

-{{- end }}
+  {{- if and (hasKey . "componentName") (hasKey . "context") -}}
+    {{- if .context.Values.selectorLabelsOverride -}}
+    {{ toYaml .context.Values.selectorLabelsOverride }}


Question: Since this is indented, is that going to mess with indentation in the rendered template?

Thanks for the question.

I still don't know exactly, but based on a little bit more reading I think you might have the right hunch here and I will add a commit to dedent.

Most templating languages bring this type of challenge. They either respect or ignore the indentation in the source code. In Jinja, for example, it's ignored (so you can indent the source code to make it more readable w/o having to worry about breaking the output). That's not always the perfect choice either (which reminds me of pallets/jinja#178 (comment) from 2013).

I obviously indented for making things slightly more readable. And I could test that manually. Better: we need a test suite for our Helm chart that not only renders, but also deploys for different input parameters. Otherwise we cannot make such changes with confidence.

This question brought me to https://www.reddit.com/r/kubernetes/comments/1be15b8/helm_templating_language_was_a_mistake_90_of_helm/ which has this statement in the title:

90%+ of helm chart commits are bug fixes for indent

The discussion also brought me to https://leebriggs.co.uk/blog/2019/02/07/why-are-we-templating-yaml in which the author says

Needless to say, this isn’t what I want to spend my time doing. If fiddling around with whitespace requirements in a templating system doing something it’s not really designed for is what suits you, then I’m not going to stop you.

For the record, I had great success working with jsonnet in the past in the context of https://github.com/prometheus-operator/kube-prometheus.

Btw, the official Helm templating docs are wishywashy on the topic:

(https://helm.sh/docs/chart_best_practices/templates/)

By the way: Not sure whether the following is useful: https://github.com/NVIDIA/k8s-device-plugin/blob/main/tests/helm/helm_template_test.go

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

jgehrcke · 2025-05-16T11:32:09Z

deployments/helm/nvidia-dra-driver-gpu/templates/_helpers.tpl

 app.kubernetes.io/instance: {{ .Release.Name }}
-{{- if .Values.selectorLabelsOverride }}
-{{ toYaml .Values.selectorLabelsOverride }}
-{{- end }}


Let's keep the selector label (single label, I propose) strictly independent.

"Template labels" I believe always meant "pod template labels" and I think it really always meant "apply a common set of labels to all pods of this Helm chart".

By that nature, this set of labels cannot distinguish individual groups of pods within the Helm chart.

Which is what "selector labels" want to do.

In that sense, those two concepts (template labels and selector labels) are generally incompatible. Which goes unnoticed for the special case of there being just one component (one relevant group of pods) in the Helm chart.

jgehrcke · 2025-05-16T11:36:45Z

deployments/helm/nvidia-dra-driver-gpu/templates/controller.yaml

  selector:
    matchLabels:
-      {{- include "nvidia-dra-driver-gpu.selectorLabels" . | nindent 6 }}
+      {{- include "nvidia-dra-driver-gpu.selectorLabels" (dict "context" . "componentName" "controller") | nindent 6 }}


By the way, I have now learned that this left-chomping-nindent-technique is the holy grail of Helm templates: helm/helm#3854 (comment)

jgehrcke · 2025-05-16T11:41:56Z

deployments/helm/nvidia-dra-driver-gpu/templates/_helpers.tpl

 {{- define "nvidia-dra-driver-gpu.labels" -}}
 helm.sh/chart: {{ include "nvidia-dra-driver-gpu.chart" . }}
-{{ include "nvidia-dra-driver-gpu.templateLabels" . }}
-{{- if .Chart.AppVersion }}


we always define that (and if we don't, I want to know about it)

jgehrcke · 2025-05-16T11:42:51Z

deployments/helm/nvidia-dra-driver-gpu/templates/_helpers.tpl

 {{- end }}

 {{/*
-Template labels


Slightly allergic to a thing named templateLabels being documented with "Template labels" :)

elezar · 2025-05-16T11:49:44Z

deployments/helm/nvidia-dra-driver-gpu/templates/_helpers.tpl

+could be many labels, but we want to use just one (with a sufficiently
+unique key).
+
+TOOD: remove the override feature, or make the override work per-component.


Should we remove the override before GA? @klueska do we know of anyone with SPECIFIC selector labels?

elezar

Thanks @jgehrcke.

I think this looks good from my side. Note that I don't have all the context required for all possible label combinations, but this PR does clean up the selector labels specifically. It also decouples these from the other sets of labels that are defined in the helpers which is a cleaner solution to start with.

jgehrcke · 2025-05-16T14:22:36Z

Thanks for all the good feedback here. I am sure we want to do more work on this front (or even have to).

jgehrcke mentioned this pull request May 14, 2025

Pod selectors for kubelet plugin DaemonSet and controller Deployment are identical #355

Closed

elezar reviewed May 14, 2025

View reviewed changes

jgehrcke force-pushed the jp/better-selectors branch from b7c1b58 to 6a203b9 Compare May 14, 2025 16:36

jgehrcke commented May 14, 2025

View reviewed changes

jgehrcke force-pushed the jp/better-selectors branch from b7e0091 to 3042b5a Compare May 14, 2025 16:44

elezar reviewed May 15, 2025

View reviewed changes

jgehrcke force-pushed the jp/better-selectors branch from d5af042 to 83f6677 Compare May 16, 2025 11:11

Use per-component matchLabels selector in pod specs (fix NVIDIA#355)

5f6c7a5

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

jgehrcke force-pushed the jp/better-selectors branch from 83f6677 to 5f6c7a5 Compare May 16, 2025 11:16

jgehrcke commented May 16, 2025

View reviewed changes

elezar reviewed May 16, 2025

View reviewed changes

elezar approved these changes May 16, 2025

View reviewed changes

jgehrcke merged commit dc61b93 into NVIDIA:main May 16, 2025
7 checks passed

klueska added this to Planning Board: k8s-dra-driver-gpu Jun 16, 2025

github-project-automation bot moved this to Backlog in Planning Board: k8s-dra-driver-gpu Jun 16, 2025

klueska added this to the v25.3.0 milestone Aug 13, 2025

Use per-component matchLabels selector in pod specs (fix #355) #359

Use per-component matchLabels selector in pod specs (fix #355) #359

Uh oh!

Conversation

jgehrcke commented May 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

copy-pr-bot bot commented May 14, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jgehrcke May 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

klueska commented May 14, 2025

Uh oh!

jgehrcke commented May 14, 2025

Uh oh!

jgehrcke commented May 14, 2025

Uh oh!

jgehrcke May 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jgehrcke May 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jgehrcke commented May 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jgehrcke May 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jgehrcke May 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jgehrcke May 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

elezar left a comment

Choose a reason for hiding this comment

Uh oh!

jgehrcke commented May 16, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

jgehrcke commented May 14, 2025 •

edited

Loading

jgehrcke May 14, 2025 •

edited

Loading

jgehrcke May 14, 2025 •

edited

Loading

jgehrcke May 15, 2025 •

edited

Loading

jgehrcke commented May 14, 2025 •

edited

Loading

jgehrcke May 15, 2025 •

edited

Loading

jgehrcke May 16, 2025 •

edited

Loading

jgehrcke May 16, 2025 •

edited

Loading