Skip to content

Conversation

@jgehrcke
Copy link
Collaborator

@jgehrcke jgehrcke commented May 14, 2025

A pragmatic proposal for addressing #355.

Consider a scenario with one controller and four plugin pods:

$ kubectl get pods -A | grep dra-driver
nvidia-dra-driver-gpu-6894   gpu-operator-node-feature-discovery-prune-fcxrx               0/1     Completed   0             57d
nvidia-dra-driver-gpu        nvidia-dra-driver-gpu-controller-9f5846bdb-bh2vq              1/1     Running     0             3h45m
nvidia-dra-driver-gpu        nvidia-dra-driver-gpu-kubelet-plugin-dk7hp                    1/1     Running     0             3h45m
nvidia-dra-driver-gpu        nvidia-dra-driver-gpu-kubelet-plugin-lmhhn                    1/1     Running     0             3h45m
nvidia-dra-driver-gpu        nvidia-dra-driver-gpu-kubelet-plugin-mvmc7                    1/1     Running     0             3h45m
nvidia-dra-driver-gpu        nvidia-dra-driver-gpu-kubelet-plugin-t2ddp                    1/1     Running     0             3h45m

Without the patch, the following type/name selector for the controller yields all pods, and we after all do not see controller logs:

$ kubectl logs -n nvidia-dra-driver-gpu deploy/nvidia-dra-driver-gpu-controller 
Found 5 pods, using pod/nvidia-dra-driver-gpu-kubelet-plugin-npxss
...

With the patch, the same command yields only controller logs, as expected:

$ kubectl logs kubectl logs -n nvidia-dra-driver-gpu deploy/nvidia-dra-driver-gpu-controller
I0514 09:42:42.216011       1 envvar.go:172] "Feature gate default state" feature="WatchListClient" enabled=false
I0514 09:42:42.216127       1 envvar.go:172] "Feature gate default state" feature="ClientsAllowCBOR" enabled=false
...

With the patch, the following command refers to just the plugin pods (as expected):

$ kubectl logs ds/nvidia-dra-driver-gpu-kubelet-plugin -n nvidia-dra-driver-gpu | head -n2
Found 4 pods, using pod/nvidia-dra-driver-gpu-kubelet-plugin-dk7hp
I0514 09:42:42.104011       1 envvar.go:172] "Feature gate default state" feature="InOrderInformers" enabled=true

(without the patch, this may erroneously yield controller logs as shown in #355).

Before the change, these are the labels for the controller selector:

$ kubectl describe deployment -n nvidia-dra-driver-gpu   nvidia-dra-driver-gpu-controller        
...
Selector:               app.kubernetes.io/instance=nvidia-dra-driver-gpu-1747137149,app.kubernetes.io/name=nvidia-dra-driver-gpu
...

I would love for others to chime in, but as far as I now understand the purpose of this selector the current value is wrong as it resolves to all pods deployed by the NVIDIA DRA Driver for GPUs. Instead, we want it to resolve to just the controller pods.

After the change:

$ kubectl describe deployment -n nvidia-dra-driver-gpu   nvidia-dra-driver-gpu-controller| grep Selector
Selector:               nvidia-dra-driver-gpu-component=controller

CC @nojnhuh

@copy-pr-bot
Copy link

copy-pr-bot bot commented May 14, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

{{- toYaml . | nindent 8 }}
{{- end }}
labels:
{{- include "nvidia-dra-driver-gpu.templateLabels" . | nindent 8 }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that templateLabels still includes selectorLabels. Is this intended now that selectorLabels have been removed from selector.matchLabels above?

Copy link
Collaborator Author

@jgehrcke jgehrcke May 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that templateLabels still includes selectorLabels

Yes. That's fine.

I wanted to be minimally invasive. To me, that first and foremost meant that I did not want to remove any labels.

In the longer run, I want us to question the purpose of "selectorLabels" defined in our helm chart. What was the premise behind it? I assume it may be ill-posed in our case here

edit: okay, changed to using the selectorLabels approach again (now it's a dynamic label)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another update:

Note that templateLabels still includes selectorLabels

not anymore, see commentary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks: I think making selectorLabels more specific (and a reduced set is correct).

@klueska
Copy link
Collaborator

klueska commented May 14, 2025

I'd rather see something that still used the selectorLabels / templateLabels, but maybe it takes an optional input parameter to the template function where we can insert a component-type into all the labels.

@jgehrcke
Copy link
Collaborator Author

maybe it takes an optional input parameter to the template function where we can insert a component-type into all the labels

Can look into that. Are you ok with the rendered outcome, though? That is, with nvidia-dra-driver-gpu-component: controller and nvidia-dra-driver-gpu-component: kubelet-plugin.

@jgehrcke jgehrcke force-pushed the jp/better-selectors branch from b7c1b58 to 6a203b9 Compare May 14, 2025 16:36
@jgehrcke
Copy link
Collaborator Author

Changed the approach to make the selectorLabels() function consume the component name. Tested:

nvidia@SC-Starwars-MAB5-B00:k8s-dra-driver-gpu$ kubectl logs deploy/nvidia-dra-driver-gpu-controller -n nvidia-dra-driver-gpu | head -n2
I0514 16:32:56.234618       1 envvar.go:172] "Feature gate default state" feature="WatchListClient" enabled=false
I0514 16:32:56.234668       1 envvar.go:172] "Feature gate default state" feature="ClientsAllowCBOR" enabled=false
nvidia@SC-Starwars-MAB5-B00:k8s-dra-driver-gpu$ kubectl logs -n nvidia-dra-driver-gpu ds/nvidia-dra-driver-gpu-kubelet-plugin | head -n2
Found 4 pods, using pod/nvidia-dra-driver-gpu-kubelet-plugin-fjw78
I0514 16:32:56.126250       1 envvar.go:172] "Feature gate default state" feature="InformerResourceVersion" enabled=false
I0514 16:32:56.126366       1 envvar.go:172] "Feature gate default state" feature="InOrderInformers" enabled=true

{{- if .Values.selectorLabelsOverride -}}
{{ toYaml .Values.selectorLabelsOverride }}
{{- else -}}
{{ include "nvidia-dra-driver-gpu.templateLabels" . }}
Copy link
Collaborator Author

@jgehrcke jgehrcke May 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need the template labels in the selector labels.

The current logic will write something like nvidia-dra-driver-gpu-component=controller into the matchLabels selector. That is precise, and not conflicting (unless one wants it to conflict with something).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the nvidia-dra-driver-gpu.templateLabels helper above also pulls in Values.selectorLabelsOverride. Here we only include the templateLabels if the override is not set -- in which case the templateLabels will NOT include the selector labels. This does seem recursive / convoluted though.

Copy link
Collaborator Author

@jgehrcke jgehrcke May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking deeply. Very appreciated.

I will review this more carefully tomorrow.

I want to understand use cases that we want to support, and then remove complexity that is not currently needed.

(this nifty setup will have to be documented better so that it can survive if it must survive!)

Based on what I understand today, I don't think that template labels and selector labels should be entangled.

I think template labels, as far as I understood, are just labels added to any and all component deployed by this Helm chart. That looks good, and keeps things more or less organized.

Selector labels, on the other hand, have a deep technical (and internal, for hat matter) meaning. I am not sure if we even want to let people override those.

Will think about it more tomorrow.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went into history. NVIDIA/k8s-device-plugin@4820ac1#r157221509.

Note that the nvidia-dra-driver-gpu.templateLabels helper above also pulls in Values.selectorLabelsOverride

I think it's better to change that then, here. Will do.

It's already a challenge for me to put into English words what a good distinction is between "common labels" and "template labels". I will work on that.

For me, it's easier to see a clear distinction between [common labels, template labels] and [selector labels]. I want to reinforce that distinction.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a new commit with commentary declaring intent. Not perfect, but I think it's important for us to express the system that we want to use here.

nvidia-dra-driver-gpu-component: {{ .componentName }}
{{- end }}
{{- else -}}
fail "selectorLabels: both arguments are required: context, componentName"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I adopted this approach from some blog post: named and required function arguments, with a more or less good error message upon not providing arguments.

Note that the outer chart context in the function is now not . anymore, but .context.

Don't we love learning DSLs...and their funkiness :)

refs:
https://blog.palark.com/advanced-helm-templating/
https://jiminbyun.medium.com/understanding-the-chart-context-in-helm-templates-0524b520b120

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify: The fail here just checks that we're "invoking" this helper correctly? Meaning we need to set both the context and the component name where we include it in some other helm template.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! It's like an assert statement.

@jgehrcke jgehrcke force-pushed the jp/better-selectors branch from b7e0091 to 3042b5a Compare May 14, 2025 16:44
@jgehrcke
Copy link
Collaborator Author

jgehrcke commented May 14, 2025

but maybe it takes an optional input parameter

I made it a required parameter. Because not encoding that component-specific aspect led to the previous over-selection of pods.

Working in this Helm templating logic feels rather brittle. Let's land an improvement, and then iterate later.

The public-facing aspects here of course we need to more or less settle on (the label key/value names, which people may end up using in their kubectl commands, and we don't want to break those too often :-)).

{{- end }}
{{- if and (hasKey . "componentName") (hasKey . "context") -}}
{{- if .context.Values.selectorLabelsOverride -}}
{{ toYaml .context.Values.selectorLabelsOverride }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Since this is indented, is that going to mess with indentation in the rendered template?

Copy link
Collaborator Author

@jgehrcke jgehrcke May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the question.

I still don't know exactly, but based on a little bit more reading I think you might have the right hunch here and I will add a commit to dedent.

Most templating languages bring this type of challenge. They either respect or ignore the indentation in the source code. In Jinja, for example, it's ignored (so you can indent the source code to make it more readable w/o having to worry about breaking the output). That's not always the perfect choice either (which reminds me of pallets/jinja#178 (comment) from 2013).

I obviously indented for making things slightly more readable. And I could test that manually. Better: we need a test suite for our Helm chart that not only renders, but also deploys for different input parameters. Otherwise we cannot make such changes with confidence.

This question brought me to https://www.reddit.com/r/kubernetes/comments/1be15b8/helm_templating_language_was_a_mistake_90_of_helm/ which has this statement in the title:

90%+ of helm chart commits are bug fixes for indent

The discussion also brought me to https://leebriggs.co.uk/blog/2019/02/07/why-are-we-templating-yaml in which the author says

Needless to say, this isn’t what I want to spend my time doing. If fiddling around with whitespace requirements in a templating system doing something it’s not really designed for is what suits you, then I’m not going to stop you.

For the record, I had great success working with jsonnet in the past in the context of https://github.com/prometheus-operator/kube-prometheus.

Btw, the official Helm templating docs are wishywashy on the topic:
image
(https://helm.sh/docs/chart_best_practices/templates/)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jgehrcke jgehrcke force-pushed the jp/better-selectors branch from d5af042 to 83f6677 Compare May 16, 2025 11:11
@jgehrcke jgehrcke force-pushed the jp/better-selectors branch from 83f6677 to 5f6c7a5 Compare May 16, 2025 11:16
app.kubernetes.io/instance: {{ .Release.Name }}
{{- if .Values.selectorLabelsOverride }}
{{ toYaml .Values.selectorLabelsOverride }}
{{- end }}
Copy link
Collaborator Author

@jgehrcke jgehrcke May 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep the selector label (single label, I propose) strictly independent.

"Template labels" I believe always meant "pod template labels" and I think it really always meant "apply a common set of labels to all pods of this Helm chart".

By that nature, this set of labels cannot distinguish individual groups of pods within the Helm chart.

Which is what "selector labels" want to do.

In that sense, those two concepts (template labels and selector labels) are generally incompatible. Which goes unnoticed for the special case of there being just one component (one relevant group of pods) in the Helm chart.

selector:
matchLabels:
{{- include "nvidia-dra-driver-gpu.selectorLabels" . | nindent 6 }}
{{- include "nvidia-dra-driver-gpu.selectorLabels" (dict "context" . "componentName" "controller") | nindent 6 }}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, I have now learned that this left-chomping-nindent-technique is the holy grail of Helm templates: helm/helm#3854 (comment)

{{- define "nvidia-dra-driver-gpu.labels" -}}
helm.sh/chart: {{ include "nvidia-dra-driver-gpu.chart" . }}
{{ include "nvidia-dra-driver-gpu.templateLabels" . }}
{{- if .Chart.AppVersion }}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we always define that (and if we don't, I want to know about it)

{{- end }}

{{/*
Template labels
Copy link
Collaborator Author

@jgehrcke jgehrcke May 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly allergic to a thing named templateLabels being documented with "Template labels" :)

could be many labels, but we want to use just one (with a sufficiently
unique key).
TOOD: remove the override feature, or make the override work per-component.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove the override before GA? @klueska do we know of anyone with SPECIFIC selector labels?

Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jgehrcke.

I think this looks good from my side. Note that I don't have all the context required for all possible label combinations, but this PR does clean up the selector labels specifically. It also decouples these from the other sets of labels that are defined in the helpers which is a cleaner solution to start with.

@jgehrcke
Copy link
Collaborator Author

Thanks for all the good feedback here. I am sure we want to do more work on this front (or even have to).

@jgehrcke jgehrcke merged commit dc61b93 into NVIDIA:main May 16, 2025
7 checks passed
@klueska klueska added this to the v25.3.0 milestone Aug 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

3 participants