@Gacko
Contributor

@Gacko Gacko commented Oct 29, 2025

Hello everyone!

This is my first contribution to this repository, so please feel free to point me to any requirement for contributions I might have missed.

This change in particular adds support for finding the nvidia-smi binary on distributions which have it installed to /opt/bin on the host. Flatcar Linux for example offers an NVIDIA sysext which installs the binaries and drivers when run on a GPU instance. This way you do not need to provide them via a DaemonSet or sidecar container.

Sadly, the current approach of searching for the nvidia-smi binary does not cover /opt/bin. The Flatcar NVIDIA sysext does create a symlink /usr/bin/nvidia-smi, but it points to /opt/bin/nvidia-smi. So when the kubelet plugin pre-start script or the kubelet plugins themselves look for nvidia-smi in /driver-root/usr/bin, they only find a symlink pointing outside the host filesystem mounted below /driver-root and therefore fail to access the actual binary.
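The dead-end symlink can be reproduced without a GPU. The following is a minimal sketch, assuming a temporary directory stands in for the host filesystem mounted at /driver-root. The helper names and the link target `/opt/bin/nvidia-smi-absent-in-container` are hypothetical and chosen only so that the target does not exist where the sketch runs.

```python
import os
import tempfile
from typing import Tuple


def is_dangling_symlink(path: str) -> bool:
    """Return True if path is a symlink whose target cannot be reached."""
    # os.path.exists() follows symlinks, so it is False for a broken link.
    return os.path.islink(path) and not os.path.exists(path)


def build_fake_host_root(root: str, link_target: str) -> None:
    """Lay out a host-like tree below root (a stand-in for /driver-root)."""
    os.makedirs(os.path.join(root, "opt", "bin"))
    os.makedirs(os.path.join(root, "usr", "bin"))
    # The real binary lives in /opt/bin on the host.
    with open(os.path.join(root, "opt", "bin", "nvidia-smi"), "w") as f:
        f.write("#!/bin/sh\n")
    # The host's /usr/bin/nvidia-smi is an absolute symlink. Inside the
    # container it resolves against the container's own /, not against root.
    os.symlink(link_target, os.path.join(root, "usr", "bin", "nvidia-smi"))


def demo() -> Tuple[bool, bool]:
    with tempfile.TemporaryDirectory() as driver_root:
        # Hypothetical target, assumed absent in the current environment.
        build_fake_host_root(driver_root, "/opt/bin/nvidia-smi-absent-in-container")
        usr_bin_dead_end = is_dangling_symlink(
            os.path.join(driver_root, "usr", "bin", "nvidia-smi"))
        opt_bin_real_file = os.path.isfile(
            os.path.join(driver_root, "opt", "bin", "nvidia-smi"))
        return usr_bin_dead_end, opt_bin_real_file


print(demo())
```

Both values come out true in this setup: the entry below usr/bin is a broken link, while the file below opt/bin is reachable through the mounted root.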

My change adds /opt/bin to the search directories of the pre-start script and both kubelet plugins, so they can find the real binary there before looking for it in /usr/bin where they would only find a dead end symlink.
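As a rough illustration of the search behaviour, here is a minimal sketch of an ordered directory lookup below a driver root. The directory list, the function name `find_binary`, and the choice to skip dangling symlinks are assumptions made for this example and are not copied from the pre-start script or the plugins.

```python
import os
import tempfile
from typing import Optional, Sequence, Tuple

# Illustrative search order; the driver's real list may differ.
SEARCH_DIRS = ("/opt/bin", "/usr/bin", "/usr/sbin", "/bin", "/sbin")


def find_binary(root: str, name: str,
                dirs: Sequence[str] = SEARCH_DIRS) -> Optional[str]:
    """Return the first dir in dirs that holds name as a usable file below root."""
    for d in dirs:
        candidate = os.path.join(root, d.lstrip("/"), name)
        # isfile() follows symlinks, so a dangling link does not count.
        if os.path.isfile(candidate):
            return d
    return None


def demo() -> Tuple[Optional[str], Optional[str]]:
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "opt", "bin"))
        os.makedirs(os.path.join(root, "usr", "bin"))
        open(os.path.join(root, "opt", "bin", "nvidia-smi"), "w").close()
        # Dangling absolute symlink, as seen from inside the container.
        os.symlink("/opt/bin/nvidia-smi-absent-in-container",
                   os.path.join(root, "usr", "bin", "nvidia-smi"))
        without_opt = find_binary(root, "nvidia-smi", SEARCH_DIRS[1:])
        with_opt = find_binary(root, "nvidia-smi")
        return without_opt, with_opt


print(demo())
```

Without /opt/bin in the list nothing usable is found in this layout. With it, the lookup stops at /opt/bin before it ever reaches the symlink in /usr/bin.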

Additionally I fixed a minor bug in the pre-start script: /sbin was duplicated there, while it most likely should be /usr/sbin and /sbin. Also I re-ordered the paths to be aligned with the kubelet plugins.

Thanks for reviewing my PR in advance
Marco

@copy-pr-bot

copy-pr-bot bot commented Oct 29, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Gacko
Contributor Author

Gacko commented Oct 29, 2025

The force-push was meant to sign-off the commit and satisfy CI.

@jgehrcke
Collaborator

> This is my first contribution to this repository

Sweet!

> they only find a symlink pointing outside the host filesystem mounted below /driver-root and therefore fail to access the actual binary.

Thanks for the precise description.

> My change adds /opt/bin to the search directories of the pre-start script and both kubelet plugins, so they can find the real binary there

My most important question is: did you test that this works -- did this lead to a (from your point of view) functional state of the DRA driver? Which plugin did you try?

If your answer is that "yes, I could allocate GPUs and/or ComputeDomains that way" I am very happy to take your word for it and land this. As you can imagine, testing the different path / placement variants systematically is a bit of a hassle.

> Additionally I fixed a minor bug in the pre-start script: /sbin was duplicated there, while it most likely should be /usr/sbin and /sbin

Thanks for noticing.

> Also I re-ordered the paths to be aligned with the kubelet plugins.

👍 :)

@jgehrcke
Collaborator

jgehrcke commented Oct 29, 2025

Three more points, if you allow :)

  1. What's your involvement with Flatcar container Linux?

  2. A curious question I have: where is libnvidia-ml.so.1 placed in your case?

  3. Can you maybe change the commit message to kubelet plugins: add /opt/bin to binary search paths?

@jgehrcke
Collaborator

jgehrcke commented Oct 29, 2025

Question to us: could this (considering /opt) in any way be a security concern?

@Gacko
Contributor Author

Gacko commented Oct 30, 2025

Good morning @jgehrcke!

Thank you for your fast response - and sorry for my delayed one, I wasn't expecting that!

> My most important question is: did you test that this works -- did this lead to a (from your point of view) functional state of the DRA driver? Which plugin did you try?
>
> If your answer is that "yes, I could allocate GPUs and/or ComputeDomains that way" I am very happy to take your word for it and land this. As you can imagine, testing the different path / placement variants systematically is a bit of a hassle.

It took me some time to get the GitHub Actions running in my fork since you hard-coded NVIDIA in some places, which isn't very useful when trying to publish the image and the chart to your own registry, but I finally got everything up and running. 😅

Since I was initially confronted with the kubelet plugin pre-start script not being able to find the nvidia-smi binary, the scope of this PR is limited to adapting the search behavior. If at any point I am confronted with other problems, I will come up with an issue and, if I am able to fix it, another PR.

That said, I can confirm: Yes, I tested my change, and with it the pre-start script is able to find the nvidia-smi binary and the libnvidia-ml.so.1 library. Also the GPU kubelet plugin I am currently using is coming up fine, even though I had to add a NetworkPolicy to allow access to the API server. I haven't tested anything beyond that yet, so I cannot guarantee that the plugin itself works as expected. I also haven't tested the Compute Domains kubelet plugin, as it isn't required for our use case, but I will do so later.

Below you can find the container logs of both the pre-start script and the GPU kubelet plugin. They also give you some details about my environment:

kubelet plugin pre-start script
create symlink: /driver-root -> /driver-root-parent/
2025-10-30T08:15:49Z  /driver-root (/ on host): nvidia-smi: '/driver-root/opt/bin/nvidia-smi', libnvidia-ml.so.1: '/driver-root/usr/lib64/libnvidia-ml.so.1', current contents: [].
invoke: env -i LD_PRELOAD=/driver-root/usr/lib64/libnvidia-ml.so.1 /driver-root/opt/bin/nvidia-smi
Thu Oct 30 08:15:49 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02             Driver Version: 535.230.02   CUDA Version: N/A      |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:1E.0 Off |                    0 |
| N/A   29C    P8              14W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
nvidia-smi returned with code 0: success, leave
GPU kubelet plugin
I1030 08:15:50.484600       1 utils.go:43] 
Feature gates: map[string]bool{"AllAlpha":false, "AllBeta":false, "ContextualLogging":true, "IMEXDaemonsWithDNSNames":true, "LoggingAlphaOptions":false, "LoggingBetaOptions":true, "MPSSupport":false, "TimeSlicingSettings":false}
Verbosity: 4
Flags: (*main.Flags)({
  kubeClientConfig: (flags.KubeClientConfig) {
    KubeConfig: (string) "",
    KubeAPIQPS: (float64) 5,
    KubeAPIBurst: (int) 10
  },
  nodeName: (string) (len=41) "ip-10-0-90-166.eu-west-2.compute.internal",
  namespace: (string) (len=11) "kube-system",
  cdiRoot: (string) (len=12) "/var/run/cdi",
  containerDriverRoot: (string) (len=12) "/driver-root",
  hostDriverRoot: (string) (len=1) "/",
  nvidiaCDIHookPath: (string) "",
  imageName: (string) (len=54) "ghcr.io/gacko/k8s-dra-driver-gpu:v25.12.0-dev-426973ab",
  kubeletRegistrarDirectoryPath: (string) (len=33) "/var/lib/kubelet/plugins_registry",
  kubeletPluginsDirectoryPath: (string) (len=24) "/var/lib/kubelet/plugins",
  healthcheckPort: (int) 51516
})
I1030 08:15:50.485155       1 envvar.go:172] "Feature gate default state" feature="WatchListClient" enabled=false
I1030 08:15:50.485174       1 envvar.go:172] "Feature gate default state" feature="ClientsAllowCBOR" enabled=false
I1030 08:15:50.485178       1 envvar.go:172] "Feature gate default state" feature="ClientsPreferCBOR" enabled=false
I1030 08:15:50.485184       1 envvar.go:172] "Feature gate default state" feature="InOrderInformers" enabled=true
I1030 08:15:50.485189       1 envvar.go:172] "Feature gate default state" feature="InformerResourceVersion" enabled=false
I1030 08:15:50.507678       1 device_state.go:71] using devRoot=/driver-root
I1030 08:15:50.592389       1 draplugin.go:597] "Starting"
I1030 08:15:50.592764       1 nonblockinggrpcserver.go:90] "GRPC server started" logger="dra" endpoint="/var/lib/kubelet/plugins/gpu.nvidia.com/dra.sock"
I1030 08:15:50.593000       1 nonblockinggrpcserver.go:90] "GRPC server started" logger="registrar" endpoint="/var/lib/kubelet/plugins_registry/gpu.nvidia.com-reg.sock"
I1030 08:15:50.595998       1 resourceslicecontroller.go:528] "Starting ResourceSlice informer and waiting for it to sync" logger="ResourceSlice controller"
I1030 08:15:50.596012       1 health.go:100] starting healthcheck service at [::]:51516
I1030 08:15:50.596092       1 reflector.go:358] "Starting reflector" type="*v1.ResourceSlice" resyncPeriod="0s" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:534"
I1030 08:15:50.596110       1 reflector.go:404] "Listing and watching" type="*v1.ResourceSlice" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:534"
I1030 08:15:50.602545       1 reflector.go:436] "Caches populated" type="*v1.ResourceSlice" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:534"
I1030 08:15:51.596801       1 resourceslicecontroller.go:543] "ResourceSlice informer has synced" logger="ResourceSlice controller"
I1030 08:15:51.596851       1 resourceslicecontroller.go:184] "Starting" logger="ResourceSlice controller"
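The `invoke:` line in the pre-start log above runs nvidia-smi with an emptied environment (`env -i`) and `LD_PRELOAD` pointing at the library found below the driver root. The following is a minimal sketch of that invocation pattern, assuming the two paths have already been discovered. The function names are hypothetical and this is not the script's actual code.

```python
import os
import subprocess
from typing import Dict, List, Tuple


def build_invocation(driver_root: str, smi_rel: str,
                     lib_rel: str) -> Tuple[List[str], Dict[str, str]]:
    """Build argv and a minimal environment, mirroring `env -i LD_PRELOAD=... cmd`."""
    smi = os.path.join(driver_root, smi_rel.lstrip("/"))
    lib = os.path.join(driver_root, lib_rel.lstrip("/"))
    # Only LD_PRELOAD is set; nothing is inherited from the caller.
    return [smi], {"LD_PRELOAD": lib}


def run_smi(driver_root: str, smi_rel: str, lib_rel: str) -> int:
    """Run the binary and return its exit code (0 means success)."""
    argv, env = build_invocation(driver_root, smi_rel, lib_rel)
    # Passing env= replaces the whole environment of the child process.
    return subprocess.run(argv, env=env, check=False).returncode


argv, env = build_invocation("/driver-root", "/opt/bin/nvidia-smi",
                             "/usr/lib64/libnvidia-ml.so.1")
print(argv, env)
```

`run_smi` is only defined here, not called, since it needs a host with the driver installed.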

> Three more points, if you allow :)

Sure, let's go!

> 1. What's your involvement with Flatcar container Linux?

I am working for Giant Swarm, and our Kubernetes distribution is based on Flatcar Linux. You didn't ask this, but since I'm a maintainer myself, I feel I should clarify one point, because it's a thought that often comes to my mind when people create pull requests to Ingress NGINX: I could probably have addressed this issue at Flatcar, specifically in their sysext-bakery project, where the binaries and drivers we are using most likely come from. But since other Linux distributions could also use /opt for placing binaries and/or libraries, it seemed more sustainable to me to change this on your end.

Overall we are currently trying to get our product compliant with the Kubernetes AI Conformance Tests. In Kubernetes v1.34+ these require the DRA feature to work, and currently I am testing the NVIDIA DRA Driver on a Kubernetes v1.33 cluster on AWS, specifically g4dn instances with Tesla T4 Tensor GPUs, with the DRA Feature Gate enabled.

> 2. A curious question I have: where is libnvidia-ml.so.1 placed in your case?

I think you can see this from the kubelet plugin pre-start script logs above: /driver-root/usr/lib64/libnvidia-ml.so.1 in the container or /usr/lib64/libnvidia-ml.so.1 on the host.

Interestingly the libraries get provided as real files. When the NVIDIA runtime sysext in Flatcar is enabled, it is mounting them directly to /usr/lib and /opt/nvidia/current/usr/lib, so no symlinks. The only file being provided as a symlink is the nvidia-smi binary and this is then pointing to /opt/bin/nvidia-smi.

> 3. Can you maybe change the commit message to kubelet plugins: add /opt/bin to binary search paths?

Sure, will do so in a second!

As said above, I will now also try the Compute Domains plugin, but I cannot promise it's working in our use case. Also I will try to get some actual workload allocated to the instances with the DRA Driver plugin enabled.

Marco

@Gacko
Contributor Author

Gacko commented Oct 30, 2025

I changed the commit message.

@Gacko Gacko changed the title from "kubelet plugin: Improve nvidia-smi search." to "kubelet plugins: add /opt/bin to binary search paths" Oct 30, 2025
@Gacko
Contributor Author

Gacko commented Oct 30, 2025

I just tested the Compute Domains kubelet plugin: It's basically coming up, but fails to find nvidia-caps-imex-channels device in /proc/devices, most likely because the instance type I am using isn't supporting it.

So best I can do for now is this:

I1030 12:58:33.463792       1 utils.go:43] 
Feature gates: map[string]bool{"AllAlpha":false, "AllBeta":false, "ContextualLogging":true, "IMEXDaemonsWithDNSNames":true, "LoggingAlphaOptions":false, "LoggingBetaOptions":true, "MPSSupport":false, "TimeSlicingSettings":false}
Verbosity: 4
Flags: (*main.Flags)({
  kubeClientConfig: (flags.KubeClientConfig) {
    KubeConfig: (string) "",
    KubeAPIQPS: (float64) 5,
    KubeAPIBurst: (int) 10
  },
  nodeName: (string) (len=41) "ip-10-0-103-42.eu-west-2.compute.internal",
  namespace: (string) (len=11) "kube-system",
  cdiRoot: (string) (len=12) "/var/run/cdi",
  containerDriverRoot: (string) (len=12) "/driver-root",
  hostDriverRoot: (string) (len=1) "/",
  nvidiaCDIHookPath: (string) "",
  kubeletRegistrarDirectoryPath: (string) (len=33) "/var/lib/kubelet/plugins_registry",
  kubeletPluginsDirectoryPath: (string) (len=24) "/var/lib/kubelet/plugins",
  healthcheckPort: (int) 51515
})
I1030 12:58:33.464180       1 envvar.go:172] "Feature gate default state" feature="ClientsPreferCBOR" enabled=false
I1030 12:58:33.464197       1 envvar.go:172] "Feature gate default state" feature="InOrderInformers" enabled=true
I1030 12:58:33.464201       1 envvar.go:172] "Feature gate default state" feature="InformerResourceVersion" enabled=false
I1030 12:58:33.464205       1 envvar.go:172] "Feature gate default state" feature="WatchListClient" enabled=false
I1030 12:58:33.464208       1 envvar.go:172] "Feature gate default state" feature="ClientsAllowCBOR" enabled=false
I1030 12:58:33.473269       1 main.go:181] shutdown
Error: error creating driver: failed to create device library: error getting nvcap for IMEX channel '0': error getting device major: error parsing '/proc/devices': unexpected regex match: []
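The error at the end of this log comes from looking up a device major number in /proc/devices. The following is a minimal sketch of such a lookup, assuming the usual "major name" line format of /proc/devices. The sample contents and the major number 237 are made up for illustration, and the function is not the plugin's actual implementation.

```python
import re
from typing import Optional

IMEX_DEVICE = "nvidia-caps-imex-channels"

# Made-up excerpts in /proc/devices format; the numbers are illustrative.
SAMPLE_WITH_IMEX = """Character devices:
  1 mem
195 nvidia
237 nvidia-caps-imex-channels

Block devices:
  8 sd
"""

SAMPLE_WITHOUT_IMEX = """Character devices:
  1 mem
195 nvidia

Block devices:
  8 sd
"""


def device_major(proc_devices_text: str, name: str) -> Optional[int]:
    """Return the major number registered for name, or None if it is absent."""
    # Match a full line: optional indent, digits, whitespace, exact name.
    pattern = re.compile(
        r"^[ \t]*(\d+)[ \t]+" + re.escape(name) + r"[ \t]*$", re.MULTILINE)
    match = pattern.search(proc_devices_text)
    return int(match.group(1)) if match else None


print(device_major(SAMPLE_WITH_IMEX, IMEX_DEVICE))
print(device_major(SAMPLE_WITHOUT_IMEX, IMEX_DEVICE))
```

On a host where the entry is not registered the lookup has nothing to return, which corresponds to the empty regex match reported in the log.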

@Gacko
Contributor Author

Gacko commented Oct 30, 2025

It's currently failing at this point: https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/main/cmd/compute-domain-kubelet-plugin/nvlib.go#L102.

The Compute Domains kubelet plugin already loaded the driver and found the nvidia-smi binary. So we should be good even if I couldn't test it beyond that.

@Gacko
Contributor Author

Gacko commented Oct 30, 2025

> I just tested the Compute Domains kubelet plugin: It's basically coming up, but fails to find nvidia-caps-imex-channels device in /proc/devices, most likely because the instance type I am using isn't supporting it.

Ok, after some reading I think I got it. It's just not possible to implement this on AWS with normal instance types. Took me some time to understand the IMEX channels, but from what I understood these require the GPUs to be physically connected and configured, which is clearly not possible on AWS. Or at least only with their specific p6e-gb200 instance type.

@jgehrcke
Collaborator

jgehrcke commented Nov 3, 2025

> Overall we are currently trying to get our product compliant with the Kubernetes AI Conformance Tests. In Kubernetes v1.34+ these require the DRA feature to work, and currently I am testing the NVIDIA DRA Driver on a Kubernetes v1.33 cluster on AWS, specifically g4dn instances with Tesla T4 Tensor GPUs, with the DRA Feature Gate enabled.

Lovely. Thank you for the background!

> Interestingly the libraries get provided as real files. When the NVIDIA runtime sysext in Flatcar is enabled, it is mounting them directly to /usr/lib and /opt/nvidia/current/usr/lib, so no symlinks. The only file being provided as a symlink is the nvidia-smi binary and this is then pointing to /opt/bin/nvidia-smi.

Well, as you said -- this may also be worth changing over there. But whatever is least friction here I think is a great path forward.

> It took me some time to get the GitHub Actions running in my fork since you hard-coded NVIDIA in some places,

> Also the GPU kubelet plugin I am currently using is coming up fine

> I will now also try the Compute Domains plugin, but I cannot promise it's working in our use case.

Oh wow, you really went through a bunch of effort :-). I did not mean to imply that you have to do all these things. I was merely trying to ask what you did in terms of verification / testing, so that I can better understand what we still have to do here. But big thank you for trying to do a bunch of verification work. You seemingly also have learned some things along the way about the funky intrinsics of our DRA driver here, the (somewhat) lacking purpose of our CI, and the fact that (yes) meaningful ComputeDomain plugin testing can only be done in environments that actually support multi-node NVLink :)

I'd say -- given the effort that went into this, I'd just love to go ahead and merge this now.

I still want to understand if we have potential security challenges down the line by generally (always) looking into /opt. Maybe @elezar has an opinion here.

Collaborator

@jgehrcke jgehrcke left a comment

LGTM. Thanks Marco for your great communication around all those details here.

I hope this can stay in the code base in the longer run. We will have to see if this should maybe be back-ported to the 25.8.x branch.

@jgehrcke jgehrcke merged commit 852b56f into NVIDIA:main Nov 3, 2025
1 check passed
@Gacko Gacko deleted the vkptt branch November 4, 2025 06:05
@Gacko
Contributor Author

Gacko commented Nov 4, 2025

Thanks for merging this - and also considering back-porting!

> Well, as you said -- this may also be worth changing over there. But whatever is least friction here I think is a great path forward.

Actually I am currently investigating this and trying to get a deeper understanding. So far I have learnt that Flatcar uses systemd-sysext to provide additional components by mounting them from images or directories as an overlay filesystem onto /usr and similar directories.

Some, like Docker or containerd, are pre-installed and just need to get mounted. Others can be enabled via config and then get downloaded at boot time. And that's already the issue: if you enabled the officially provided NVIDIA Drivers sysext, the drivers would get downloaded and mounted to /usr on boot, but on each and every instance, since there is no "does this instance have GPUs?" logic.

But: This logic actually exists in their nvidia systemd service. This service is capable of detecting an NVIDIA GPU at runtime and only then downloading the drivers. But sadly it places the nvidia-smi binary in /opt/bin and the drivers under /opt/nvidia/current/usr/lib64, where current is a symlink to the actual driver directory (e.g. "535.230.02/6.6.106-flatcar", so /opt/nvidia/535.230.02/6.6.106-flatcar).

At this point neither the drivers nor the nvidia-smi binary is available in /usr. So the dynamic driver installation I described above puts everything into /opt.

But where do the libraries and the symlink I see in /usr come from? Well, that's tricky... we are also using the NVIDIA Runtime sysext. And this comes with this little file. This script is basically responsible for creating the symlink I see and mounting the whole /opt/nvidia/current as an overlay filesystem to /, with which it provides everything in /opt/nvidia/current/usr also to /usr. 🤦

tl;dr:

  1. With just the Flatcar NVIDIA Drivers sysext enabled we would not need this change here, because the sysext already mounts everything to /usr. The nvidia-smi would be a real file and not just a symlink. Downside: Each and every instance, independent of it having a GPU or not, always downloads and installs the drivers, meaning increased boot time for downloading and installing the sysext and increased disk space usage (and maybe security risk?) for always having the drivers and binaries installed.
  2. With just the drivers installed dynamically at boot time (if there is a GPU), we can save bandwidth and disk space, but the binary is placed only in /opt/bin (no symlink) and the libraries in /opt/nvidia/current/usr/lib64. In this case my change would detect the binary but fail to detect the library, as it only exists in /opt/nvidia/current/usr/lib64, which isn't covered by the current search logic. Changing the driver root to /opt/nvidia/current via Helm values would enable the plugin to find the libraries, but no longer the binaries.
  3. So last but not least my change only helps when using the Flatcar NVIDIA Runtime sysext in conjunction with the dynamically downloaded drivers as the former makes sure everything is available either in /opt/bin or /usr/lib64 (see the script I linked).
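The three setups in the list above can be compared with a small simulation. This is a minimal sketch, assuming simplified search lists (`BIN_DIRS`, `LIB_DIRS`) and file layouts reconstructed from the descriptions in this thread. The names are hypothetical, `current` is modelled as a plain directory instead of a symlink, and the /usr/bin symlink of the third setup is left out because it does not resolve from inside the container anyway.

```python
import os
import tempfile
from typing import Dict, Optional, Sequence, Tuple

BIN_DIRS = ("/opt/bin", "/usr/bin")    # illustrative subset of a search list
LIB_DIRS = ("/usr/lib64", "/usr/lib")  # illustrative subset of a search list
SMI, LIB = "nvidia-smi", "libnvidia-ml.so.1"

# Files assumed present on the host in each setup.
LAYOUTS: Dict[str, Sequence[str]] = {
    "drivers sysext only": ("/usr/bin/" + SMI, "/usr/lib64/" + LIB),
    "dynamic install only": (
        "/opt/bin/" + SMI,
        "/opt/nvidia/current/usr/lib64/" + LIB,
    ),
    "dynamic install + runtime sysext": (
        "/opt/bin/" + SMI,
        "/opt/nvidia/current/usr/lib64/" + LIB,
        "/usr/lib64/" + LIB,
    ),
}

Found = Tuple[Optional[str], Optional[str]]


def first_dir_with(root: str, dirs: Sequence[str], name: str) -> Optional[str]:
    """Return the first search dir below root that holds name as a file."""
    for d in dirs:
        if os.path.isfile(os.path.join(root, d.lstrip("/"), name)):
            return d
    return None


def discover(host: str, driver_root: str = "/") -> Found:
    """Look for the binary and the library relative to the given driver root."""
    root = os.path.join(host, driver_root.lstrip("/"))
    return (first_dir_with(root, BIN_DIRS, SMI),
            first_dir_with(root, LIB_DIRS, LIB))


def survey() -> Dict[str, Found]:
    results: Dict[str, Found] = {}
    for label, files in LAYOUTS.items():
        with tempfile.TemporaryDirectory() as host:
            for rel in files:
                path = os.path.join(host, rel.lstrip("/"))
                os.makedirs(os.path.dirname(path), exist_ok=True)
                open(path, "w").close()
            results[label] = discover(host)
            if label == "dynamic install only":
                # Same host, but with the driver root moved as in item 2.
                results[label + ", root=/opt/nvidia/current"] = discover(
                    host, "/opt/nvidia/current")
    return results


for label, (bin_dir, lib_dir) in survey().items():
    print(f"{label}: {SMI} in {bin_dir}, {LIB} in {lib_dir}")
```

In this model only the first and third setups yield both a binary and a library. The second finds one or the other depending on the driver root, which matches the summary above.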

What does this mean for this PR?

Basically I think it's not a bad idea to have /opt/bin on the search path, but I agree it only fixes an edge case. Also I agree it should not be required, especially for security reasons.

  • People using the Flatcar NVIDIA Drivers sysext (no. 1 in the listing) do not have issues anyway - with or without this change - even though they might waste bandwidth and disk.
  • People using the dynamically installed drivers do not waste bandwidth or disk, but also this change does not help them. They most likely don't get the DRA plugin up and running, no matter what they set the driver root to.
  • People using the dynamically installed drivers AND the Flatcar NVIDIA Runtime sysext are lucky, because my change makes the plugin work for them.

@Gacko
Contributor Author

Gacko commented Nov 4, 2025

Ok, wow... I just wasted like two hours or so. The Flatcar NVIDIA Drivers sysext is not yet available on their stable release channel and I was trying to enable it on... stable.

So ultimately the first option in each of my lists above is not applicable, and therefore this change seems to be the only way to make it work. 😅

asymingt added a commit to asymingt/k8s-dra-driver-gpu that referenced this pull request Nov 14, 2025
commit 55fc7b0
Merge: 5443e0f ef23484
Author: Shiva Krishna Merla <[email protected]>
Date:   Thu Nov 6 16:04:43 2025 -0800

    Merge pull request NVIDIA#668 from varunrsekar/vfio-support-1.33

    Support VFIO passthrough

commit ef23484
Author: Varun Ramachandra Sekar <[email protected]>
Date:   Tue Oct 14 17:29:18 2025 -0700

    vfio passthrough support

    Signed-off-by: Varun Ramachandra Sekar <[email protected]>

    use chroot to run modprobe

    Signed-off-by: Varun Ramachandra Sekar <[email protected]>

    deadvertise sibling devices on preparation

    Signed-off-by: Varun Ramachandra Sekar <[email protected]>

    soft check for VFs before attempting unbind

    Signed-off-by: Varun Ramachandra Sekar <[email protected]>

    address review comments

    Signed-off-by: Varun Ramachandra Sekar <[email protected]>

    address comments (2)

    Signed-off-by: Varun Ramachandra Sekar <[email protected]>

    use fuser to check if gpu is free

    Signed-off-by: Varun Ramachandra Sekar <[email protected]>

    remove unnecessary securityContext

    Signed-off-by: Varun Ramachandra Sekar <[email protected]>

    don't mix vfio and mig devices

    Signed-off-by: Varun Ramachandra Sekar <[email protected]>

commit 5443e0f
Merge: 59d775b 3babfe5
Author: Shiva Krishna Merla <[email protected]>
Date:   Tue Nov 4 12:48:00 2025 -0800

    Merge pull request NVIDIA#711 from shivamerla/add_gpu_stress_tests

    tests: Add separate targets for GPU plugin tests + add stress tests

commit 3babfe5
Author: Shiva Krishna, Merla <[email protected]>
Date:   Tue Nov 4 11:47:01 2025 -0800

    tests: Use BATS_TEST_TMPDIR and failfast on errors during cleanup

    Signed-off-by: Shiva Krishna, Merla <[email protected]>

commit 2b3e70b
Author: Shiva Krishna, Merla <[email protected]>
Date:   Tue Nov 4 11:07:19 2025 -0800

    tests: Add separate targets for GPU plugin tests + add stress tests

    * Add separate make targets to run GPU and CD specific tests
    * Add a stress test for GPU allocation
    * Refactor Makefile to share common docker setup between targets

    Signed-off-by: Shiva Krishna, Merla <[email protected]>

commit 59d775b
Merge: 852b56f 1e79179
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Mon Nov 3 19:38:02 2025 +0100

    Merge pull request NVIDIA#709 from jgehrcke/jp/basic-gpu-tests

    tests: cover basic GPU allocation, misc improvements

commit 852b56f
Merge: 1ee1b4a e8fa8e6
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Mon Nov 3 19:21:53 2025 +0100

    Merge pull request NVIDIA#706 from Gacko/vkptt

    kubelet plugins: add /opt/bin to binary search paths

commit 1ee1b4a
Merge: f4d11e3 068bb76
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Mon Nov 3 19:10:16 2025 +0100

    Merge pull request NVIDIA#710 from NVIDIA/dependabot/docker/deployments/container/main/nvidia/distroless/cc-v3.2.1-dev

    build(deps): bump nvidia/distroless/cc from v3.2.0-dev to v3.2.1-dev in /deployments/container

commit 1e79179
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Sat Nov 1 11:44:03 2025 -0700

    tests: cover basic GPU allocation

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    misc fixes

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    remove cdi spec removal again

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 068bb76
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Nov 3 17:59:31 2025 +0000

    build(deps): bump nvidia/distroless/cc in /deployments/container

    Bumps nvidia/distroless/cc from v3.2.0-dev to v3.2.1-dev.

    ---
    updated-dependencies:
    - dependency-name: nvidia/distroless/cc
      dependency-version: v3.2.1-dev
      dependency-type: direct:production
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

commit fcd74d1
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Sat Nov 1 11:42:35 2025 -0700

    tests: add nvmm helper

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 977f421
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Sat Nov 1 11:42:10 2025 -0700

    tests: per-user tmp dir (relevant on shared machines)

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 1c2da2c
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Sat Nov 1 11:41:09 2025 -0700

    tests: parallelize per-node state dir cleanup

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit e8fa8e6
Author: Marco Ebert <[email protected]>
Date:   Wed Oct 29 09:52:34 2025 +0100

    kubelet plugins: add /opt/bin to binary search paths

    Signed-off-by: Marco Ebert <[email protected]>

commit f4d11e3
Merge: 89c8258 9b20929
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 29 13:09:19 2025 +0100

    Merge pull request NVIDIA#707 from jgehrcke/jp/version25120

    Increment version to 25.12.0-dev

commit 89c8258
Merge: a772441 de830d3
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 29 13:07:49 2025 +0100

    Merge pull request NVIDIA#703 from NVIDIA/dependabot/docker/deployments/container/main/nvidia/distroless/cc-v3.2.0-dev

    build(deps): bump nvidia/distroless/cc from v3.1.13-dev to v3.2.0-dev in /deployments/container

commit a772441
Merge: 7f591c2 2a2eeec
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 29 13:07:04 2025 +0100

    Merge pull request NVIDIA#705 from NVIDIA/dependabot/go_modules/main/github.com/NVIDIA/nvidia-container-toolkit-1.18.0

    build(deps): bump github.com/NVIDIA/nvidia-container-toolkit from 1.18.0-rc.6 to 1.18.0

commit 9b20929
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 29 12:47:26 2025 +0100

    Increment version to 25.12.0-dev

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 2a2eeec
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Sun Oct 26 17:02:01 2025 +0000

    build(deps): bump github.com/NVIDIA/nvidia-container-toolkit

    Bumps [github.com/NVIDIA/nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) from 1.18.0-rc.6 to 1.18.0.
    - [Release notes](https://github.com/NVIDIA/nvidia-container-toolkit/releases)
    - [Changelog](https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/CHANGELOG.md)
    - [Commits](NVIDIA/nvidia-container-toolkit@v1.18.0-rc.6...v1.18.0)

    ---
    updated-dependencies:
    - dependency-name: github.com/NVIDIA/nvidia-container-toolkit
      dependency-version: 1.18.0
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

commit de830d3
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Fri Oct 24 17:13:23 2025 +0000

    build(deps): bump nvidia/distroless/cc in /deployments/container

    Bumps nvidia/distroless/cc from v3.1.13-dev to v3.2.0-dev.

    ---
    updated-dependencies:
    - dependency-name: nvidia/distroless/cc
      dependency-version: v3.2.0-dev
      dependency-type: direct:production
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

commit 7f591c2
Merge: cfe35ff 70fbda6
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 22 16:56:21 2025 +0200

    Merge pull request NVIDIA#699 from jgehrcke/jp/readme-installation-instruction

    README: refer to external install instructions

commit 70fbda6
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 21 14:26:18 2025 +0200

    README: refer to external install instructions

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit cfe35ff
Merge: 2762688 151c766
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Fri Oct 17 14:56:42 2025 +0200

    Merge pull request NVIDIA#687 from jgehrcke/jp/unbreak-ci

    ci: fix downstream pipeline issues

commit 151c766
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Fri Oct 17 14:26:43 2025 +0200

    ci: bump regctl conservatively

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 7238e5d
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Fri Oct 17 14:26:24 2025 +0200

    ci: rename gl pipeline stages

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 87b7915
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Sep 3 17:04:33 2025 +0200

    ci: push image w/o version prefix

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 24e765d
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Fri Oct 17 12:27:52 2025 +0200

    ci: remove scan-images step

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 2762688
Merge: 1516ec7 784ba18
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 16 20:12:17 2025 +0200

    Merge pull request NVIDIA#685 from jgehrcke/jp/tests-v1-exactly

    tests: construct ResourceClaim differently on v1

commit 784ba18
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 16 17:30:44 2025 +0000

    tests: construct ResourceClaim differently on v1

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 1516ec7
Merge: 38b42bb e14beed
Author: Shiva Krishna Merla <[email protected]>
Date:   Thu Oct 16 10:01:55 2025 -0700

    Merge pull request NVIDIA#682 from shivamerla/fix_attestations

    Ensure attestation parameters are passed only for multi-arch builds using buildx.

commit 38b42bb
Merge: 0d83254 6cef363
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 16 11:27:20 2025 +0200

    Merge pull request NVIDIA#679 from jgehrcke/jp/tests-split-into-modules-add-failover

    tests: split into modules, add CD failover coverage

commit 6cef363
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 16 08:53:23 2025 +0000

    tests: explicit log on launcher container start, misc

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit db70cd7
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 15 15:27:53 2025 +0000

    tests: add test_cd_failover.bats and support

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 38036ac
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 15 15:16:21 2025 +0000

    tests: split tests.bats into modules

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit e14beed
Author: Shiva Krishna, Merla <[email protected]>
Date:   Wed Oct 15 11:52:42 2025 -0700

    Ensure attestation parameters are passed only for multi-arch builds using buildx.

    Signed-off-by: Shiva Krishna, Merla <[email protected]>

commit 0d83254
Merge: 65cd2c5 f8ace2e
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 15 18:06:27 2025 +0200

    Merge pull request NVIDIA#676 from jgehrcke/jp/curl-retry-tcp-rst

    build: retry TCP RST when curling bash source

commit 65cd2c5
Merge: b3f4e07 c40b44b
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 15 16:59:35 2025 +0200

    Merge pull request NVIDIA#677 from jgehrcke/jp/test-abort-on-failure

    tests: abort suite on first failure, misc

commit c40b44b
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 15 11:24:06 2025 +0000

    tests: adjust readme

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 6e783bf
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 15 10:57:25 2025 +0000

    tests: rundir in /tmp (too much cruft in home dir)

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit dafa4f5
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 15 11:15:32 2025 +0000

    tests: merge two simple tests into one

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit c14c2ef
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 15 11:14:09 2025 +0000

    tests: add on_failure hook to emit debug info

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 89bb88a
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 15 11:12:22 2025 +0000

    tests: use new --abort flag for bats (fail suite fast)

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit f8ace2e
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 15 10:42:05 2025 +0000

    build: retry TCP RST when curling bash source

    Error seen:

    curl: (7) Failed to connect to mirror.cs.odu.edu port 443 after 306 ms: Connection refused

    By default, a TCP connection rejection (RST) is not treated
    by curl as a transient error, see

    https://curl.se/docs/manpage.html#--retry-connrefused

    It's a transient error in the sense that it's often
    a way to implement backpressure. We retry at a slow rate.

    `--retry-all-errors` is what we want here, it includes
    `--retry-connrefused`.

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit b3f4e07
Merge: ab5a2b3 4e5cdf2
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 15 12:31:08 2025 +0200

    Merge pull request NVIDIA#669 from NVIDIA/dependabot/go_modules/main/google.golang.org/grpc-1.76.0

    build(deps): bump google.golang.org/grpc from 1.75.1 to 1.76.0

commit ab5a2b3
Merge: 23ccbd2 803a35a
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 14 19:56:45 2025 +0200

    Merge pull request NVIDIA#675 from NVIDIA/dependabot/docker/deployments/devel/main/golang-1.25.3

    build(deps): bump golang from 1.25.2 to 1.25.3 in /deployments/devel

commit 803a35a
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Tue Oct 14 17:16:27 2025 +0000

    build(deps): bump golang from 1.25.2 to 1.25.3 in /deployments/devel

    Bumps golang from 1.25.2 to 1.25.3.

    ---
    updated-dependencies:
    - dependency-name: golang
      dependency-version: 1.25.3
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

commit 23ccbd2
Merge: 83b8249 9d02cea
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 14 17:30:58 2025 +0200

    Merge pull request NVIDIA#672 from jgehrcke/jp/periodic-cleanup-partially-prepared-rcs

    CD kubelet plugin: add state reconciliation for partially prepared claims

commit 9d02cea
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Mon Oct 13 13:03:11 2025 +0000

    tests: cover cleanup for stale partially prepared claims

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit f7a3310
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Sun Oct 12 22:06:38 2025 +0000

    CD plugin: handle stale partially prepared claims

    Add a fundamentally required state reconciliation:

    Periodically, perform a self-initiated Unprepare() of previously
    partially prepared claims.

    Perform periodically:

    - Read checkpoint
    - Iterate through RCs in PrepareStarted state
    - For each: RC still known in API server?
        If not:
          1) initiate an Unprepare
          2) Remove from checkpoint file if the Unprepare was successful

    Relevance:

    Unpreparing any partially performed claim preparation might revert
    a state mutation that would otherwise be permanently inconsistent with
    API server state (e.g., this could remove a node label).

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 83b8249
Merge: 5235bed e22cdba
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 14 15:11:35 2025 +0200

    Merge pull request NVIDIA#674 from jgehrcke/jp/use-custom-config-dir-for-daemon

    CD daemon: /imexd instead of /etc/nvidia-imex

commit e22cdba
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 14 07:15:09 2025 +0000

    CD daemon: /imexd instead of /etc/nvidia-imex

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 5235bed
Merge: 7b5e2cd aa15924
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 14 12:50:05 2025 +0200

    Merge pull request NVIDIA#658 from jgehrcke/jp/log-full-component-config-on-startup

    Log full startup config in all CLIs in `Before` hook

commit aa15924
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Sat Oct 11 21:09:36 2025 +0000

    tests: confirm startup config logged on lvl 0

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit e2ea590
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Mon Sep 29 13:24:00 2025 +0000

    Introduce LogStartupConfig(), use in all CLIs in Before() hook

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 4e5cdf2
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Oct 13 08:53:56 2025 +0000

    build(deps): bump google.golang.org/grpc from 1.75.1 to 1.76.0

    Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.75.1 to 1.76.0.
    - [Release notes](https://github.com/grpc/grpc-go/releases)
    - [Commits](grpc/grpc-go@v1.75.1...v1.76.0)

    ---
    updated-dependencies:
    - dependency-name: google.golang.org/grpc
      dependency-version: 1.76.0
      dependency-type: direct:production
      update-type: version-update:semver-minor
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

commit 7b5e2cd
Merge: a1d2fd7 11f6c02
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Mon Oct 13 10:37:08 2025 +0200

    Merge pull request NVIDIA#670 from NVIDIA/dependabot/go_modules/main/golang.org/x/time-0.14.0

    build(deps): bump golang.org/x/time from 0.9.0 to 0.14.0

commit a1d2fd7
Merge: c614e61 6b2af09
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Mon Oct 13 10:32:50 2025 +0200

    Merge pull request NVIDIA#671 from NVIDIA/dependabot/go_modules/main/github.com/NVIDIA/nvidia-container-toolkit-1.18.0-rc.6

    build(deps): bump github.com/NVIDIA/nvidia-container-toolkit from 1.18.0-rc.5 to 1.18.0-rc.6

commit 6b2af09
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Sun Oct 12 17:02:23 2025 +0000

    build(deps): bump github.com/NVIDIA/nvidia-container-toolkit

    Bumps [github.com/NVIDIA/nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) from 1.18.0-rc.5 to 1.18.0-rc.6.
    - [Release notes](https://github.com/NVIDIA/nvidia-container-toolkit/releases)
    - [Changelog](https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/CHANGELOG.md)
    - [Commits](NVIDIA/nvidia-container-toolkit@v1.18.0-rc.5...v1.18.0-rc.6)

    ---
    updated-dependencies:
    - dependency-name: github.com/NVIDIA/nvidia-container-toolkit
      dependency-version: 1.18.0-rc.6
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

commit 11f6c02
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Sun Oct 12 17:02:18 2025 +0000

    build(deps): bump golang.org/x/time from 0.9.0 to 0.14.0

    Bumps [golang.org/x/time](https://github.com/golang/time) from 0.9.0 to 0.14.0.
    - [Commits](golang/time@v0.9.0...v0.14.0)

    ---
    updated-dependencies:
    - dependency-name: golang.org/x/time
      dependency-version: 0.14.0
      dependency-type: direct:production
      update-type: version-update:semver-minor
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

commit c614e61
Merge: a79a9fd 4ced422
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Sat Oct 11 16:54:20 2025 +0200

    Merge pull request NVIDIA#633 from jgehrcke/jp/verbosity-vs-debuggability-improvements

    Add `logVerbosity` Helm chart parameter, reduce default log verbosity

commit 4ced422
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Sat Oct 11 14:45:52 2025 +0000

    Remove newline, document env-based log verb flip

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 4cf3d9b
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Fri Oct 10 17:57:00 2025 +0000

    Fix a typo in an error message

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 2c943f7
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Sat Oct 11 13:48:35 2025 +0000

    tests: remove sinful duplicate env strategy

    This also had a side effect on subsequent tests, with the
    controller starting with _no_ LOG_VERBOSITY environment variable
    set. I don't understand that, but that must be a funky Helm-ism.

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 3d5c51f
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Sat Oct 11 13:01:21 2025 +0000

    tests: fix: wait for controller flip

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit b172342
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Sat Oct 11 12:21:07 2025 +0000

    tests: replace hard-coded sleep with dynamic wait

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 9748095
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Fri Oct 10 19:36:24 2025 +0000

    tests: cover CD daemon log levels

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 4767092
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 9 11:38:40 2025 +0000

    Helm logVerbosity param: add docs, start building tests

    Helm values.yaml: defaultLogVerbosity incl. docs

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    values.yaml: tweak, based on log level insights

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    improve helm chart artifact commentary

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    squash: tweak docs

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    Rename chart var, start building tests

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    tests: cover log verbosity set per-component via env

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    helm: rename defaultLogVerbosity to logVerbosity

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 3828da9
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Fri Oct 10 18:28:50 2025 +0000

    CD daemon: change verbosity of "wait for nodes update" message

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 6d35ac1
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Fri Oct 10 17:56:31 2025 +0000

    CD controller: make CD daemon verbosity a required arg

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 84530ab
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 9 14:07:33 2025 +0000

    CD controller: log manager config on startup

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit bb16c33
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 9 11:31:36 2025 +0000

    CD controller/plugins/daemon: introduce LOG_VERBOSITY

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 7e89b22
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 9 11:29:46 2025 +0000

    CD controller: introduce LOG_VERBOSITY_CD_DAEMON

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit c5b147b
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 9 14:10:33 2025 +0000

    tests: add note about instability around chart flip

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 4cc705a
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 9 11:46:10 2025 +0000

    Helm: expose kubelet plugin env via chart variables

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 5f143b2
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 9 15:53:07 2025 +0000

    Upper-case log msg, no explicit verb 0

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 8321983
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Sep 30 12:16:17 2025 +0000

    Change log message levels according to new system

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit a36e214
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Sep 30 12:12:38 2025 +0000

    Add logVerbosity Helm chart parameter

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit a79a9fd
Merge: 3903df7 6e56823
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Sat Oct 11 13:43:17 2025 +0200

    Merge pull request NVIDIA#646 from jgehrcke/jp/no-clique-update-cd-node-status

    Release workload on a non-MNNVL node in a CD

commit 6e56823
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Fri Oct 10 19:47:48 2025 +0000

    CD plugin: move CDI edit gen into computeDomainDaemonSettings

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    make diff smaller, rename func

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit f7e4a45
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Fri Oct 10 16:30:27 2025 +0000

    CD daemon: always mount in IMEX daemon config files

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    CD plugin: always prepare IMEX config on the host and mount it in

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit c040429
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Fri Oct 10 15:59:07 2025 +0000

    Fix typos in comments and log message

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit deccb4d
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 7 11:39:33 2025 +0000

    CD plugin: always inject CD details via CDI

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    Rename 'domain' to 'domainID'

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    squash: review feedback

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    shorten comment

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 023e7f9
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 7 11:38:23 2025 +0000

    Enrich error message with CD detail when CD not found

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 32180ad
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 7 11:37:43 2025 +0000

    CD daemon: unconditionally write IMEX daemon config

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    Break out of select/case, MkdirAll() before writing file

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 13df4da
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 7 09:58:50 2025 +0000

    CD daemon: init node status as NotReady, misc log msg & comment tweaks

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 3cbd5a4
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 7 09:55:47 2025 +0000

    CD daemon: keep business logic in no-IMEX-daemon noop mode

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit fffcea2
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 7 09:50:30 2025 +0000

    Introduce maxNodesPerIMEXDomain special case for empty cliqueID

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit e0b8990
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 7 09:49:07 2025 +0000

    Update code comments

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 3903df7
Merge: 14dc9fe 72e39e9
Author: Kevin Klues <[email protected]>
Date:   Fri Oct 10 13:22:31 2025 +0200

    Merge pull request NVIDIA#661 from jgehrcke/jp/flush-logs-on-shutdown

    Flush logs in CLI app `After` hook

commit 14dc9fe
Merge: 8788dd1 d34a12f
Author: Kevin Klues <[email protected]>
Date:   Fri Oct 10 13:16:53 2025 +0200

    Merge pull request NVIDIA#656 from jgehrcke/jp/custom-rate-limiting

    Introduce DefaultPrepUnprepRateLimiter (less aggressive)

commit 8788dd1
Merge: 23d205f 0770c0a
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Fri Oct 10 12:43:33 2025 +0200

    Merge pull request NVIDIA#666 from klueska/rbac-update

    Separate controller and kubeletplugin into separate RBAC permissions

commit 0770c0a
Author: Kevin Klues <[email protected]>
Date:   Thu Oct 9 13:41:03 2025 +0000

    Separate controller and kubeletplugin into separate RBAC permissions

    Signed-off-by: Kevin Klues <[email protected]>

commit 23d205f
Merge: fca1c08 816c7a1
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 9 10:01:28 2025 +0200

    Merge pull request NVIDIA#664 from NVIDIA/dependabot/docker/deployments/container/main/nvidia/distroless/cc-v3.1.13-dev

    build(deps): bump nvidia/distroless/cc from v3.1.12-dev to v3.1.13-dev in /deployments/container

commit fca1c08
Merge: e089759 b15d633
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 9 09:56:21 2025 +0200

    Merge pull request NVIDIA#665 from NVIDIA/dependabot/docker/deployments/devel/main/golang-1.25.2

    build(deps): bump golang from 1.25.1 to 1.25.2 in /deployments/devel

commit b15d633
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Wed Oct 8 17:15:07 2025 +0000

    build(deps): bump golang from 1.25.1 to 1.25.2 in /deployments/devel

    Bumps golang from 1.25.1 to 1.25.2.

    ---
    updated-dependencies:
    - dependency-name: golang
      dependency-version: 1.25.2
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

commit 816c7a1
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Wed Oct 8 17:15:03 2025 +0000

    build(deps): bump nvidia/distroless/cc in /deployments/container

    Bumps nvidia/distroless/cc from v3.1.12-dev to v3.1.13-dev.

    ---
    updated-dependencies:
    - dependency-name: nvidia/distroless/cc
      dependency-version: v3.1.13-dev
      dependency-type: direct:production
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

commit 72e39e9
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Sep 30 09:42:57 2025 +0000

    Flush logs in CLI app `After` hook

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit d34a12f
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 8 15:18:53 2025 +0200

    Adjust go.mod to recent changes

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 7e18c33
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Sep 30 12:18:41 2025 +0000

    Introduce DefaultPrepUnprepRateLimiter (less aggressive)

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit e089759
Merge: 765892d e9f647e
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 8 13:09:32 2025 +0200

    Merge pull request NVIDIA#651 from jgehrcke/jp/issue-694

    CD daemon: coordinate CD updates on shutdown via mutation cache

commit e9f647e
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 7 17:55:26 2025 +0000

    tests: cover CD daemon cleanup-on-shutdown

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 980a6a1
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 7 17:06:42 2025 +0000

    CD daemon: pod mngr: store UpdateStatus return value in mutation cache

    This makes sure that fast incremental mutations on
    the same CD object performed during shutdown are done
    conflict-free (i.e., in actual, incremental fashion
    using intermediate state returned by the API server).

    Without this patch:

    I1007 16:49:01.678050       1 podmanager.go:196] Successfully updated node gb-nvl-043-compute06 status to NotReady
    E1007 16:49:01.681345       1 computedomain.go:161] Failed to remove node from ComputeDomain during shutdown: [...] \
    				"the object has been modified" [...]

    With this patch:

    I1007 16:59:55.350436       1 podmanager.go:200] Successfully updated node gb-nvl-043-compute07 status to NotReady
    I1007 16:59:55.353551       1 computedomain.go:402] Successfully removed node with IP 192.168.34.153 from ComputeDomain default/imex-channel-injection

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 4b91fce
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 7 15:50:06 2025 +0000

    CD daemon: coordinate CD updates on shutdown via mutationcache

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 765892d
Merge: 2b7e899 754a758
Author: Kevin Klues <[email protected]>
Date:   Wed Oct 8 09:52:51 2025 +0200

    Merge pull request NVIDIA#650 from NVIDIA/dependabot/github_actions/github/codeql-action-4

    build(deps): bump github/codeql-action from 3 to 4

commit 754a758
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Tue Oct 7 17:08:56 2025 +0000

    build(deps): bump github/codeql-action from 3 to 4

    Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3 to 4.
    - [Release notes](https://github.com/github/codeql-action/releases)
    - [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
    - [Commits](github/codeql-action@v3...v4)

    ---
    updated-dependencies:
    - dependency-name: github/codeql-action
      dependency-version: '4'
      dependency-type: direct:production
      update-type: version-update:semver-major
    ...

    Signed-off-by: dependabot[bot] <[email protected]>