Skip to content

Conversation

@jgehrcke
Copy link
Collaborator

@jgehrcke jgehrcke commented Nov 3, 2025

As I work on dynamic MIG device allocation support, I need to rely on tests to not break what currently works. So, I started covering basic GPU allocation functionality in the test suite.

This patch also contains changes for test environment management, cleanup helper adjustments, etc.

Tests are passing (the section test_gpu_basic.bats below is new):

test_basics.bats
 ✓ test VERSION_W_COMMIT, VERSION_GHCR_CHART, VERSION [195]
 ✓ confirm no kubelet plugin pods running [176]
 ✓ helm-install deployments/helm/nvidia-dra-driver-gpu/25.12.0-dev [6809]
 ✓ helm list: validate output [229]
 ✓ get crd computedomains.resource.nvidia.com [156]
 ✓ wait for plugin & controller pods READY [752]
 ✓ validate CD controller container image spec [167]
test_gpu_basic.bats
 ✓ 1 pod(s), 1 full GPU [4297]
 ✓ 2 pod(s), 1 full GPU each [5180]
 ✓ 2 pod(s), 1 full GPU (shared, 1 RC) [5097]
 ✓ 1 pod(s), 2 cntrs, 1 full GPU (shared, 1 RCT) [4559]
test_cd_imex_chan_inject.bats
 ✓ IMEX channel injection (single) [13210]
 ✓ IMEX channel injection (all) [9381]
test_cd_mnnvl_workload.bats
 ✓ nickelpie (NCCL send/recv/broadcast, 2 pods, 2 nodes, small payload) [10567]
 ✓ nvbandwidth (2 nodes, 2 GPUs each) [26027]
test_cd_misc.bats
 ✓ CD daemon shutdown: confirm CD status cleanup [13226]
 ✓ reject unknown field in opaque cfg in CD chan ResourceClaim [9341]
 ✓ self-initiated unprepare of stale RCs in PrepareStarted [24027]
test_cd_logging.bats
 ✓ CD controller/plugin: startup config / detail in logs on level 0 [5993]
 ✓ CD controller: test log verbosity levels [52444]
 ✓ CD daemon: test log verbosity levels [30684]
test_cd_failover.bats
 ✓ CD failover nvb2: force-delete worker pod 0 [60367]
 ✓ CD failover nvb2: force-delete all IMEX daemons [39743]
 ✓ CD failover nvb2: regular-delete worker pod 1 [53460]
test_cd_updowngrade.bats
 ✓ downgrade: current-dev -> last-stable [40410]
 ✓ upgrade: wipe-state, install-last-stable, upgrade-to-current-dev [32076]

26 tests, 0 failures in 473 seconds

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
@jgehrcke jgehrcke added this to the v25.12.0 milestone Nov 3, 2025
@jgehrcke jgehrcke self-assigned this Nov 3, 2025
@jgehrcke jgehrcke added the ci/testing issue/PR related to CI and/or testing label Nov 3, 2025
.PHONY: tests
tests: image
export _RUNDIR_PARENT=/tmp/k8s-dra-driver-gpu-tests-out && \
export _RUNDIR_PARENT=/tmp/k8s-dra-driver-gpu-tests-out-$${USER} && \
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes negative interaction between test suite users on a shared user machine.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(removed from this PR)

@jgehrcke jgehrcke moved this from Backlog to In Progress in Planning Board: k8s-dra-driver-gpu Nov 3, 2025
"name": "host-root"
}],
"command": ["/bin/sh", "-c", "rm -rfv /host/var/lib/kubelet/plugins/*"]
"command": ["/bin/sh", "-c", "rm -rfv /host/run/cdi/* ; rm -rfv /host/var/lib/kubelet/plugins/*"]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤓 this removes also CDI devices defined by k8s.device-plugin.nvidia.com -- slightly too aggressive for those tests that still rely on requesting GPUs via the traditional method. I will refine that wildcard to leave the device plugin's spec file(s) alone.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

misc fixes

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

remove cdi spec removal again

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
@@ -0,0 +1,51 @@
# Two pods, one container each, each container gets 1 distinct full GPU
Copy link
Collaborator Author

@jgehrcke jgehrcke Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -0,0 +1,39 @@
# One pod, two containers.
# Each asking for shared access to a single, full GPU.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -0,0 +1,51 @@
# Simple GPU sharing scenario:
# Two pods, one container each, each getting access to the same, shared GPU
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


# Run cmd in nvidia-mig-manager pod because that one has highest privileges. I
# use this for example to run `nvcnt gb-nvl-027-compute06 nvidia-smi`
nvmm() {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not used in this PR, but I use this now ~routinely in local dev / testing; and I can see that we will use it in the test suite, too.

@jgehrcke
Copy link
Collaborator Author

jgehrcke commented Nov 3, 2025

Landing this now to keep my local branch management a little simpler (rebasing my local dev state on top of this after merge).

@jgehrcke jgehrcke merged commit 59d775b into NVIDIA:main Nov 3, 2025
14 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Closed in Planning Board: k8s-dra-driver-gpu Nov 3, 2025
asymingt added a commit to asymingt/k8s-dra-driver-gpu that referenced this pull request Nov 14, 2025
commit 55fc7b0
Merge: 5443e0f ef23484
Author: Shiva Krishna Merla <[email protected]>
Date:   Thu Nov 6 16:04:43 2025 -0800

    Merge pull request NVIDIA#668 from varunrsekar/vfio-support-1.33

    Support VFIO passthrough

commit ef23484
Author: Varun Ramachandra Sekar <[email protected]>
Date:   Tue Oct 14 17:29:18 2025 -0700

    vfio passthrough support

    Signed-off-by: Varun Ramachandra Sekar <[email protected]>

    use chroot to run modprobe

    Signed-off-by: Varun Ramachandra Sekar <[email protected]>

    deadvertise sibling devices on preparation

    Signed-off-by: Varun Ramachandra Sekar <[email protected]>

    soft check for VFs before attempting unbind

    Signed-off-by: Varun Ramachandra Sekar <[email protected]>

    address review comments

    Signed-off-by: Varun Ramachandra Sekar <[email protected]>

    address comments (2)

    Signed-off-by: Varun Ramachandra Sekar <[email protected]>

    use fuser to check if gpu is free

    Signed-off-by: Varun Ramachandra Sekar <[email protected]>

    remove unnecessary securityContext

    Signed-off-by: Varun Ramachandra Sekar <[email protected]>

    don't mix vfio and mig devices

    Signed-off-by: Varun Ramachandra Sekar <[email protected]>

commit 5443e0f
Merge: 59d775b 3babfe5
Author: Shiva Krishna Merla <[email protected]>
Date:   Tue Nov 4 12:48:00 2025 -0800

    Merge pull request NVIDIA#711 from shivamerla/add_gpu_stress_tests

    tests: Add separate targets for GPU plugin tests + add stress tests

commit 3babfe5
Author: Shiva Krishna, Merla <[email protected]>
Date:   Tue Nov 4 11:47:01 2025 -0800

    tests: Use BATS_TEST_TMPDIR and failfast on errors during cleanup

    Signed-off-by: Shiva Krishna, Merla <[email protected]>

commit 2b3e70b
Author: Shiva Krishna, Merla <[email protected]>
Date:   Tue Nov 4 11:07:19 2025 -0800

    tests: Add separate targets for GPU plugin tests + add stress tests

    * Add separate make targets to run GPU and CD specific tests
    * Add a stress test for GPU allocation
    * Refactor Makefile to share common docker setup between targets

    Signed-off-by: Shiva Krishna, Merla <[email protected]>

commit 59d775b
Merge: 852b56f 1e79179
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Mon Nov 3 19:38:02 2025 +0100

    Merge pull request NVIDIA#709 from jgehrcke/jp/basic-gpu-tests

    tests: cover basic GPU allocation, misc improvements

commit 852b56f
Merge: 1ee1b4a e8fa8e6
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Mon Nov 3 19:21:53 2025 +0100

    Merge pull request NVIDIA#706 from Gacko/vkptt

    kubelet plugins: add /opt/bin to binary search paths

commit 1ee1b4a
Merge: f4d11e3 068bb76
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Mon Nov 3 19:10:16 2025 +0100

    Merge pull request NVIDIA#710 from NVIDIA/dependabot/docker/deployments/container/main/nvidia/distroless/cc-v3.2.1-dev

    build(deps): bump nvidia/distroless/cc from v3.2.0-dev to v3.2.1-dev in /deployments/container

commit 1e79179
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Sat Nov 1 11:44:03 2025 -0700

    tests: cover basic GPU allocation

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    misc fixes

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    remove cdi spec removal again

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 068bb76
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Nov 3 17:59:31 2025 +0000

    build(deps): bump nvidia/distroless/cc in /deployments/container

    Bumps nvidia/distroless/cc from v3.2.0-dev to v3.2.1-dev.

    ---
    updated-dependencies:
    - dependency-name: nvidia/distroless/cc
      dependency-version: v3.2.1-dev
      dependency-type: direct:production
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

commit fcd74d1
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Sat Nov 1 11:42:35 2025 -0700

    tests: add nvmm helper

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 977f421
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Sat Nov 1 11:42:10 2025 -0700

    tests: per-user tmp dir (relevant on shared machines)

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 1c2da2c
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Sat Nov 1 11:41:09 2025 -0700

    tests: parallelize per-node state dir cleanup

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit e8fa8e6
Author: Marco Ebert <[email protected]>
Date:   Wed Oct 29 09:52:34 2025 +0100

    kubelet plugins: add /opt/bin to binary search paths

    Signed-off-by: Marco Ebert <[email protected]>

commit f4d11e3
Merge: 89c8258 9b20929
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 29 13:09:19 2025 +0100

    Merge pull request NVIDIA#707 from jgehrcke/jp/version25120

    Increment version to 25.12.0-dev

commit 89c8258
Merge: a772441 de830d3
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 29 13:07:49 2025 +0100

    Merge pull request NVIDIA#703 from NVIDIA/dependabot/docker/deployments/container/main/nvidia/distroless/cc-v3.2.0-dev

    build(deps): bump nvidia/distroless/cc from v3.1.13-dev to v3.2.0-dev in /deployments/container

commit a772441
Merge: 7f591c2 2a2eeec
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 29 13:07:04 2025 +0100

    Merge pull request NVIDIA#705 from NVIDIA/dependabot/go_modules/main/github.com/NVIDIA/nvidia-container-toolkit-1.18.0

    build(deps): bump github.com/NVIDIA/nvidia-container-toolkit from 1.18.0-rc.6 to 1.18.0

commit 9b20929
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 29 12:47:26 2025 +0100

    Increment version to 25.12.0-dev

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 2a2eeec
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Sun Oct 26 17:02:01 2025 +0000

    build(deps): bump github.com/NVIDIA/nvidia-container-toolkit

    Bumps [github.com/NVIDIA/nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) from 1.18.0-rc.6 to 1.18.0.
    - [Release notes](https://github.com/NVIDIA/nvidia-container-toolkit/releases)
    - [Changelog](https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/CHANGELOG.md)
    - [Commits](NVIDIA/nvidia-container-toolkit@v1.18.0-rc.6...v1.18.0)

    ---
    updated-dependencies:
    - dependency-name: github.com/NVIDIA/nvidia-container-toolkit
      dependency-version: 1.18.0
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

commit de830d3
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Fri Oct 24 17:13:23 2025 +0000

    build(deps): bump nvidia/distroless/cc in /deployments/container

    Bumps nvidia/distroless/cc from v3.1.13-dev to v3.2.0-dev.

    ---
    updated-dependencies:
    - dependency-name: nvidia/distroless/cc
      dependency-version: v3.2.0-dev
      dependency-type: direct:production
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

commit 7f591c2
Merge: cfe35ff 70fbda6
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 22 16:56:21 2025 +0200

    Merge pull request NVIDIA#699 from jgehrcke/jp/readme-installation-instruction

    README: refer to external install instructions

commit 70fbda6
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 21 14:26:18 2025 +0200

    README: refer to external install instructions

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit cfe35ff
Merge: 2762688 151c766
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Fri Oct 17 14:56:42 2025 +0200

    Merge pull request NVIDIA#687 from jgehrcke/jp/unbreak-ci

    ci: fix downstream pipeline issues

commit 151c766
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Fri Oct 17 14:26:43 2025 +0200

    ci: bump regctl conservatively

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 7238e5d
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Fri Oct 17 14:26:24 2025 +0200

    ci: rename gl pipeline stages

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 87b7915
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Sep 3 17:04:33 2025 +0200

    ci: push image w/o version prefix

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 24e765d
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Fri Oct 17 12:27:52 2025 +0200

    ci: remove scan-images step

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 2762688
Merge: 1516ec7 784ba18
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 16 20:12:17 2025 +0200

    Merge pull request NVIDIA#685 from jgehrcke/jp/tests-v1-exactly

    tests: construct ResourceClaim differently on v1

commit 784ba18
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 16 17:30:44 2025 +0000

    tests: construct ResourceClaim differently on v1

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 1516ec7
Merge: 38b42bb e14beed
Author: Shiva Krishna Merla <[email protected]>
Date:   Thu Oct 16 10:01:55 2025 -0700

    Merge pull request NVIDIA#682 from shivamerla/fix_attestations

    Ensure attestation parameters are passed only for multi-arch builds using buildx.

commit 38b42bb
Merge: 0d83254 6cef363
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 16 11:27:20 2025 +0200

    Merge pull request NVIDIA#679 from jgehrcke/jp/tests-split-into-modules-add-failover

    tests: split into modules, add CD failover coverage

commit 6cef363
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 16 08:53:23 2025 +0000

    tests: explicit log on launcher container start, misc

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit db70cd7
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 15 15:27:53 2025 +0000

    tests: add test_cd_failover.bats and support

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 38036ac
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 15 15:16:21 2025 +0000

    tests: split tests.bats into modules

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit e14beed
Author: Shiva Krishna, Merla <[email protected]>
Date:   Wed Oct 15 11:52:42 2025 -0700

    Ensure attestation parameters are passed only for multi-arch builds using buildx.

    Signed-off-by: Shiva Krishna, Merla <[email protected]>

commit 0d83254
Merge: 65cd2c5 f8ace2e
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 15 18:06:27 2025 +0200

    Merge pull request NVIDIA#676 from jgehrcke/jp/curl-retry-tcp-rst

    build: retry TCP RST when curling bash source

commit 65cd2c5
Merge: b3f4e07 c40b44b
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 15 16:59:35 2025 +0200

    Merge pull request NVIDIA#677 from jgehrcke/jp/test-abort-on-failure

    tests: abort suite on first failure, misc

commit c40b44b
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 15 11:24:06 2025 +0000

    tests: adjust readme

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 6e783bf
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 15 10:57:25 2025 +0000

    tests: rundir in /tmp (too much cruft in home dir)

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit dafa4f5
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 15 11:15:32 2025 +0000

    tests: merge two simple tests into one

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit c14c2ef
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 15 11:14:09 2025 +0000

    tests: add on_failure hook to emit debug info

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 89bb88a
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 15 11:12:22 2025 +0000

    tests: use new --abort flag for bats (fail suite fast)

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit f8ace2e
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 15 10:42:05 2025 +0000

    build: retry TCP RST when curling bash source

    Error seen:

    curl: (7) Failed to connect to mirror.cs.odu.edu port 443 after 306 ms: Connection refused

    By default, a TCP connection rejection (RST) is not treated
    by curl as a transient error, see

    https://curl.se/docs/manpage.html#--retry-connrefused

    It's a transient error in the sense that it's often
    a way to implement backpressure. We retry at slow rate.

    `--retry-all-errors` is what we want here, it includes
    `--retry-connrefused`.

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit b3f4e07
Merge: ab5a2b3 4e5cdf2
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 15 12:31:08 2025 +0200

    Merge pull request NVIDIA#669 from NVIDIA/dependabot/go_modules/main/google.golang.org/grpc-1.76.0

    build(deps): bump google.golang.org/grpc from 1.75.1 to 1.76.0

commit ab5a2b3
Merge: 23ccbd2 803a35a
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 14 19:56:45 2025 +0200

    Merge pull request NVIDIA#675 from NVIDIA/dependabot/docker/deployments/devel/main/golang-1.25.3

    build(deps): bump golang from 1.25.2 to 1.25.3 in /deployments/devel

commit 803a35a
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Tue Oct 14 17:16:27 2025 +0000

    build(deps): bump golang from 1.25.2 to 1.25.3 in /deployments/devel

    Bumps golang from 1.25.2 to 1.25.3.

    ---
    updated-dependencies:
    - dependency-name: golang
      dependency-version: 1.25.3
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

commit 23ccbd2
Merge: 83b8249 9d02cea
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 14 17:30:58 2025 +0200

    Merge pull request NVIDIA#672 from jgehrcke/jp/periodic-cleanup-partially-prepared-rcs

    CD kubelet plugin: add state reconciliation for partially prepared claims

commit 9d02cea
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Mon Oct 13 13:03:11 2025 +0000

    tests: cover cleanup for stale partially prepared claims

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit f7a3310
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Sun Oct 12 22:06:38 2025 +0000

    CD plugin: handle stale partially prepared claims

    Add a fundamentally required state reconciliation:

    Periodically, perform a self-initiated Unprepare() of previously
    partially prepared claims.

    Perform periodically:

    - Read checkpoint
    - Iterate through RCs in PrepareStarted state
    - For each: RC still known in API server?
        If not:
          1) initiate an Unprepare
          2) Remove from checkpoint file if unprepr was successful

    Relevance:

    Unpreparing any partially performed claim preparation might revert
    a state mutation that would otherwise be permanently inconsistent with
    API server state (e.g., this could remove a node label).

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 83b8249
Merge: 5235bed e22cdba
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 14 15:11:35 2025 +0200

    Merge pull request NVIDIA#674 from jgehrcke/jp/use-custom-config-dir-for-daemon

    CD daemon: /imexd instead of /etc/nvidia-imex

commit e22cdba
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 14 07:15:09 2025 +0000

    CD daemon: /imexd instead of /etc/nvidia-imex

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 5235bed
Merge: 7b5e2cd aa15924
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 14 12:50:05 2025 +0200

    Merge pull request NVIDIA#658 from jgehrcke/jp/log-full-component-config-on-startup

    Log full startup config in all CLIs in `Before` hook

commit aa15924
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Sat Oct 11 21:09:36 2025 +0000

    tests: confirm startup config logged on lvl 0

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit e2ea590
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Mon Sep 29 13:24:00 2025 +0000

    Introduce LogStartupConfig(), use in all CLIs in Before() hook

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 4e5cdf2
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Oct 13 08:53:56 2025 +0000

    build(deps): bump google.golang.org/grpc from 1.75.1 to 1.76.0

    Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.75.1 to 1.76.0.
    - [Release notes](https://github.com/grpc/grpc-go/releases)
    - [Commits](grpc/grpc-go@v1.75.1...v1.76.0)

    ---
    updated-dependencies:
    - dependency-name: google.golang.org/grpc
      dependency-version: 1.76.0
      dependency-type: direct:production
      update-type: version-update:semver-minor
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

commit 7b5e2cd
Merge: a1d2fd7 11f6c02
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Mon Oct 13 10:37:08 2025 +0200

    Merge pull request NVIDIA#670 from NVIDIA/dependabot/go_modules/main/golang.org/x/time-0.14.0

    build(deps): bump golang.org/x/time from 0.9.0 to 0.14.0

commit a1d2fd7
Merge: c614e61 6b2af09
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Mon Oct 13 10:32:50 2025 +0200

    Merge pull request NVIDIA#671 from NVIDIA/dependabot/go_modules/main/github.com/NVIDIA/nvidia-container-toolkit-1.18.0-rc.6

    build(deps): bump github.com/NVIDIA/nvidia-container-toolkit from 1.18.0-rc.5 to 1.18.0-rc.6

commit 6b2af09
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Sun Oct 12 17:02:23 2025 +0000

    build(deps): bump github.com/NVIDIA/nvidia-container-toolkit

    Bumps [github.com/NVIDIA/nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) from 1.18.0-rc.5 to 1.18.0-rc.6.
    - [Release notes](https://github.com/NVIDIA/nvidia-container-toolkit/releases)
    - [Changelog](https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/CHANGELOG.md)
    - [Commits](NVIDIA/nvidia-container-toolkit@v1.18.0-rc.5...v1.18.0-rc.6)

    ---
    updated-dependencies:
    - dependency-name: github.com/NVIDIA/nvidia-container-toolkit
      dependency-version: 1.18.0-rc.6
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

commit 11f6c02
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Sun Oct 12 17:02:18 2025 +0000

    build(deps): bump golang.org/x/time from 0.9.0 to 0.14.0

    Bumps [golang.org/x/time](https://github.com/golang/time) from 0.9.0 to 0.14.0.
    - [Commits](golang/time@v0.9.0...v0.14.0)

    ---
    updated-dependencies:
    - dependency-name: golang.org/x/time
      dependency-version: 0.14.0
      dependency-type: direct:production
      update-type: version-update:semver-minor
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

commit c614e61
Merge: a79a9fd 4ced422
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Sat Oct 11 16:54:20 2025 +0200

    Merge pull request NVIDIA#633 from jgehrcke/jp/verbosity-vs-debuggability-improvements

    Add `logVerbosity` Helm chart parameter, reduce default log verbosity

commit 4ced422
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Sat Oct 11 14:45:52 2025 +0000

    Remove newline, document env-based log verb flip

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 4cf3d9b
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Fri Oct 10 17:57:00 2025 +0000

    Fix a typo in an error message

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 2c943f7
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Sat Oct 11 13:48:35 2025 +0000

    tests: remove sinful duplicate env strategy

    This also had a side effect on subsequent tests, with the
    controller starting with _no_ LOG_VERBOSITY environment variable
    set. I don't understand that, but that must be a funky Helm-ism.

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 3d5c51f
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Sat Oct 11 13:01:21 2025 +0000

    tests: fix: wait for controller flip

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit b172342
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Sat Oct 11 12:21:07 2025 +0000

    tests: replace hard-coded sleep with dynamic wait

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 9748095
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Fri Oct 10 19:36:24 2025 +0000

    tests: cover CD daemon log levels

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 4767092
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 9 11:38:40 2025 +0000

    Helm logVerbosity param: add docs, start building tests

    Helm values.yaml: defaultLogVerbosity incl. docs

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    values.yaml: tweak, based on in log level insights

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    improve helm chart artifact commentary

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    squash: tweak docs

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    Rename chart var, start building tests

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    tests: cover log verbosity set per-component via env

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    helm: rename defaultLogVerbosity to logVerbosity

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 3828da9
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Fri Oct 10 18:28:50 2025 +0000

    CD daemon: change verbosity of "wait for nodes update" message

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 6d35ac1
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Fri Oct 10 17:56:31 2025 +0000

    CD controller: make CD daemon verbosity a required arg

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 84530ab
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 9 14:07:33 2025 +0000

    CD controller: log manager config on startup

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit bb16c33
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 9 11:31:36 2025 +0000

    CD controller/plugins/daemon: introduce LOG_VERBOSITY

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 7e89b22
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 9 11:29:46 2025 +0000

    CD controller: introduce LOG_VERBOSITY_CD_DAEMON

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit c5b147b
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 9 14:10:33 2025 +0000

    tests: add note about instability around chart flip

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 4cc705a
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 9 11:46:10 2025 +0000

    Helm: expose kubelet plugin env via chart variables

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 5f143b2
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 9 15:53:07 2025 +0000

    Upper-case log msg, no explicit verb 0

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 8321983
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Sep 30 12:16:17 2025 +0000

    Change log message levels according to new system

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit a36e214
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Sep 30 12:12:38 2025 +0000

    Add logVerbosity Helm chart parameter

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit a79a9fd
Merge: 3903df7 6e56823
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Sat Oct 11 13:43:17 2025 +0200

    Merge pull request NVIDIA#646 from jgehrcke/jp/no-clique-update-cd-node-status

    Release workload on a non-MNNVL node in a CD

commit 6e56823
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Fri Oct 10 19:47:48 2025 +0000

    CD plugin: move CDI edit gen into computeDomainDaemonSettings

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    make diff smaller, rename func

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit f7e4a45
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Fri Oct 10 16:30:27 2025 +0000

    CD daemon: always mount in IMEX daemon config files

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    CD plugin: always prepare IMEX config on the host and mount it in

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit c040429
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Fri Oct 10 15:59:07 2025 +0000

    Fix typos in comments and log message

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit deccb4d
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 7 11:39:33 2025 +0000

    CD plugin: always inject CD details via CDI

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    Rename 'domain' to 'domainID'

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    squash: review feedback

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    shorten comment

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 023e7f9
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 7 11:38:23 2025 +0000

    Enrich error message with CD detail when CD not found

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 32180ad
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 7 11:37:43 2025 +0000

    CD daemon: unconditionally write IMEX daemon config

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

    Break out of select/case, MkdirAll() before writing file

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 13df4da
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 7 09:58:50 2025 +0000

    CD daemon: init node status as NotReady, misc log msg & comment tweaks

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 3cbd5a4
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 7 09:55:47 2025 +0000

    CD daemon: keep business logic in no-IMEX-daemon noop mode

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit fffcea2
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 7 09:50:30 2025 +0000

    Introduce maxNodesPerIMEXDomain special case for empty cliqueID

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit e0b8990
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 7 09:49:07 2025 +0000

    Update code comments

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 3903df7
Merge: 14dc9fe 72e39e9
Author: Kevin Klues <[email protected]>
Date:   Fri Oct 10 13:22:31 2025 +0200

    Merge pull request NVIDIA#661 from jgehrcke/jp/flush-logs-on-shutdown

    Flush logs in CLI app `After` hook

commit 14dc9fe
Merge: 8788dd1 d34a12f
Author: Kevin Klues <[email protected]>
Date:   Fri Oct 10 13:16:53 2025 +0200

    Merge pull request NVIDIA#656 from jgehrcke/jp/custom-rate-limiting

    Introduce DefaultPrepUnprepRateLimiter (less aggressive)

commit 8788dd1
Merge: 23d205f 0770c0a
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Fri Oct 10 12:43:33 2025 +0200

    Merge pull request NVIDIA#666 from klueska/rbac-update

    Separate controller and kubeletplugin into separate RBAC permissions

commit 0770c0a
Author: Kevin Klues <[email protected]>
Date:   Thu Oct 9 13:41:03 2025 +0000

    Separate controller and kubeletplugin into separate RBAC permissions

    Signed-off-by: Kevin Klues <[email protected]>

commit 23d205f
Merge: fca1c08 816c7a1
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 9 10:01:28 2025 +0200

    Merge pull request NVIDIA#664 from NVIDIA/dependabot/docker/deployments/container/main/nvidia/distroless/cc-v3.1.13-dev

    build(deps): bump nvidia/distroless/cc from v3.1.12-dev to v3.1.13-dev in /deployments/container

commit fca1c08
Merge: e089759 b15d633
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Thu Oct 9 09:56:21 2025 +0200

    Merge pull request NVIDIA#665 from NVIDIA/dependabot/docker/deployments/devel/main/golang-1.25.2

    build(deps): bump golang from 1.25.1 to 1.25.2 in /deployments/devel

commit b15d633
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Wed Oct 8 17:15:07 2025 +0000

    build(deps): bump golang from 1.25.1 to 1.25.2 in /deployments/devel

    Bumps golang from 1.25.1 to 1.25.2.

    ---
    updated-dependencies:
    - dependency-name: golang
      dependency-version: 1.25.2
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

commit 816c7a1
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Wed Oct 8 17:15:03 2025 +0000

    build(deps): bump nvidia/distroless/cc in /deployments/container

    Bumps nvidia/distroless/cc from v3.1.12-dev to v3.1.13-dev.

    ---
    updated-dependencies:
    - dependency-name: nvidia/distroless/cc
      dependency-version: v3.1.13-dev
      dependency-type: direct:production
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

commit 72e39e9
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Sep 30 09:42:57 2025 +0000

    Flush logs in CLI app `After` hook

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit d34a12f
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 8 15:18:53 2025 +0200

    Adjust go.mod to recent changes

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 7e18c33
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Sep 30 12:18:41 2025 +0000

    Introduce DefaultPrepUnprepRateLimiter (less aggressive)

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit e089759
Merge: 765892d e9f647e
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Wed Oct 8 13:09:32 2025 +0200

    Merge pull request NVIDIA#651 from jgehrcke/jp/issue-694

    CD daemon: coordinate CD updates on shutdown via mutation cache

commit e9f647e
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 7 17:55:26 2025 +0000

    tests: cover CD daemon cleanup-on-shutdown

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 980a6a1
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 7 17:06:42 2025 +0000

    CD daemon: pod mngr: store UpdateStatus return value in mutation cache

    This makes sure that fast incremental mutations on
    the same CD object performed during shutdown are done
    conflict-free (i.e., in actual, incremental fashion
    using intermediate state returned by the API server).

    Without this patch:

    I1007 16:49:01.678050       1 podmanager.go:196] Successfully updated node gb-nvl-043-compute06 status to NotReady
    E1007 16:49:01.681345       1 computedomain.go:161] Failed to remove node from ComputeDomain during shutdown: [...] \
    				"the object has been modified" [...]

    With this patch:

    I1007 16:59:55.350436       1 podmanager.go:200] Successfully updated node gb-nvl-043-compute07 status to NotReady
    I1007 16:59:55.353551       1 computedomain.go:402] Successfully removed node with IP 192.168.34.153 from ComputeDomain default/imex-channel-injection

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 4b91fce
Author: Dr. Jan-Philip Gehrcke <[email protected]>
Date:   Tue Oct 7 15:50:06 2025 +0000

    CD daemon: coordinate CD updates on shutdown via mutationcache

    Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

commit 765892d
Merge: 2b7e899 754a758
Author: Kevin Klues <[email protected]>
Date:   Wed Oct 8 09:52:51 2025 +0200

    Merge pull request NVIDIA#650 from NVIDIA/dependabot/github_actions/github/codeql-action-4

    build(deps): bump github/codeql-action from 3 to 4

commit 754a758
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Tue Oct 7 17:08:56 2025 +0000

    build(deps): bump github/codeql-action from 3 to 4

    Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3 to 4.
    - [Release notes](https://github.com/github/codeql-action/releases)
    - [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
    - [Commits](github/codeql-action@v3...v4)

    ---
    updated-dependencies:
    - dependency-name: github/codeql-action
      dependency-version: '4'
      dependency-type: direct:production
      update-type: version-update:semver-major
    ...

    Signed-off-by: dependabot[bot] <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci/testing issue/PR related to CI and/or testing

Projects

Development

Successfully merging this pull request may close these issues.

1 participant