Add ComputeDomain for running multi-node workloads #225

klueska · 2025-01-09T14:42:31Z

No description provided.

Signed-off-by: Kevin Klues <[email protected]>

For now just mark one well-known error as permanent. Future commits will abstract this better and mark more errors as permanent. Signed-off-by: Kevin Klues <[email protected]>

Signed-off-by: Kevin Klues <[email protected]>

ArangoGutierrez · 2025-02-19T16:24:55Z

api/nvidia.com/resource/v1beta1/computedomain.go

+// ComputeDomainSpec provides the spec for a ComputeDomain.
+type ComputeDomainSpec struct {
+	NumNodes int                       `json:"numNodes"`
+	Channel  *ComputeDomainChannelSpec `json:"channel"`


Should `channel` be optional, thinking of non-IMEX use cases in the future? I know we are currently focused solely on IMEX support, but if we want to carry the ComputeDomain concept forward, we might face clusters without IMEX (channels).
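A minimal sketch of what an optional `channel` could look like at the JSON level, assuming the field stays a pointer and gains `omitempty`. The empty `ComputeDomainChannelSpec` body and the `encode` helper are placeholders for illustration, not the PR's code; for a Kubernetes API type an optional field would typically also carry a `+optional` marker.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ComputeDomainChannelSpec is a placeholder; its real fields are not shown in this thread.
type ComputeDomainChannelSpec struct{}

// ComputeDomainSpec mirrors the struct under review, with Channel made optional
// (assumption: a nil pointer means "no IMEX channel requested").
type ComputeDomainSpec struct {
	NumNodes int                       `json:"numNodes"`
	Channel  *ComputeDomainChannelSpec `json:"channel,omitempty"`
}

// encode is a hypothetical helper that serializes a spec to JSON.
func encode(spec ComputeDomainSpec) string {
	b, err := json.Marshal(spec)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// A nil Channel is omitted from the serialized object.
	fmt.Println(encode(ComputeDomainSpec{NumNodes: 2})) // {"numNodes":2}
	// A non-nil Channel is serialized as usual.
	fmt.Println(encode(ComputeDomainSpec{NumNodes: 2, Channel: &ComputeDomainChannelSpec{}})) // {"numNodes":2,"channel":{}}
}
```

Consumers of the spec would then need a nil check before dereferencing `Channel`.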

ArangoGutierrez · 2025-02-19T16:26:37Z

api/nvidia.com/resource/v1beta1/computedomainconfig.go

@@ -0,0 +1,89 @@
+/*
+ * Copyright (c) 2024, NVIDIA CORPORATION.  All rights reserved.


Maybe do it like some clothing brands do ("Since 1987"*)? But I am not a lawyer; maybe the year in the license header has a deeper legal meaning.

ArangoGutierrez · 2025-02-19T16:29:46Z

api/nvidia.com/resource/v1beta1/register.go

@@ -0,0 +1,49 @@
+/*
+ * Copyright (c) 2022, NVIDIA CORPORATION.  All rights reserved.


Is DRA 2022 old?

Signed-off-by: Kevin Klues <[email protected]>

jgehrcke · 2025-02-20T11:40:02Z

templates/compute-domain-daemon.tmpl.yaml

+            tail -f /dev/null & wait
+          fi
+          /usr/bin/nvidia-imex -c /etc/nvidia-imex/config.cfg
+          tail -n +1 -f /var/log/nvidia-imex.log & wait


Discussed in a sync meeting a while ago: here we give up control of the IMEX daemon process (do we? how does errexit behave when a daemonized process exits non-zero?). In any case, for robustness and debuggability it will be good to actively monitor the health of the IMEX daemon process (polling the process, or better: getting a health signal actively and straight from the process). I'd like to look into that at some point, after merge.

This is still a problem. If the daemon crashes, we will not exit (though the liveness probe will eventually fail and the pod will be restarted). We should make this more robust as a follow-up, probably by writing a small Go utility instead of doing everything in bash.
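One possible shape for the small Go utility mentioned above, as a sketch only: it runs a command as a child process, waits for it, and propagates its exit code so the container exits when the child does. It assumes the daemon can be kept in the foreground, which this thread does not confirm; for a process that daemonizes itself, polling its PID or a health endpoint would be needed instead. The name `runForeground` is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// runForeground starts the command as a child process, blocks until it exits,
// and returns its exit code. A non-nil error means the command could not be run.
func runForeground(name string, args ...string) (int, error) {
	cmd := exec.Command(name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		var exitErr *exec.ExitError
		if errors.As(err, &exitErr) {
			// The child ran and exited non-zero (ExitCode is -1 if it was killed by a signal).
			return exitErr.ExitCode(), nil
		}
		return -1, err
	}
	return 0, nil
}

func main() {
	// Print usage and return when no command is given.
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: supervise <command> [args...]")
		return
	}
	code, err := runForeground(os.Args[1], os.Args[2:]...)
	if err != nil {
		fmt.Fprintln(os.Stderr, "failed to run child:", err)
		os.Exit(1)
	}
	// Exit with the child's code so the container runtime sees the failure.
	os.Exit(code)
}
```

With this as the container entrypoint, a crash of the child would end the container immediately rather than waiting for the liveness probe to fail.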

klueska force-pushed the add-multi-node-crd branch 10 times, most recently from 38065b4 to 3e51cd8 Compare January 14, 2025 08:15

klueska force-pushed the add-multi-node-crd branch 20 times, most recently from 4ce9bdb to 0d435d8 Compare January 22, 2025 15:56

klueska added 12 commits February 19, 2025 10:28

Use a daemonset instead of a deployment to run ComputeDomain daemons

7821db4

Signed-off-by: Kevin Klues <[email protected]>

Block ComputeDomain deletion while a workload is still running in it

e9c7101

Signed-off-by: Kevin Klues <[email protected]>

Add a liveness probe to the ComputeDomain daemon

9cbc576

Signed-off-by: Kevin Klues <[email protected]>

Ensure that ResourceClaim / ComputeDomain namespace are the same

3373134

Signed-off-by: Kevin Klues <[email protected]>

Add the notion of a "permanent" error to the kubelet plugin

993b853

For now just mark one well-known error as permanent. Future commits will abstract this better and mark more errors as permanent. Signed-off-by: Kevin Klues <[email protected]>
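The notion of a "permanent" error can be sketched in Go with a wrapper type and `errors.As`. This is an illustration of the general pattern, not the kubelet plugin's actual code; `permanentError`, `markPermanent`, `isPermanent`, and `errExample` are hypothetical names, and the thread does not say which error the commit marks.

```go
package main

import (
	"errors"
	"fmt"
)

// errExample stands in for some failure that retrying cannot fix (illustrative only).
var errExample = errors.New("example failure")

// permanentError wraps an error to signal that it should not be retried.
type permanentError struct{ err error }

func (e permanentError) Error() string { return e.err.Error() }
func (e permanentError) Unwrap() error { return e.err }

// markPermanent wraps err as permanent; a nil error stays nil.
func markPermanent(err error) error {
	if err == nil {
		return nil
	}
	return permanentError{err: err}
}

// isPermanent reports whether any error in err's chain was marked permanent.
func isPermanent(err error) bool {
	var p permanentError
	return errors.As(err, &p)
}

func main() {
	wrapped := fmt.Errorf("prepare failed: %w", markPermanent(errExample))
	fmt.Println(isPermanent(wrapped))    // true
	fmt.Println(isPermanent(errExample)) // false
}
```

A caller could use `isPermanent` to decide whether to retry an operation or give up and surface the failure.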

Harden logic around calling prepare / unprepare on allocated claims

7783596

Signed-off-by: Kevin Klues <[email protected]>

Abstract out getConfigResultsMap so it can be reused later

0e5611f

Signed-off-by: Kevin Klues <[email protected]>

Unconditionally unprepare imex channels and daemons

ef25561

Signed-off-by: Kevin Klues <[email protected]>

Treat a ClusterUUID of all 0s to mean no IMEX support as well

a37de39

Signed-off-by: Kevin Klues <[email protected]>
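A rough sketch of the check described by the commit title above, assuming the ClusterUUID is available as a string and that an empty value already meant "no IMEX support" (inferred from the "as well" in the title). The function name `hasIMEXSupport` is hypothetical, and how the driver obtains the UUID is not shown here.

```go
package main

import "fmt"

// hasIMEXSupport reports whether a cluster UUID indicates IMEX support.
// An empty UUID, or one made up only of '0' digits and '-' separators,
// is treated as "no support".
func hasIMEXSupport(clusterUUID string) bool {
	for _, r := range clusterUUID {
		if r != '0' && r != '-' {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasIMEXSupport(""))                                     // false
	fmt.Println(hasIMEXSupport("00000000-0000-0000-0000-000000000000")) // false
	fmt.Println(hasIMEXSupport("1a2b3c4d-0000-0000-0000-000000000000")) // true
}
```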

Add a level of indirection with a new 'channel' field in ComputeDomain

4625953

Signed-off-by: Kevin Klues <[email protected]>

Ensure that the fabric-imex-mgmt nvcap is created and injected always

718e69d

Signed-off-by: Kevin Klues <[email protected]>

Recursively unmount /proc/driver/nvidia if it is mounted

5a83bac

Signed-off-by: Kevin Klues <[email protected]>

klueska force-pushed the add-multi-node-crd branch from 1e6a587 to a45c238 Compare February 19, 2025 10:28

klueska added 4 commits February 19, 2025 15:36

Add demo specs for working with compute domains

3ea7913

Signed-off-by: Kevin Klues <[email protected]>

Only inject channel / daemon settings if running on an IMEX capable node

578ab87

Signed-off-by: Kevin Klues <[email protected]>

Add periodic cleanup of stale objects owned by deleted ComputeDomains

4464442

Signed-off-by: Kevin Klues <[email protected]>

Allow the DRA driver for GPUs to be force installed if desired

222df11

Signed-off-by: Kevin Klues <[email protected]>

klueska force-pushed the add-multi-node-crd branch 2 times, most recently from df90001 to 764c3c7 Compare February 19, 2025 15:43

ArangoGutierrez reviewed Feb 19, 2025


Determine cliqueID from NVML not node label

474f968

Signed-off-by: Kevin Klues <[email protected]>

klueska force-pushed the add-multi-node-crd branch from 764c3c7 to 474f968 Compare February 19, 2025 17:00

jgehrcke reviewed Feb 20, 2025


klueska merged commit d1fad7e into NVIDIA:main Feb 20, 2025
4 checks passed

jgehrcke mentioned this pull request Mar 22, 2025

IMEX daemon startup error not caught, CD marked as ready #289

Open

klueska added this to Planning Board: k8s-dra-driver-gpu Jun 16, 2025

github-project-automation bot moved this to Backlog in Planning Board: k8s-dra-driver-gpu Jun 16, 2025

klueska added this to the v25.3.0 milestone Aug 13, 2025

klueska deleted the add-multi-node-crd branch August 20, 2025 21:26

jgehrcke mentioned this pull request Nov 17, 2025

MIG-partitioned Nodes also have whole GPUs advertised in ResourceSlice #719

Open
