Conversation

@guptaNswati
Contributor

Fixes #229. cc @dekonnection

$ kubectl auth can-i get deployments --as=system:serviceaccount:nvidia-dra-driver:nvidia-dra-driver-k8s-dra-driver-service-account -n nvidia-dra-driver
yes
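
For context, the "yes" above means the driver's service account can now get Deployments. The rule that grants this looks roughly like the following; this is a minimal sketch, assuming a role name, namespace, and verb set (they are illustrative, not the exact chart template):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: nvidia-dra-driver-k8s-dra-driver-role   # illustrative name
  namespace: nvidia-dra-driver
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  # assumed verb set; the driver manages the MPS control daemon through a Deployment
  verbs: ["get", "list", "watch", "create", "delete"]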

Signed-off-by: Swati Gupta <[email protected]>
@dekonnection

Hi, I tried your branch; it fixed the initial error, but now I get this one:

Failed to prepare dynamic resources: NodePrepareResources failed for claim lab/mps-gpu-7c9db8954b-mbwj5-mps-gpus-682s2: error preparing devices for claim 37925188-b216-4aa3-8ca7-f32cf28476ae: prepare devices failed: error applying GPU config: MPS control daemon is not yet ready: error listing pods from deployment

@guptaNswati
Contributor Author

@dekonnection I added permissions for jobs and pods. Hopefully that will fix the errors.
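
Roughly, the extra rules would look like this (a sketch of the kind of rules added, not the exact diff; the verb sets are assumptions):

- apiGroups: [""]
  resources: ["pods"]
  # the driver lists the MPS control daemon's pods to decide whether it is ready
  verbs: ["get", "list", "watch"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch"]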

@guptaNswati
Contributor Author

Quick test with the added permissions.

I updated the demo/specs/quickstart/gpu-test-mps.yaml spec to set:

  restartPolicy: "Never"
  image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.6.0-ubuntu18.04
  command: ["bash", "-c"]
  args: ["/tmp/sample -benchmark -i=5000"]
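
For reference, here is a minimal sketch of how those fields sit in the test pod spec. The ResourceClaim wiring is assumed from the claim name in the events below and from the DRA pod API (the exact field layout depends on the Kubernetes version); it is not a copy of gpu-test-mps.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
  namespace: gpu-test-mps
spec:
  restartPolicy: "Never"                        # run the benchmark once, do not restart
  resourceClaims:
  - name: shared-gpu                            # assumed claim name
    resourceClaimTemplateName: single-gpu-mps   # hypothetical template sharing one GPU via MPS
  containers:
  - name: mps-ctr0
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.6.0-ubuntu18.04
    command: ["bash", "-c"]
    args: ["/tmp/sample -benchmark -i=5000"]
    resources:
      claims:
      - name: shared-gpu
  - name: mps-ctr1
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.6.0-ubuntu18.04
    command: ["bash", "-c"]
    args: ["/tmp/sample -benchmark -i=5000"]
    resources:
      claims:
      - name: shared-gpu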
 
$ kubectl apply -f demo/specs/quickstart/gpu-test-mps.yaml 
 
$ kubectl get pods -n gpu-test-mps 
NAME       READY   STATUS              RESTARTS   AGE
test-pod   0/2     ContainerCreating   0          13m
 
$ kubectl describe pod test-pod -n gpu-test-mps
   Warning  FailedPrepareDynamicResources  64s (x2 over 2m13s)  kubelet            Failed to prepare dynamic resources: NodePrepareResources failed for claim gpu-test-mps/test-pod-shared-gpu-stwn2: error preparing devices for claim 7d607266-069f-4162-90d6-6a6ed85ea459: prepare devices failed: error applying GPU config: error starting MPS control daemon: error checking if control daemon already started: failed to get deployment: deployments.apps "mps-control-daemon-7d607266-069f-4162-90d6-6a6ed85ea459-77bf4" is forbidden: User "system:serviceaccount:nvidia:nvidia-dra-driver-k8s-dra-driver-service-account" cannot get resource "deployments" in API group "apps" in the namespace "nvidia"

$ kubectl apply -f role.yaml
role.rbac.authorization.k8s.io/nvidia-dra-driver-k8s-dra-driver-app-role created

$ kubectl apply -f rolebinding.yaml
rolebinding.rbac.authorization.k8s.io/nvidia-dra-driver-k8s-dra-driver-app-role-binding created
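
role.yaml presumably carries rules like those sketched above for deployments, pods, and jobs (the can-i checks below confirm get access), and rolebinding.yaml ties the role to the driver's service account. A minimal sketch of the binding, reconstructed from the resource names in the output above (the namespace is an assumption):

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: nvidia-dra-driver-k8s-dra-driver-app-role-binding
  namespace: nvidia-dra-driver
subjects:
- kind: ServiceAccount
  name: nvidia-dra-driver-k8s-dra-driver-service-account
  namespace: nvidia-dra-driver
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: nvidia-dra-driver-k8s-dra-driver-app-role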

$ kubectl auth can-i get pods --as=system:serviceaccount:nvidia-dra-driver:nvidia-dra-driver-k8s-dra-driver-service-account -n nvidia-dra-driver
yes

$ kubectl auth can-i get job --as=system:serviceaccount:nvidia-dra-driver:nvidia-dra-driver-k8s-dra-driver-service-account -n nvidia-dra-driver
yes

$ kubectl auth can-i get deployments --as=system:serviceaccount:nvidia-dra-driver:nvidia-dra-driver-k8s-dra-driver-service-account -n nvidia-dra-driver
yes

$ kubectl get pods -n gpu-test-mps
NAME       READY   STATUS      RESTARTS   AGE
test-pod   0/2     Completed   0          6m38s

$ kubectl logs test-pod -n gpu-test-mps
Defaulted container "mps-ctr0" out of: mps-ctr0, mps-ctr1
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
	-fullscreen       (run n-body simulation in fullscreen mode)
	-fp64             (use double precision floating point values for simulation)
	-hostmem          (stores simulation data in host memory)
	-benchmark        (run benchmark to measure performance) 
	-numbodies=<N>    (number of bodies (>= 1) to run in simulation) 
	-device=<d>       (where d=0,1,2.... for the CUDA device to use)
	-numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
	-compare          (compares simulation results running once on the default GPU and once on the CPU)
	-cpu              (run n-body simulation on the CPU)
	-tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
MapSMtoCores for SM 9.0 is undefined.  Default to use 128 Cores/SM
MapSMtoArchName for SM 9.0 is undefined.  Default to use Ampere
GPU Device 0: "Ampere" with compute capability 9.0

> Compute 9.0 CUDA device: [NVIDIA GH200 96GB HBM3]
67584 bodies, total time for 5000 iterations: 26196.221 ms
= 871.805 billion interactions per second
= 17436.092 single-precision GFLOP/s at 20 flops per interaction

$ kubectl logs test-pod -c mps-ctr1 -n gpu-test-mps
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
	-fullscreen       (run n-body simulation in fullscreen mode)
	-fp64             (use double precision floating point values for simulation)
	-hostmem          (stores simulation data in host memory)
	-benchmark        (run benchmark to measure performance) 
	-numbodies=<N>    (number of bodies (>= 1) to run in simulation) 
	-device=<d>       (where d=0,1,2.... for the CUDA device to use)
	-numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
	-compare          (compares simulation results running once on the default GPU and once on the CPU)
	-cpu              (run n-body simulation on the CPU)
	-tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
MapSMtoCores for SM 9.0 is undefined.  Default to use 128 Cores/SM
MapSMtoArchName for SM 9.0 is undefined.  Default to use Ampere
GPU Device 0: "Ampere" with compute capability 9.0

> Compute 9.0 CUDA device: [NVIDIA GH200 96GB HBM3]
67584 bodies, total time for 5000 iterations: 26197.629 ms
= 871.758 billion interactions per second
= 17435.154 single-precision GFLOP/s at 20 flops per interaction

guptaNswati merged commit fe64609 into NVIDIA:main on Jan 31, 2025
6 checks passed
klueska added this to the v25.3.0 milestone on Aug 13, 2025
guptaNswati self-assigned this on Nov 18, 2025


Successfully merging this pull request may close these issues.

Regression on MPS setups following changes in clusterrole
