Conversation

@guptaNswati
Contributor

Fixes #229. cc @dekonnection

$ kubectl auth can-i get deployments --as=system:serviceaccount:nvidia-dra-driver:nvidia-dra-driver-k8s-dra-driver-service-account -n nvidia-dra-driver
yes
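
For context, the "yes" above means the driver's service account can now get Deployments. The rule that grants this looks roughly like the following; this is a minimal sketch, assuming a role name, namespace, and verb set (they are illustrative, not the exact chart template):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: nvidia-dra-driver-k8s-dra-driver-role   # illustrative name
  namespace: nvidia-dra-driver
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  # assumed verb set; the driver manages the MPS control daemon through a Deployment
  verbs: ["get", "list", "watch", "create", "delete"]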

Signed-off-by: Swati Gupta <[email protected]>
@dekonnection

Hi, I tried your branch; it fixed the initial error, but now I get this one:

Failed to prepare dynamic resources: NodePrepareResources failed for claim lab/mps-gpu-7c9db8954b-mbwj5-mps-gpus-682s2: error preparing devices for claim 37925188-b216-4aa3-8ca7-f32cf28476ae: prepare devices failed: error applying GPU config: MPS control daemon is not yet ready: error listing pods from deployment

@guptaNswati
Contributor Author

@dekonnection I added permissions for jobs and pods. Hopefully that will fix the errors.
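
Roughly, the extra rules would look like this (a sketch of the kind of rules added, not the exact diff; the verb sets are assumptions):

- apiGroups: [""]
  resources: ["pods"]
  # the driver lists the MPS control daemon's pods to decide whether it is ready
  verbs: ["get", "list", "watch"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch"]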

@guptaNswati
Contributor Author

Quick test with the added permissions.

I updated the demo/specs/quickstart/gpu-test-mps.yaml spec to set:

  restartPolicy: "Never"
  image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.6.0-ubuntu18.04
  command: ["bash", "-c"]
  args: ["/tmp/sample -benchmark -i=5000"]
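
For reference, here is a minimal sketch of how those fields sit in the test pod spec. The ResourceClaim wiring is assumed from the claim name in the events below and from the DRA pod API (the exact field layout depends on the Kubernetes version); it is not a copy of gpu-test-mps.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
  namespace: gpu-test-mps
spec:
  restartPolicy: "Never"                        # run the benchmark once, do not restart
  resourceClaims:
  - name: shared-gpu                            # assumed claim name
    resourceClaimTemplateName: single-gpu-mps   # hypothetical template sharing one GPU via MPS
  containers:
  - name: mps-ctr0
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.6.0-ubuntu18.04
    command: ["bash", "-c"]
    args: ["/tmp/sample -benchmark -i=5000"]
    resources:
      claims:
      - name: shared-gpu
  - name: mps-ctr1
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.6.0-ubuntu18.04
    command: ["bash", "-c"]
    args: ["/tmp/sample -benchmark -i=5000"]
    resources:
      claims:
      - name: shared-gpu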
 
$ kubectl apply -f demo/specs/quickstart/gpu-test-mps.yaml 
 
$ kubectl get pods -n gpu-test-mps 
NAME       READY   STATUS              RESTARTS   AGE
test-pod   0/2     ContainerCreating   0          13m
 
$ kubectl describe pod test-pod -n gpu-test-mps
   Warning  FailedPrepareDynamicResources  64s (x2 over 2m13s)  kubelet            Failed to prepare dynamic resources: NodePrepareResources failed for claim gpu-test-mps/test-pod-shared-gpu-stwn2: error preparing devices for claim 7d607266-069f-4162-90d6-6a6ed85ea459: prepare devices failed: error applying GPU config: error starting MPS control daemon: error checking if control daemon already started: failed to get deployment: deployments.apps "mps-control-daemon-7d607266-069f-4162-90d6-6a6ed85ea459-77bf4" is forbidden: User "system:serviceaccount:nvidia:nvidia-dra-driver-k8s-dra-driver-service-account" cannot get resource "deployments" in API group "apps" in the namespace "nvidia"

$ kubectl apply -f role.yaml
role.rbac.authorization.k8s.io/nvidia-dra-driver-k8s-dra-driver-app-role created

$ kubectl apply -f rolebinding.yaml
rolebinding.rbac.authorization.k8s.io/nvidia-dra-driver-k8s-dra-driver-app-role-binding created
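
role.yaml presumably carries rules like those sketched above for deployments, pods, and jobs (the can-i checks below confirm get access), and rolebinding.yaml ties the role to the driver's service account. A minimal sketch of the binding, reconstructed from the resource names in the output above (the namespace is an assumption):

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: nvidia-dra-driver-k8s-dra-driver-app-role-binding
  namespace: nvidia-dra-driver
subjects:
- kind: ServiceAccount
  name: nvidia-dra-driver-k8s-dra-driver-service-account
  namespace: nvidia-dra-driver
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: nvidia-dra-driver-k8s-dra-driver-app-role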

$ kubectl auth can-i get pods --as=system:serviceaccount:nvidia-dra-driver:nvidia-dra-driver-k8s-dra-driver-service-account -n nvidia-dra-driver
yes

$ kubectl auth can-i get job --as=system:serviceaccount:nvidia-dra-driver:nvidia-dra-driver-k8s-dra-driver-service-account -n nvidia-dra-driver
yes

$ kubectl auth can-i get deployments --as=system:serviceaccount:nvidia-dra-driver:nvidia-dra-driver-k8s-dra-driver-service-account -n nvidia-dra-driver
yes

$ kubectl get pods -n gpu-test-mps
NAME       READY   STATUS      RESTARTS   AGE
test-pod   0/2     Completed   0          6m38s

$ kubectl logs test-pod -n gpu-test-mps
Defaulted container "mps-ctr0" out of: mps-ctr0, mps-ctr1
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
	-fullscreen       (run n-body simulation in fullscreen mode)
	-fp64             (use double precision floating point values for simulation)
	-hostmem          (stores simulation data in host memory)
	-benchmark        (run benchmark to measure performance) 
	-numbodies=<N>    (number of bodies (>= 1) to run in simulation) 
	-device=<d>       (where d=0,1,2.... for the CUDA device to use)
	-numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
	-compare          (compares simulation results running once on the default GPU and once on the CPU)
	-cpu              (run n-body simulation on the CPU)
	-tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
MapSMtoCores for SM 9.0 is undefined.  Default to use 128 Cores/SM
MapSMtoArchName for SM 9.0 is undefined.  Default to use Ampere
GPU Device 0: "Ampere" with compute capability 9.0

> Compute 9.0 CUDA device: [NVIDIA GH200 96GB HBM3]
67584 bodies, total time for 5000 iterations: 26196.221 ms
= 871.805 billion interactions per second
= 17436.092 single-precision GFLOP/s at 20 flops per interaction

$ kubectl logs test-pod -c mps-ctr1 -n gpu-test-mps
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
	-fullscreen       (run n-body simulation in fullscreen mode)
	-fp64             (use double precision floating point values for simulation)
	-hostmem          (stores simulation data in host memory)
	-benchmark        (run benchmark to measure performance) 
	-numbodies=<N>    (number of bodies (>= 1) to run in simulation) 
	-device=<d>       (where d=0,1,2.... for the CUDA device to use)
	-numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
	-compare          (compares simulation results running once on the default GPU and once on the CPU)
	-cpu              (run n-body simulation on the CPU)
	-tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
MapSMtoCores for SM 9.0 is undefined.  Default to use 128 Cores/SM
MapSMtoArchName for SM 9.0 is undefined.  Default to use Ampere
GPU Device 0: "Ampere" with compute capability 9.0

> Compute 9.0 CUDA device: [NVIDIA GH200 96GB HBM3]
67584 bodies, total time for 5000 iterations: 26197.629 ms
= 871.758 billion interactions per second
= 17435.154 single-precision GFLOP/s at 20 flops per interaction

guptaNswati merged commit fe64609 into NVIDIA:main on Jan 31, 2025
6 checks passed
klueska added this to the v25.3.0 milestone on Aug 13, 2025
guptaNswati self-assigned this on Nov 18, 2025


Successfully merging this pull request may close these issues.

Regression on MPS setups following changes in clusterrole
