Expose per Pod GPU metrics collected by dcgm-exporter to include DRA information #652
guptaNswati
started this conversation in
General
To start adding DRA information, follow these steps.
Step 1: Update to k8s 1.32 and enable the DRA and PodResourcesDRA feature gates. These are required for the podresources API to include DRA information. Refer to kubernetes/enhancements#3695.
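As a rough sketch of Step 1, the snippet below renders the flag string that could be passed to the kubelet and API server. It assumes the gates called DRA and PodResourcesDRA above correspond to the upstream gate names `DynamicResourceAllocation` and `KubeletPodResourcesDynamicResources`; check the feature gate list for your Kubernetes version before relying on it. The helper name `feature_gates_flag` is hypothetical.

```python
# Assumption: these are the upstream names of the two gates mentioned in Step 1.
DRA_GATES = {
    "DynamicResourceAllocation": True,
    "KubeletPodResourcesDynamicResources": True,
}


def feature_gates_flag(gates):
    """Render a {gate: bool} mapping as a Kubernetes --feature-gates flag."""
    pairs = [f"{name}={'true' if enabled else 'false'}" for name, enabled in gates.items()]
    return "--feature-gates=" + ",".join(pairs)


if __name__ == "__main__":
    print(feature_gates_flag(DRA_GATES))
```

How the flag reaches the components depends on how the cluster was created (kubeadm config, kind config, managed service settings), so this only shows the value, not where to put it.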
Step 2: Disable k8s-device-plugin and install nvidia-k8s-dra-driver with the --set gpuResourcesEnabledOverride=true option to enable GPU allocation. Follow the official guide, or this for multinode testing: #249
Step 3: Install the dcgm-exporter release DCGM-Exporter 4.3.1-4.4.0 with the --set kubernetesDRA.enabled=true option. If running with gpu-operator, manually create the ClusterRole and bindings that allow dcgm-exporter to access resourceslices, and add this env var to the exporter pod:
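For the RBAC part of Step 3 only (this is not the env var), a minimal sketch of the ClusterRole and ClusterRoleBinding might look like the following. It builds the manifests as plain dicts and prints JSON, which `kubectl apply -f -` accepts. The object name `dcgm-exporter-dra-read`, the verbs `get`/`list`/`watch`, and the service account and namespace passed in the example are assumptions; use the values from your own deployment.

```python
import json


def rbac_for_resourceslices(service_account, namespace, name="dcgm-exporter-dra-read"):
    """Build a ClusterRole + ClusterRoleBinding giving read access to ResourceSlices."""
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": [{
            "apiGroups": ["resource.k8s.io"],      # API group that serves DRA objects
            "resources": ["resourceslices"],
            "verbs": ["get", "list", "watch"],     # assumed read-only verbs
        }],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {"name": name},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": name,
        },
        "subjects": [{
            "kind": "ServiceAccount",
            "name": service_account,
            "namespace": namespace,
        }],
    }
    # Wrap both objects in a List so they can be applied in one call.
    return {"apiVersion": "v1", "kind": "List", "items": [role, binding]}


if __name__ == "__main__":
    # Hypothetical service account and namespace; adjust to your install.
    print(json.dumps(rbac_for_resourceslices("nvidia-dcgm-exporter", "gpu-operator"), indent=2))
```

Saved as `rbac.py`, the output can be piped straight in with `python rbac.py | kubectl apply -f -`.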
Step 4: Deploy an example GPU pod, DRA style.
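The example pod is not included above, so here is one possible shape for it, not the author's manifest. It assumes the `resource.k8s.io/v1beta1` API that ships with k8s 1.32 and a `gpu.nvidia.com` DeviceClass installed by the DRA driver; the object names, the `ubuntu:22.04` image, and the `sleep` command are placeholders.

```python
import json


def dra_gpu_example(name="dra-gpu-example", namespace="default"):
    """Build a ResourceClaimTemplate and a Pod that requests one GPU via DRA."""
    template_name = f"{name}-gpu"
    claim_template = {
        "apiVersion": "resource.k8s.io/v1beta1",   # assumed: DRA API version in 1.32
        "kind": "ResourceClaimTemplate",
        "metadata": {"name": template_name, "namespace": namespace},
        "spec": {"spec": {"devices": {"requests": [
            # assumed DeviceClass name provided by the NVIDIA DRA driver
            {"name": "gpu", "deviceClassName": "gpu.nvidia.com"},
        ]}}},
    }
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "restartPolicy": "Never",
            # Pod-level claim, created from the template above
            "resourceClaims": [
                {"name": "gpu", "resourceClaimTemplateName": template_name},
            ],
            "containers": [{
                "name": "ctr",
                "image": "ubuntu:22.04",            # placeholder image
                "command": ["sleep", "3600"],       # keep the pod alive for scraping
                # Container-level reference to the pod-level claim
                "resources": {"claims": [{"name": "gpu"}]},
            }],
        },
    }
    return {"apiVersion": "v1", "kind": "List", "items": [claim_template, pod]}


if __name__ == "__main__":
    print(json.dumps(dra_gpu_example(), indent=2))
```

The key difference from a device-plugin pod is that there is no `nvidia.com/gpu` resource limit; the GPU comes from the claim referenced under `resources.claims`.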
Step 5: Follow the dcgm-exporter instructions to see the metrics.
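Once the metrics are visible, a quick way to check whether any samples carry DRA information is to filter the scraped text by label name. The sketch below parses Prometheus text-format lines with the standard library. The `dra_` label prefix and the `dra_claim_name` label in the sample are assumptions for illustration; look at your own `/metrics` output for the actual label names.

```python
import re

# name{labels} value  (an optional trailing timestamp is ignored)
SAMPLE_RE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def samples_with_label_prefix(text, prefix="dra_"):
    """Return (metric, pod, matching_labels, value) for samples that have
    at least one non-empty label whose name starts with `prefix`."""
    found = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if not match:                          # skip samples without labels
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        matching = {k: v for k, v in labels.items() if k.startswith(prefix) and v}
        if matching:
            found.append((match.group("name"), labels.get("pod", ""),
                          matching, float(match.group("value"))))
    return found


# Hypothetical scrape output; the label names and values are made up.
SAMPLE_TEXT = '''
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",pod="dra-gpu-example",namespace="default",dra_claim_name="gpu"} 37
DCGM_FI_DEV_GPU_UTIL{gpu="1",pod="",namespace=""} 0
'''

if __name__ == "__main__":
    for sample in samples_with_label_prefix(SAMPLE_TEXT):
        print(sample)
```

To run it against a live exporter, replace `SAMPLE_TEXT` with the body fetched from the exporter's `/metrics` endpoint (port 9400 by default), for example after a `kubectl port-forward`.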