Description
Problem
In HPC environments, DCGM Exporter cannot correlate GPU metrics with workload manager job IDs (e.g., Slurm jobs) because it lacks access to job mapping files.
According to the DCGM Exporter documentation (https://github.com/NVIDIA/dcgm-exporter?tab=readme-ov-file#enabling-hpc-job-mapping-on-dcgm-exporter), GPU-to-job mapping can be enabled by setting DCGM_HPC_JOB_MAPPING_DIR and providing access to a directory where the HPC cluster creates job mapping files.
Currently, the GPU Operator's ClusterPolicy CRD supports configuring DCGM Exporter's environment variables but does not support custom volumes/volumeMounts. This prevents HPC workload managers (like Slurm) from enabling GPU-to-job mapping in DCGM metrics.
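For reference, the environment variable half of the setup can already be expressed via ClusterPolicy today; it is only the mount that cannot. A minimal sketch (the directory path /var/run/dcgm-hpc-jobs is an illustrative value, not a required location):

```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  dcgmExporter:
    env:
      # Points DCGM Exporter at the job mapping directory, but the
      # corresponding volume/volumeMount cannot be declared here today.
      - name: DCGM_HPC_JOB_MAPPING_DIR
        value: "/var/run/dcgm-hpc-jobs"
```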
The workflow requires:
- The DCGM_HPC_JOB_MAPPING_DIR environment variable set to the mapping directory path
- The DCGM Exporter container to mount that directory so it can read the mapping files
The missing piece is the mount. The ClusterPolicy CRD doesn't expose volumes or volumeMounts fields for DCGM Exporter, and the base DaemonSet template (https://github.com/NVIDIA/gpu-operator/blob/main/assets/state-dcgm-exporter/0800_daemonset.yaml) only defines a fixed, limited set of volume mounts.
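Roughly, the rendered DaemonSet would need additions along these lines (a sketch only; the hostPath, mount path, and container name are placeholder assumptions, not the operator's actual template):

```yaml
# Sketch of the additions the DCGM Exporter DaemonSet would need so the
# container can read the mapping files the workload manager writes on the host.
spec:
  template:
    spec:
      containers:
        - name: nvidia-dcgm-exporter
          env:
            - name: DCGM_HPC_JOB_MAPPING_DIR
              value: /var/run/dcgm-hpc-jobs
          volumeMounts:
            - name: hpc-job-mapping
              mountPath: /var/run/dcgm-hpc-jobs
              readOnly: true
      volumes:
        - name: hpc-job-mapping
          hostPath:
            path: /var/run/dcgm-hpc-jobs
            type: DirectoryOrCreate
```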
Proposed Solution
Add an hpcJobMapping configuration to DCGMExporterSpec allowing users to enable and configure the job mapping directory.
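One possible shape for this in ClusterPolicy (illustrative only; field names beyond hpcJobMapping itself are assumptions, not a finalized API):

```yaml
spec:
  dcgmExporter:
    # Proposed: enabling this would set DCGM_HPC_JOB_MAPPING_DIR and add the
    # matching hostPath volume/volumeMount on the DCGM Exporter DaemonSet.
    hpcJobMapping:
      enabled: true
      directory: /var/run/dcgm-hpc-jobs  # hypothetical field and path
```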
PR
Testing
All existing tests pass, plus a new test covering the HPC job mapping transformation.