Add HPC job mapping support for DCGM Exporter #1893

@faganihajizada

Description

Problem

In HPC environments, DCGM Exporter cannot correlate GPU metrics with
workload manager job IDs (e.g., Slurm jobs) because it lacks access to
job mapping files.

According to the DCGM Exporter documentation (https://github.com/NVIDIA/dcgm-exporter?tab=readme-ov-file#enabling-hpc-job-mapping-on-dcgm-exporter), GPU-to-job mapping can be enabled by setting DCGM_HPC_JOB_MAPPING_DIR and providing access to a directory where the HPC cluster creates job mapping files.

Currently, the GPU Operator's ClusterPolicy CRD supports configuring DCGM Exporter's environment variables but does not support custom volumes/volumeMounts. This prevents HPC workload managers (like Slurm) from enabling GPU-to-job mapping in DCGM metrics.

The workflow requires:

  • The DCGM Exporter container must mount this directory so it can read the mapping files
  • The environment variable DCGM_HPC_JOB_MAPPING_DIR must point to that directory
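
In container terms, the two requirements above would look roughly like the following DaemonSet fragment. This is a hedged sketch: the host path /var/run/dcgm-hpc-jobs and the volume name hpc-job-mapping are illustrative choices for this example, not documented defaults of DCGM Exporter or the GPU Operator.

```yaml
# Illustrative fragment of a DCGM Exporter DaemonSet pod spec.
# The host path and volume name are assumptions for this example.
containers:
  - name: dcgm-exporter
    env:
      - name: DCGM_HPC_JOB_MAPPING_DIR
        value: /var/run/dcgm-hpc-jobs
    volumeMounts:
      # Mount the directory where the workload manager writes mapping files
      - name: hpc-job-mapping
        mountPath: /var/run/dcgm-hpc-jobs
        readOnly: true
volumes:
  - name: hpc-job-mapping
    hostPath:
      path: /var/run/dcgm-hpc-jobs
      type: DirectoryOrCreate
```

A read-only hostPath mount is sufficient here, since DCGM Exporter only consumes the mapping files that the HPC workload manager creates on the node.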

The missing piece is the mount: the ClusterPolicy CRD does not expose volumes or volumeMounts fields for DCGM Exporter, and the base DaemonSet template (https://github.com/NVIDIA/gpu-operator/blob/main/assets/state-dcgm-exporter/0800_daemonset.yaml) defines only a fixed, limited set of volume mounts.

Proposed Solution

Add hpcJobMapping configuration to DCGMExporterSpec allowing users
to enable and configure the job mapping directory.
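
A possible shape for the new field is sketched below. The field names (hpcJobMapping, enabled, hostPath) are hypothetical and would be settled in the PR; this only illustrates the intended user experience.

```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  dcgmExporter:
    # Hypothetical field proposed by this issue; names are illustrative.
    hpcJobMapping:
      enabled: true
      # Host directory where the workload manager (e.g. Slurm)
      # writes the GPU-to-job mapping files.
      hostPath: /var/run/dcgm-hpc-jobs
```

When enabled, the operator would inject the DCGM_HPC_JOB_MAPPING_DIR environment variable and the corresponding hostPath volume/volumeMount into the DCGM Exporter DaemonSet it manages.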

PR

#1894

Testing

All existing tests pass, plus a new test covering the HPC job mapping transformation.
