Label node nvidia.com/gpu.count for MIG Config #1501

@Alja9

Hello,
I have a question about the MIG labels on a node.

When MIG is configured on a node, the k8s-device-plugin adds MIG labels to the node. However, the nvidia.com/gpu.count label is not updated to reflect the MIG configuration. Examples:

  • Non-MIG configuration
...
                    nvidia.com/gpu.count=8
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.mig-manager=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=hopper
                    nvidia.com/gpu.machine=PowerEdge-XE9680
                    nvidia.com/gpu.memory=81559
                    nvidia.com/gpu.mode=compute
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-H100-80GB-HBM3
                    nvidia.com/gpu.replicas=1
                    nvidia.com/mig.capable=true
                    nvidia.com/mig.config=all-disabled
                    nvidia.com/mig.config.state=success
                    nvidia.com/mig.strategy=mixed
...
Capacity:
  ...
  nvidia.com/gpu:             8
  ...
Allocatable:
  ...
  nvidia.com/gpu:             8
  ...
...
  • MIG Configuration
...
                    nvidia.com/gpu.count=8
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.mig-manager=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.nvsm=true
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=hopper
                    nvidia.com/gpu.machine=PowerEdge-XE9680
                    nvidia.com/gpu.memory=81559
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-H100-80GB-HBM3
                    nvidia.com/gpu.replicas=1
                    nvidia.com/mig-1g.10gb.count=14
                    nvidia.com/mig-1g.10gb.engines.copy=1
                    nvidia.com/mig-1g.10gb.engines.decoder=1
                    nvidia.com/mig-1g.10gb.engines.encoder=0
                    nvidia.com/mig-1g.10gb.engines.jpeg=1
                    nvidia.com/mig-1g.10gb.engines.ofa=0
                    nvidia.com/mig-1g.10gb.memory=9984
                    nvidia.com/mig-1g.10gb.multiprocessors=16
                    nvidia.com/mig-1g.10gb.product=NVIDIA-H100-80GB-HBM3-MIG-1g.10gb
                    nvidia.com/mig-1g.10gb.replicas=1
                    nvidia.com/mig-1g.10gb.slices.ci=1
                    nvidia.com/mig-1g.10gb.slices.gi=1
                    nvidia.com/mig-3g.40gb.count=5
                    nvidia.com/mig-3g.40gb.engines.copy=3
                    nvidia.com/mig-3g.40gb.engines.decoder=3
                    nvidia.com/mig-3g.40gb.engines.encoder=0
                    nvidia.com/mig-3g.40gb.engines.jpeg=3
                    nvidia.com/mig-3g.40gb.engines.ofa=0
                    nvidia.com/mig-3g.40gb.memory=40320
                    nvidia.com/mig-3g.40gb.multiprocessors=60
                    nvidia.com/mig-3g.40gb.product=NVIDIA-H100-80GB-HBM3-MIG-3g.40gb
                    nvidia.com/mig-3g.40gb.replicas=1
                    nvidia.com/mig-3g.40gb.slices.ci=3
                    nvidia.com/mig-3g.40gb.slices.gi=3
                    nvidia.com/mig-4g.40gb.count=5
                    nvidia.com/mig-4g.40gb.engines.copy=4
                    nvidia.com/mig-4g.40gb.engines.decoder=4
                    nvidia.com/mig-4g.40gb.engines.encoder=0
                    nvidia.com/mig-4g.40gb.engines.jpeg=4
                    nvidia.com/mig-4g.40gb.engines.ofa=0
                    nvidia.com/mig-4g.40gb.memory=40320
                    nvidia.com/mig-4g.40gb.multiprocessors=64
                    nvidia.com/mig-4g.40gb.product=NVIDIA-H100-80GB-HBM3-MIG-4g.40gb
                    nvidia.com/mig-4g.40gb.replicas=1
                    nvidia.com/mig-4g.40gb.slices.ci=4
                    nvidia.com/mig-4g.40gb.slices.gi=4
                    nvidia.com/mig.capable=true
                    nvidia.com/mig.config=mig-config-26
                    nvidia.com/mig.config.state=success
                    nvidia.com/mig.strategy=mixed
...
Capacity:
  ...
  nvidia.com/gpu:             1
  nvidia.com/mig-1g.10gb:     14
  nvidia.com/mig-3g.40gb:     5
  nvidia.com/mig-4g.40gb:     5
  ...
Allocatable:
  ...
  nvidia.com/gpu:             1
  nvidia.com/mig-1g.10gb:     14
  nvidia.com/mig-3g.40gb:     5
  nvidia.com/mig-4g.40gb:     5
  ...
...

Is there any way to make the nvidia.com/gpu.count node label match the nvidia.com/gpu count reported under Capacity, or the GPU count defined in the MIG config?
(In the MIG example above, Capacity reports nvidia.com/gpu: 1, but the node label still shows nvidia.com/gpu.count=8 instead of 1.)
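
To make the mismatch easy to reproduce, here is a minimal sketch that compares the label with the reported capacity. It assumes the node object has been exported as JSON (for example with kubectl get node <node-name> -o json) and only reads metadata.labels and status.capacity. The function name summarize_gpu_counts and the trimmed example_node are hypothetical and exist only for illustration; the sketch reports the difference and does not change how the labels are generated.

```python
# Hypothetical helper: compares the nvidia.com/gpu.count label with the
# nvidia.com/gpu capacity of a node object loaded from JSON.

GPU_COUNT_LABEL = "nvidia.com/gpu.count"
GPU_RESOURCE = "nvidia.com/gpu"
MIG_RESOURCE_PREFIX = "nvidia.com/mig-"


def summarize_gpu_counts(node: dict) -> dict:
    """Return the label count, full-GPU capacity, and MIG device capacity."""
    labels = node.get("metadata", {}).get("labels", {})
    capacity = node.get("status", {}).get("capacity", {})

    label_count = int(labels.get(GPU_COUNT_LABEL, 0))
    full_gpus = int(capacity.get(GPU_RESOURCE, 0))
    # Collect every MIG resource advertised in capacity, e.g. nvidia.com/mig-1g.10gb.
    mig_devices = {
        name: int(value)
        for name, value in capacity.items()
        if name.startswith(MIG_RESOURCE_PREFIX)
    }
    return {
        "label_count": label_count,
        "full_gpus": full_gpus,
        "mig_devices": mig_devices,
        "label_matches_capacity": label_count == full_gpus,
    }


# Trimmed node object using the values from the MIG example above.
example_node = {
    "metadata": {"labels": {"nvidia.com/gpu.count": "8"}},
    "status": {
        "capacity": {
            "nvidia.com/gpu": "1",
            "nvidia.com/mig-1g.10gb": "14",
            "nvidia.com/mig-3g.40gb": "5",
            "nvidia.com/mig-4g.40gb": "5",
        }
    },
}

if __name__ == "__main__":
    print(summarize_gpu_counts(example_node))
```

For the MIG example, the label count is 8 while the full-GPU capacity is 1, so label_matches_capacity is False. For the non-MIG example both values are 8.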
