Wrong node capacity and allocatable when using MIG #637

@xhejtman

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
  • Kernel Version: 6.2.0-37-generic
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd, 1.7.7
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): Rancher/RKE2, 1.27.8
  • GPU Operator Version: 23.9.1

2. Issue or feature description

When MIG is enabled, both the MIG resources and the nvidia.com/gpu resource are reported as allocatable on the node:

Allocatable:
  cerit.io/gpu-count:      2
  cerit.io/gpu-mem:        0
  cpu:                     64
  ephemeral-storage:       7104643354787
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  519659388Ki
  nvidia.com/gpu:          2
  nvidia.com/mig-1g.10gb:  6
  nvidia.com/mig-2g.20gb:  4
  nvidia.com/mig-3g.40gb:  0
  pods:                    160

This means that pods requesting either nvidia.com/gpu or nvidia.com/mig-1g.10gb can be scheduled onto the node. However, a pod that requests nvidia.com/gpu does not get a GPU injected into its container.
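To spot affected nodes, a check along the following lines might help. It is only a sketch: it assumes the "Allocatable" block has been copied as text from the node description, and the helper names parse_allocatable and find_gpu_overlap are made up for illustration and are not part of the GPU Operator or Kubernetes.

```python
# Sketch: flag a node that advertises both nvidia.com/gpu and MIG resources.
# The helper names below are hypothetical; the sample data is the
# "Allocatable" block reported in this issue.

SAMPLE = """\
Allocatable:
  cerit.io/gpu-count:      2
  cerit.io/gpu-mem:        0
  cpu:                     64
  nvidia.com/gpu:          2
  nvidia.com/mig-1g.10gb:  6
  nvidia.com/mig-2g.20gb:  4
  nvidia.com/mig-3g.40gb:  0
  pods:                    160
"""


def parse_allocatable(text):
    """Return a dict mapping resource name to its raw string value."""
    resources = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only "name: value" lines; this skips the "Allocatable:" header.
        if len(parts) == 2 and parts[0].endswith(":"):
            resources[parts[0][:-1]] = parts[1]
    return resources


def find_gpu_overlap(resources):
    """Return (full_gpus, mig_slices) where mig_slices holds non-zero MIG counts."""
    full_gpus = int(resources.get("nvidia.com/gpu", "0"))
    mig_slices = {
        name: int(value)
        for name, value in resources.items()
        if name.startswith("nvidia.com/mig-") and int(value) > 0
    }
    return full_gpus, mig_slices


if __name__ == "__main__":
    full_gpus, mig_slices = find_gpu_overlap(parse_allocatable(SAMPLE))
    if full_gpus > 0 and mig_slices:
        print("node advertises both full GPUs and MIG slices:", full_gpus, mig_slices)
```

With the values from this report, both the full-GPU count and the MIG slice counts are non-zero, which is the combination that lets an nvidia.com/gpu request be scheduled onto a MIG-enabled node.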

3. Steps to reproduce the issue

Enable MIG on an A100 GPU.

This may be a bug in Kubernetes itself rather than in the GPU Operator.
