
When MIG uses the "mixed" mode, the nvidia-cuda-validator and nvidia-operator-validator Pods are always in the Init state #1738

@biqiangwu

Description

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

Describe the bug
My host has only one A100 card. When I installed the NVIDIA GPU Operator with MIG "mixed" mode enabled, all Pods were healthy. After applying a custom MIG configuration and setting the corresponding label on the node, the nvidia-cuda-validator and nvidia-operator-validator Pods stay stuck in the Init state.

To Reproduce

helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --version=v25.3.4 \
    --set mig.strategy=single
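
The report is about the "mixed" MIG strategy; the equivalent install differs only in that value (a minimal sketch, assuming nothing else in the command changes; mixed is a documented value of mig.strategy):

helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --version=v25.3.4 \
    --set mig.strategy=mixed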

-------
All the Pods are functioning properly

kn get po 
NAME                                                              READY   STATUS      RESTARTS      AGE
gpu-feature-discovery-tn58t                                       1/1     Running     0             18s
gpu-operator-1759128897-node-feature-discovery-gc-599fbd57lztkj   1/1     Running     1 (96m ago)   111m
gpu-operator-1759128897-node-feature-discovery-master-596bklwfh   1/1     Running     1 (96m ago)   111m
gpu-operator-1759128897-node-feature-discovery-worker-4cj2v       1/1     Running     1 (96m ago)   111m
gpu-operator-75dff77d5c-4cctc                                     1/1     Running     1 (96m ago)   111m
nvidia-container-toolkit-daemonset-cqpcm                          1/1     Running     0             95s
nvidia-cuda-validator-7nxmf                                       0/1     Completed   0             10s
nvidia-dcgm-exporter-kt58n                                        1/1     Running     0             18s
nvidia-device-plugin-daemonset-njhr8                              1/1     Running     0             18s
nvidia-driver-daemonset-7hl9h                                     1/1     Running     0             2m7s
nvidia-mig-manager-4f4z5                                          1/1     Running     0             95s
nvidia-operator-validator-jtl8t                                   1/1     Running     0             95s

-------
Create the custom MIG configuration

cat custom-mig-config.yaml 
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-disabled:
        - devices: all
          mig-enabled: false
      
      two-1g-one-2g:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 2
            "2g.20gb": 1

k apply -f custom-mig-config.yaml
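
A quick way to confirm the ConfigMap landed in the namespace the mig-manager watches (assuming the gpu-operator namespace used above):

kubectl -n gpu-operator get configmap custom-mig-config -o yaml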

--------------
Point the ClusterPolicy at the custom configuration

kubectl patch clusterpolicies.nvidia.com/cluster-policy \
    --type='json' \
    -p='[{"op":"replace", "path":"/spec/migManager/config/name", "value":"custom-mig-config"}]'
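
The patched field can be read back to confirm the ClusterPolicy now points at the custom ConfigMap (jsonpath is just one way to check):

kubectl get clusterpolicies.nvidia.com/cluster-policy \
    -o jsonpath='{.spec.migManager.config.name}'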

--------------
All the Pods are functioning properly

kn get po 
NAME                                                              READY   STATUS      RESTARTS      AGE
gpu-feature-discovery-tn58t                                       1/1     Running     0             3m4s
gpu-operator-1759128897-node-feature-discovery-gc-599fbd57lztkj   1/1     Running     1 (99m ago)   114m
gpu-operator-1759128897-node-feature-discovery-master-596bklwfh   1/1     Running     1 (99m ago)   114m
gpu-operator-1759128897-node-feature-discovery-worker-4cj2v       1/1     Running     1 (99m ago)   114m
gpu-operator-75dff77d5c-4cctc                                     1/1     Running     1 (99m ago)   114m
nvidia-container-toolkit-daemonset-cqpcm                          1/1     Running     0             4m21s
nvidia-cuda-validator-7nxmf                                       0/1     Completed   0             2m56s
nvidia-dcgm-exporter-kt58n                                        1/1     Running     0             3m4s
nvidia-device-plugin-daemonset-njhr8                              1/1     Running     0             3m4s
nvidia-driver-daemonset-7hl9h                                     1/1     Running     0             4m53s
nvidia-mig-manager-ztdbj                                          1/1     Running     0             13s    # restart
nvidia-operator-validator-jtl8t                                   1/1     Running     0             4m21s

--------------
Label the node

k label nodes n1 nvidia.com/mig.config=two-1g-one-2g --overwrite
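
The mig-manager tracks reconfiguration progress via the nvidia.com/mig.config.state node label (values such as pending, success, or failed); a simple way to watch it after relabeling:

kubectl get node n1 -o yaml | grep mig.config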

--------------
An error occurred

kn get po 
NAME                                                              READY   STATUS                  RESTARTS       AGE
gpu-feature-discovery-6rgp2                                       1/1     Running                 0              15s
gpu-operator-1759128897-node-feature-discovery-gc-599fbd57lztkj   1/1     Running                 1 (102m ago)   117m
gpu-operator-1759128897-node-feature-discovery-master-596bklwfh   1/1     Running                 1 (102m ago)   117m
gpu-operator-1759128897-node-feature-discovery-worker-4cj2v       1/1     Running                 1 (102m ago)   117m
gpu-operator-75dff77d5c-4cctc                                     1/1     Running                 1 (102m ago)   117m
nvidia-container-toolkit-daemonset-cqpcm                          1/1     Running                 0              7m7s
nvidia-cuda-validator-h7h8p                                       0/1     Init:CrashLoopBackOff   1 (12s ago)    13s    # error
nvidia-dcgm-exporter-z2jjp                                        1/1     Running                 0              15s
nvidia-device-plugin-daemonset-m9bct                              1/1     Running                 0              15s
nvidia-driver-daemonset-7hl9h                                     1/1     Running                 0              7m39s
nvidia-mig-manager-ztdbj                                          1/1     Running                 0              2m59s
nvidia-operator-validator-zg6kt                                   0/1     Init:2/4                0              16s
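
Listing the init container states of the stuck operator-validator shows which validation step is blocking (pod name taken from the listing above):

kubectl -n gpu-operator get pod nvidia-operator-validator-zg6kt \
    -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'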

--------------
kn describe po nvidia-cuda-validator-h7h8p
.....
Events:
  Type     Reason   Age                From     Message
  ----     ------   ----               ----     -------
  Normal   Pulled   38s (x4 over 74s)  kubelet  Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.4" already present on machine
  Normal   Created  38s (x4 over 74s)  kubelet  Created container: cuda-validation
  Normal   Started  38s (x4 over 74s)  kubelet  Started container cuda-validation
  Warning  BackOff  9s (x7 over 73s)   kubelet  Back-off restarting failed container cuda-validation in pod nvidia-cuda-validator-h7h8p_gpu-operator(e5d64f72-86c8-4c90-936a-aa59a005abba)
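
The Events only show the back-off; the actual failure message should be in the cuda-validation container log (container name taken from the events above):

kubectl -n gpu-operator logs nvidia-cuda-validator-h7h8p -c cuda-validation --previous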

--------------
kn exec -it nvidia-driver-daemonset-7hl9h -- nvidia-smi 
Mon Sep 29 09:07:14 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  |   00000000:02:00.0 Off |                   On |
| N/A   51C    P0             94W /  250W |       0MiB /  40960MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |              Shared Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |                Shared BAR1-Usage | SM     Unc| CE ENC  DEC  OFA  JPG |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  No MIG devices found                                                                   |
+-----------------------------------------------------------------------------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
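
nvidia-smi shows MIG mode enabled but no MIG devices, so the requested 1g/2g instances were apparently never created; the mig-manager log should explain why (pod name taken from the listing above, assuming a single container in that pod):

kubectl -n gpu-operator logs nvidia-mig-manager-ztdbj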


--------------
For comparison, everything works fine in single mode

k describe node n1 | grep Capacity -A 8
Capacity:
  cpu:                     20
  ephemeral-storage:       1966788624Ki
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  65231232Ki
  nvidia.com/gpu:          4
  nvidia.com/mig-1g.10gb:  4
  pods:                    110
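
For contrast, if the mixed strategy and the two-1g-one-2g layout were applied successfully, the node would be expected to advertise per-profile resources along these lines (an illustration of the expected shape, not captured output):

  nvidia.com/mig-1g.10gb:  2
  nvidia.com/mig-2g.20gb:  1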


Expected behavior
The GPU Operator and its validator Pods run normally after the custom MIG configuration is applied.

Environment:

  • GPU Operator Version: v25.3.4
  • OS: Ubuntu 24.04
  • Kernel Version: 6.8.0-84-generic
  • Container Runtime Version: containerd v2.1.4
  • Kubernetes Distro and Version: K8s v1.34.0
