Successfully overwrite the mig partition but cannot find the partition on node #1607

@we1yq

I changed the MIG config from all-3g to all-7g:
$ kubectl label node rtx1 nvidia.com/mig.config=all-7g.40gb --overwrite
node/rtx1 labeled
Checking with nvidia-smi shows that the MIG config was changed successfully:
$ nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-03ca4983-f693-39d2-d7e0-25090fe07b2f)
MIG 7g.40gb Device 0: (UUID: MIG-a28fecf6-35ba-56a6-aab8-2be643b31249)
GPU 1: NVIDIA TITAN RTX (UUID: GPU-21058121-c040-c847-712d-da7a5cf48e4b)
GPU 2: NVIDIA TITAN RTX (UUID: GPU-edc7db6f-0fec-bc09-9cbe-5a8d2598a62e)
GPU 3: NVIDIA GeForce RTX 3090 (UUID: GPU-1e09b62e-bae8-23dd-a55d-03b34ee00182)
However, the new partition does not show up on the node; only the old mig-3g.20gb resource is listed, with a count of 0:
$ kubectl describe node rtx1 | grep nvidia.com/mig
nvidia.com/mig.capable=true
nvidia.com/mig.config=all-7g.40gb
nvidia.com/mig.config.state=success
nvidia.com/mig.strategy=single
nvidia.com/mig-3g.20gb: 0
nvidia.com/mig-3g.20gb: 0
nvidia.com/mig-3g.20gb 0 0

$ kubectl get pod -n gpu-operator
NAME                                                         READY   STATUS             RESTARTS         AGE
gpu-feature-discovery-gb6f5                                  1/1     Running            0                5m8s
gpu-operator-5798b5b564-zw5tg                                1/1     Running            2 (139m ago)     179m
gpu-operator-node-feature-discovery-gc-86f6495b55-ntp72      1/1     Running            1 (154m ago)     179m
gpu-operator-node-feature-discovery-master-694467d5db-pddls  1/1     Running            2 (139m ago)     179m
gpu-operator-node-feature-discovery-worker-g89fd             1/1     Running            2 (139m ago)     179m
nvidia-container-toolkit-daemonset-96vnv                     1/1     Running            1 (154m ago)     167m
nvidia-cuda-validator-vcxkr                                  0/1     Completed          0                5m5s
nvidia-dcgm-exporter-sp7q9                                   1/1     Running            0                5m8s
nvidia-device-plugin-daemonset-pb98g                         0/1     CrashLoopBackOff   5 (112s ago)     5m8s
nvidia-mig-manager-kcp8m                                     1/1     Running            1 (154m ago)     177m
nvidia-operator-validator-m8xkj                              0/1     Init:3/4           1 (2m29s ago)    5m9s
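
In the listing above, two pods are not healthy: the device plugin (CrashLoopBackOff) and the operator validator (stuck in Init:3/4). A small filter can pick such pods out of a listing like this one. This is only a minimal sketch: `unhealthy_pods` is a hypothetical helper name, not part of kubectl or the GPU operator, and it assumes the default `kubectl get pod` column order (NAME, READY, STATUS).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of kubectl or the GPU operator).
# Reads `kubectl get pod` output on stdin and prints NAME and STATUS
# for every pod whose STATUS is neither Running nor Completed.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk '$1 != "NAME" && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Example usage against a live cluster (left commented out so that
# sourcing this file has no side effects):
#   kubectl get pod -n gpu-operator | unhealthy_pods
#   kubectl logs -n gpu-operator nvidia-device-plugin-daemonset-pb98g --previous
```

The second commented command fetches the logs of the previous (crashed) container of the device plugin pod. Those logs are probably the first place to look, since the device plugin is the component that advertises the nvidia.com/mig-* resources on the node.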
