Should simultaneous allocation of a GPU and its MIG partitions be allowed? #712
Once a Pod has already been allocated a MIG partition for a GPU, should another Pod be able to claim the whole parent GPU?

In my cluster I have two nodes, each with two A100 GPUs. One node is MIG-disabled and the other has both GPUs divided into 4 partitions each. Workloads that consume all of the MIG partitions on the MIG node and all of the GPUs on both nodes are able to schedule, whereas I'd expect allocating either all of the MIGs or all of the GPUs on the MIG node to fail, since they're the same underlying hardware.

workloads.yaml

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: all-migs
spec:
  spec:
    devices:
      requests:
      - name: migs
        exactly:
          deviceClassName: mig.nvidia.com
          allocationMode: All
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: all-gpus
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          allocationMode: All
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: all-migs
spec:
  replicas: 1
  selector:
    matchLabels:
      app: all-migs
  template:
    metadata:
      labels:
        app: all-migs
    spec:
      containers:
      - name: ctr
        image: ubuntu:22.04
        command: ["bash", "-c"]
        args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
        resources:
          claims:
          - name: migs
      resourceClaims:
      - name: migs
        resourceClaimTemplateName: all-migs
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: all-gpus
spec:
  replicas: 2
  selector:
    matchLabels:
      app: all-gpus
  template:
    metadata:
      labels:
        app: all-gpus
    spec:
      containers:
      - name: ctr
        image: ubuntu:22.04
        command: ["bash", "-c"]
        args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
        resources:
          claims:
          - name: gpus
      resourceClaims:
      - name: gpus
        resourceClaimTemplateName: all-gpus

resourceslices.yaml

apiVersion: v1
items:
- apiVersion: resource.k8s.io/v1
  kind: ResourceSlice
  metadata:
    creationTimestamp: "2025-11-05T17:31:13Z"
    generateName: aks-gpupool-15127565-vmss000002-gpu.nvidia.com-
    generation: 1
    name: aks-gpupool-15127565-vmss000002-gpu.nvidia.com-sdzwp
    ownerReferences:
    - apiVersion: v1
      controller: true
      kind: Node
      name: aks-gpupool-15127565-vmss000002
      uid: 8816457f-9747-456c-86fc-6dd6fbaee6bb
    resourceVersion: "512490"
    uid: 32e7cc60-f7b0-42db-a030-d78a7004d0cb
  spec:
    devices:
    - attributes:
        architecture:
          string: Ampere
        brand:
          string: Nvidia
        cudaComputeCapability:
          version: 8.0.0
        cudaDriverVersion:
          version: 13.0.0
        driverVersion:
          version: 580.95.5
        pcieBusID:
          string: "0001:00:00.0"
        productName:
          string: NVIDIA A100 80GB PCIe
        type:
          string: gpu
        uuid:
          string: GPU-199490ab-88da-c1e2-0f96-631264c2f512
      capacity:
        memory:
          value: 80Gi
      name: gpu-0
    - attributes:
        architecture:
          string: Ampere
        brand:
          string: Nvidia
        cudaComputeCapability:
          version: 8.0.0
        cudaDriverVersion:
          version: 13.0.0
        driverVersion:
          version: 580.95.5
        pcieBusID:
          string: "0002:00:00.0"
        productName:
          string: NVIDIA A100 80GB PCIe
        type:
          string: gpu
        uuid:
          string: GPU-13719f71-412c-d11e-b236-a262f530a4fc
      capacity:
        memory:
          value: 80Gi
      name: gpu-1
    driver: gpu.nvidia.com
    nodeName: aks-gpupool-15127565-vmss000002
    pool:
      generation: 1
      name: aks-gpupool-15127565-vmss000002
      resourceSliceCount: 1
- apiVersion: resource.k8s.io/v1
  kind: ResourceSlice
  metadata:
    creationTimestamp: "2025-11-05T17:31:23Z"
    generateName: aks-migpool-68842551-vmss000001-gpu.nvidia.com-
    generation: 1
    name: aks-migpool-68842551-vmss000001-gpu.nvidia.com-22vnt
    ownerReferences:
    - apiVersion: v1
      controller: true
      kind: Node
      name: aks-migpool-68842551-vmss000001
      uid: 728d24f6-eb2a-4ffb-9364-1de88566f990
    resourceVersion: "512562"
    uid: f859ed88-842a-46a0-8bd0-085ee5d3353d
  spec:
    devices:
    - attributes:
        architecture:
          string: Ampere
        brand:
          string: Nvidia
        cudaComputeCapability:
          version: 8.0.0
        cudaDriverVersion:
          version: 13.0.0
        driverVersion:
          version: 580.95.5
        parentUUID:
          string: GPU-ddf64803-f421-0e5b-883e-cd752de16db7
        pcieBusID:
          string: "0001:00:00.0"
        productName:
          string: NVIDIA A100 80GB PCIe
        profile:
          string: 3g.40gb
        type:
          string: mig
        uuid:
          string: MIG-e774da9b-cceb-5e61-91c3-3b1f58636990
      capacity:
        copyEngines:
          value: "3"
        decoders:
          value: "2"
        encoders:
          value: "0"
        jpegEngines:
          value: "0"
        memory:
          value: 40192Mi
        memorySlice4:
          value: "1"
        memorySlice5:
          value: "1"
        memorySlice6:
          value: "1"
        memorySlice7:
          value: "1"
        multiprocessors:
          value: "42"
        ofaEngines:
          value: "0"
      name: gpu-0-mig-9-4-4
    - attributes:
        architecture:
          string: Ampere
        brand:
          string: Nvidia
        cudaComputeCapability:
          version: 8.0.0
        cudaDriverVersion:
          version: 13.0.0
        driverVersion:
          version: 580.95.5
        parentUUID:
          string: GPU-ddf64803-f421-0e5b-883e-cd752de16db7
        pcieBusID:
          string: "0001:00:00.0"
        productName:
          string: NVIDIA A100 80GB PCIe
        profile:
          string: 2g.20gb
        type:
          string: mig
        uuid:
          string: MIG-b8e88dd7-8b46-5717-91e6-71f276dd31dc
      capacity:
        copyEngines:
          value: "2"
        decoders:
          value: "1"
        encoders:
          value: "0"
        jpegEngines:
          value: "0"
        memory:
          value: 19968Mi
        memorySlice0:
          value: "1"
        memorySlice1:
          value: "1"
        multiprocessors:
          value: "28"
        ofaEngines:
          value: "0"
      name: gpu-0-mig-14-0-2
    - attributes:
        architecture:
          string: Ampere
        brand:
          string: Nvidia
        cudaComputeCapability:
          version: 8.0.0
        cudaDriverVersion:
          version: 13.0.0
        driverVersion:
          version: 580.95.5
        pcieBusID:
          string: "0002:00:00.0"
        productName:
          string: NVIDIA A100 80GB PCIe
        type:
          string: gpu
        uuid:
          string: GPU-61572322-0336-adb6-f996-0ae0cf40abcc
      capacity:
        memory:
          value: 80Gi
      name: gpu-1
    - attributes:
        architecture:
          string: Ampere
        brand:
          string: Nvidia
        cudaComputeCapability:
          version: 8.0.0
        cudaDriverVersion:
          version: 13.0.0
        driverVersion:
          version: 580.95.5
        parentUUID:
          string: GPU-61572322-0336-adb6-f996-0ae0cf40abcc
        pcieBusID:
          string: "0002:00:00.0"
        productName:
          string: NVIDIA A100 80GB PCIe
        profile:
          string: 2g.20gb
        type:
          string: mig
        uuid:
          string: MIG-be872654-574b-558c-bcb1-fb5d5a9891ee
      capacity:
        copyEngines:
          value: "2"
        decoders:
          value: "1"
        encoders:
          value: "0"
        jpegEngines:
          value: "0"
        memory:
          value: 19968Mi
        memorySlice0:
          value: "1"
        memorySlice1:
          value: "1"
        multiprocessors:
          value: "28"
        ofaEngines:
          value: "0"
      name: gpu-1-mig-14-0-2
    - attributes:
        architecture:
          string: Ampere
        brand:
          string: Nvidia
        cudaComputeCapability:
          version: 8.0.0
        cudaDriverVersion:
          version: 13.0.0
        driverVersion:
          version: 580.95.5
        pcieBusID:
          string: "0001:00:00.0"
        productName:
          string: NVIDIA A100 80GB PCIe
        type:
          string: gpu
        uuid:
          string: GPU-ddf64803-f421-0e5b-883e-cd752de16db7
      capacity:
        memory:
          value: 80Gi
      name: gpu-0
    - attributes:
        architecture:
          string: Ampere
        brand:
          string: Nvidia
        cudaComputeCapability:
          version: 8.0.0
        cudaDriverVersion:
          version: 13.0.0
        driverVersion:
          version: 580.95.5
        parentUUID:
          string: GPU-ddf64803-f421-0e5b-883e-cd752de16db7
        pcieBusID:
          string: "0001:00:00.0"
        productName:
          string: NVIDIA A100 80GB PCIe
        profile:
          string: 1g.10gb
        type:
          string: mig
        uuid:
          string: MIG-d74574ee-26ab-5d71-ad7a-dc6c515ac642
      capacity:
        copyEngines:
          value: "1"
        decoders:
          value: "0"
        encoders:
          value: "0"
        jpegEngines:
          value: "0"
        memory:
          value: 9728Mi
        memorySlice2:
          value: "1"
        multiprocessors:
          value: "14"
        ofaEngines:
          value: "0"
      name: gpu-0-mig-19-2-1
    - attributes:
        architecture:
          string: Ampere
        brand:
          string: Nvidia
        cudaComputeCapability:
          version: 8.0.0
        cudaDriverVersion:
          version: 13.0.0
        driverVersion:
          version: 580.95.5
        parentUUID:
          string: GPU-ddf64803-f421-0e5b-883e-cd752de16db7
        pcieBusID:
          string: "0001:00:00.0"
        productName:
          string: NVIDIA A100 80GB PCIe
        profile:
          string: 1g.10gb
        type:
          string: mig
        uuid:
          string: MIG-6a0297e8-9708-515f-99e2-f111fc87221d
      capacity:
        copyEngines:
          value: "1"
        decoders:
          value: "0"
        encoders:
          value: "0"
        jpegEngines:
          value: "0"
        memory:
          value: 9728Mi
        memorySlice3:
          value: "1"
        multiprocessors:
          value: "14"
        ofaEngines:
          value: "0"
      name: gpu-0-mig-19-3-1
    - attributes:
        architecture:
          string: Ampere
        brand:
          string: Nvidia
        cudaComputeCapability:
          version: 8.0.0
        cudaDriverVersion:
          version: 13.0.0
        driverVersion:
          version: 580.95.5
        parentUUID:
          string: GPU-61572322-0336-adb6-f996-0ae0cf40abcc
        pcieBusID:
          string: "0002:00:00.0"
        productName:
          string: NVIDIA A100 80GB PCIe
        profile:
          string: 1g.10gb
        type:
          string: mig
        uuid:
          string: MIG-4e3c5199-f362-5ed3-82e8-58b19759fc12
      capacity:
        copyEngines:
          value: "1"
        decoders:
          value: "0"
        encoders:
          value: "0"
        jpegEngines:
          value: "0"
        memory:
          value: 9728Mi
        memorySlice2:
          value: "1"
        multiprocessors:
          value: "14"
        ofaEngines:
          value: "0"
      name: gpu-1-mig-19-2-1
    - attributes:
        architecture:
          string: Ampere
        brand:
          string: Nvidia
        cudaComputeCapability:
          version: 8.0.0
        cudaDriverVersion:
          version: 13.0.0
        driverVersion:
          version: 580.95.5
        parentUUID:
          string: GPU-61572322-0336-adb6-f996-0ae0cf40abcc
        pcieBusID:
          string: "0002:00:00.0"
        productName:
          string: NVIDIA A100 80GB PCIe
        profile:
          string: 1g.10gb
        type:
          string: mig
        uuid:
          string: MIG-96aa58ce-6075-57da-87a4-bf1c3b5ade79
      capacity:
        copyEngines:
          value: "1"
        decoders:
          value: "0"
        encoders:
          value: "0"
        jpegEngines:
          value: "0"
        memory:
          value: 9728Mi
        memorySlice3:
          value: "1"
        multiprocessors:
          value: "14"
        ofaEngines:
          value: "0"
      name: gpu-1-mig-19-3-1
    - attributes:
        architecture:
          string: Ampere
        brand:
          string: Nvidia
        cudaComputeCapability:
          version: 8.0.0
        cudaDriverVersion:
          version: 13.0.0
        driverVersion:
          version: 580.95.5
        parentUUID:
          string: GPU-61572322-0336-adb6-f996-0ae0cf40abcc
        pcieBusID:
          string: "0002:00:00.0"
        productName:
          string: NVIDIA A100 80GB PCIe
        profile:
          string: 3g.40gb
        type:
          string: mig
        uuid:
          string: MIG-a642816e-3a0d-529c-9263-00b38d842101
      capacity:
        copyEngines:
          value: "3"
        decoders:
          value: "2"
        encoders:
          value: "0"
        jpegEngines:
          value: "0"
        memory:
          value: 40192Mi
        memorySlice4:
          value: "1"
        memorySlice5:
          value: "1"
        memorySlice6:
          value: "1"
        memorySlice7:
          value: "1"
        multiprocessors:
          value: "42"
        ofaEngines:
          value: "0"
      name: gpu-1-mig-9-4-4
    driver: gpu.nvidia.com
    nodeName: aks-migpool-68842551-vmss000001
    pool:
      generation: 1
      name: aks-migpool-68842551-vmss000001
      resourceSliceCount: 1
kind: List
metadata:
  resourceVersion: ""

I encountered this scenario on AKS with Kubernetes v1.34.0 using only GA DRA features. The cluster has GPU Operator v25.10.0 (with the device plugin disabled) and v25.8.0 of the NVIDIA DRA driver (with GPUs enabled).
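For anyone who wants to check their own slices for the same overlap, here is a rough sketch in Python. It assumes the attribute layout shown in the ResourceSlice listing above (`type`, `uuid`, `parentUUID`) and pairs each full-GPU device with any MIG devices whose `parentUUID` points at it. The helper names `attr` and `find_gpu_mig_overlaps` are made up for illustration and are not part of any driver or Kubernetes API. The sample list is a trimmed copy of four devices from the listing, not live cluster data.

```python
from collections import defaultdict


def attr(device, key):
    """Return the string value of a ResourceSlice device attribute, or None."""
    return device.get("attributes", {}).get(key, {}).get("string")


def find_gpu_mig_overlaps(devices):
    """Map each full-GPU UUID to the names of MIG devices carved from that GPU."""
    migs_by_parent = defaultdict(list)
    for dev in devices:
        if attr(dev, "type") == "mig":
            migs_by_parent[attr(dev, "parentUUID")].append(dev["name"])
    # Keep only GPUs that are advertised alongside at least one of their MIGs.
    return {
        attr(dev, "uuid"): sorted(migs_by_parent[attr(dev, "uuid")])
        for dev in devices
        if attr(dev, "type") == "gpu" and attr(dev, "uuid") in migs_by_parent
    }


# Trimmed sample: one MIG-enabled GPU with two of its partitions,
# plus one GPU from the MIG-disabled node (no partitions).
MIG_PARENT = "GPU-ddf64803-f421-0e5b-883e-cd752de16db7"
devices = [
    {"name": "gpu-0",
     "attributes": {"type": {"string": "gpu"}, "uuid": {"string": MIG_PARENT}}},
    {"name": "gpu-0-mig-9-4-4",
     "attributes": {"type": {"string": "mig"}, "parentUUID": {"string": MIG_PARENT}}},
    {"name": "gpu-0-mig-14-0-2",
     "attributes": {"type": {"string": "mig"}, "parentUUID": {"string": MIG_PARENT}}},
    {"name": "gpu-1",
     "attributes": {"type": {"string": "gpu"},
                    "uuid": {"string": "GPU-13719f71-412c-d11e-b236-a262f530a4fc"}}},
]

overlaps = find_gpu_mig_overlaps(devices)
print(overlaps)
```

A non-empty result means the same physical GPU is reachable both as a whole device and through its partitions, which is the situation described above. This only reports the overlap; it does not stop the scheduler from allocating both.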
Replies: 1 comment 4 replies
This should not happen. Definitely a bug.