Labels: bug (Something isn't working)
Description
What happened?
Hi,
I am using the KAI scheduler on my EKS cluster. My instance type is g6e, which uses NVIDIA L40S GPUs.
I would like to use GPU sharing to split the GPU resource in half so that two AI agent pods can run on a single node.
I configured the queues like this:
```yaml
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: default
spec:
  resources:
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: test
spec:
  parentQueue: default
  resources:
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
```
This is my test pod:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-sharing-01
  labels:
    kai.scheduler/queue: test
  annotations:
    gpu-fraction: "0.5"
spec:
  schedulerName: kai-scheduler
  tolerations:
    - effect: NoSchedule
      key: ai-type
      operator: Equal
      value: strong
    - effect: NoExecute
      key: ai-type
      operator: Equal
      value: strong
    - effect: NoSchedule
      key: nvidia.com/gpu
      value: "true"
  containers:
    - name: ubuntu
      image: ubuntu
      args: ["sleep", "infinity"]
```
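Since the goal is two AI agent pods sharing one GPU, the second pod would be identical apart from its name. A minimal sketch (the name `gpu-sharing-02` is illustrative, not from my actual cluster):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-sharing-02          # hypothetical second pod sharing the same GPU
  labels:
    kai.scheduler/queue: test   # same queue as gpu-sharing-01
  annotations:
    gpu-fraction: "0.5"         # requests the other half of the GPU
spec:
  schedulerName: kai-scheduler
  tolerations:
    - effect: NoSchedule
      key: ai-type
      operator: Equal
      value: strong
    - effect: NoExecute
      key: ai-type
      operator: Equal
      value: strong
    - effect: NoSchedule
      key: nvidia.com/gpu
      value: "true"
  containers:
    - name: ubuntu
      image: ubuntu
      args: ["sleep", "infinity"]
```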
But the pod gets stuck in the Pending state. When I describe the pod, I get this error:
```
ERROR Reached timeout while waiting for GPU reservation pod to be allocated {"controller": "bindrequest", "controllerGroup": "scheduling.run.ai", "controllerKind": "BindRequest", "BindRequest": {"name":"gpu-sharing-01","namespace":"test"}, "namespace": "test", "name": "gpu-sharing-01", "reconcileID": "41b3665e-0a46-4730-b84a-abf0da9bd1ed", "nodeName": "ip-10-0-3-95.eu-central-1.compute.internal", "name": "gpu-reservation-ip-10-0-3-95.eu-central-1.compute.internal-4vhm5", "error": "timeout"}
github.com/NVIDIA/KAI-scheduler/pkg/binder/binding/resourcereservation.(*service).waitForGPUReservationPodAllocation
	/local/pkg/binder/binding/resourcereservation/resource_reservation.go:424
github.com/NVIDIA/KAI-scheduler/pkg/binder/binding/resourcereservation.(*service).createGPUReservationPodAndGetIndex
	/local/pkg/binder/binding/resourcereservation/resource_reservation.go:333
github.com/NVIDIA/KAI-scheduler/pkg/binder/binding/resourcereservation.(*service).acquireGPUIndexByGroup
	/local/pkg/binder/binding/resourcereservation/resource_reservation.go:299
github.com/NVIDIA/KAI-scheduler/pkg/binder/binding/resourcereservation.(*service).ReserveGpuDevice
	/local/pkg/binder/binding/resourcereservation/resource_reservation.go:209
github.com/NVIDIA/KAI-scheduler/pkg/binder/binding.(*Binder).reserveGPUs
	/local/pkg/binder/binding/binder.go:101
github.com/NVIDIA/KAI-scheduler/pkg/binder/binding.(*Binder).Bind
	/local/pkg/binder/binding/binder.go:53
github.com/NVIDIA/KAI-scheduler/pkg/binder/controllers.(*BindRequestReconciler).Reconcile
	/local/pkg/binder/controllers/bindrequest_controller.go:155
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:334
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:294
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:255
```
Does GPU sharing only work on GPUs that support MIG?
If the answer to the question above is yes, could you suggest a solution for this case? We don't have enough budget for higher-end NVIDIA GPUs like the H100 or H200.
What did you expect to happen?
We are using NVIDIA L40S GPUs and expect to deploy two pods onto a single node.
Environment
- Kubernetes version: v1.33.1
- KAI Scheduler version: v0.9.2
- Cloud provider or hardware configuration: AWS Elastic Kubernetes Service
- Tools that you are using KAI together with: Helm, NVIDIA GPU Operator
- Anything else that is relevant: None