Issue with GPU sharing on NVIDIA L40S #522

@barbarian23

What happened?

Hi,
I am using the KAI scheduler on my EKS cluster. My instance type is g6e, which uses the NVIDIA L40S.
I would like to use GPU sharing to split the GPU resource in half so that 2 AI agent pods can run on one node.
I configured the queues like this:

apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: default
spec:
  resources:
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: test
spec:
  parentQueue: default
  resources:
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1

This is my test pod:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-sharing-01
  labels:
    kai.scheduler/queue: test
  annotations:
    gpu-fraction: "0.5"
spec:
  schedulerName: kai-scheduler
  tolerations:     
    - effect: NoSchedule
      key: ai-type
      operator: Equal
      value: strong
    - effect: NoExecute
      key: ai-type
      operator: Equal
      value: strong
    - effect: NoSchedule  
      key: nvidia.com/gpu
      value: "true"
  containers:
    - name: ubuntu
      image: ubuntu
      args: ["sleep", "infinity"]      

But the pod gets stuck in the Pending state. When I describe the pod, I get this error:

  ERROR   Reached timeout while waiting for GPU reservation pod to be allocated   {"controller": "bindrequest", "controllerGroup": "scheduling.run.ai", "controllerKind": "BindRequest", "BindRequest": {"name":"gpu-sharing-01","namespace":"test"}, "namespace": "test", "name": "gpu-sharing-01", "reconcileID": "41b3665e-0a46-4730-b84a-abf0da9bd1ed", "nodeName": "ip-10-0-3-95.eu-central-1.compute.internal", "name": "gpu-reservation-ip-10-0-3-95.eu-central-1.compute.internal-4vhm5", "error": "timeout"}
github.com/NVIDIA/KAI-scheduler/pkg/binder/binding/resourcereservation.(*service).waitForGPUReservationPodAllocation
        /local/pkg/binder/binding/resourcereservation/resource_reservation.go:424
github.com/NVIDIA/KAI-scheduler/pkg/binder/binding/resourcereservation.(*service).createGPUReservationPodAndGetIndex
        /local/pkg/binder/binding/resourcereservation/resource_reservation.go:333
github.com/NVIDIA/KAI-scheduler/pkg/binder/binding/resourcereservation.(*service).acquireGPUIndexByGroup
        /local/pkg/binder/binding/resourcereservation/resource_reservation.go:299
github.com/NVIDIA/KAI-scheduler/pkg/binder/binding/resourcereservation.(*service).ReserveGpuDevice
        /local/pkg/binder/binding/resourcereservation/resource_reservation.go:209
github.com/NVIDIA/KAI-scheduler/pkg/binder/binding.(*Binder).reserveGPUs
        /local/pkg/binder/binding/binder.go:101
github.com/NVIDIA/KAI-scheduler/pkg/binder/binding.(*Binder).Bind
        /local/pkg/binder/binding/binder.go:53
github.com/NVIDIA/KAI-scheduler/pkg/binder/controllers.(*BindRequestReconciler).Reconcile
        /local/pkg/binder/controllers/bindrequest_controller.go:155
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:334
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:294
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:255

Does GPU sharing only work for GPUs that support MIG?
If the answer to the above question is yes, could you suggest a solution for this case? We don't have enough budget to use higher-end NVIDIA GPUs like the H100 or H200.

What did you expect to happen?

We are using the NVIDIA L40S and expect to deploy 2 pods onto one node.

Environment

  • Kubernetes version: v1.33.1
  • KAI Scheduler version: v0.9.2
  • Cloud provider or hardware configuration: AWS Elastic Kubernetes Service
  • Tools that you are using KAI together with: Helm, Nvidia GPU-Operator
  • Anything else that is relevant: None
