
CA scales GPU node from 0 only the first time it starts and then it reports insufficient nvidia.com/gpu #8799

@rastakajakwanna

Description

Which component are you using?:

/area cluster-autoscaler

What version of the component are you using?:

1.32.1 (also tested on 1.32.4 and 1.34.1)

Component version:

  • registry.k8s.io/autoscaling/cluster-autoscaler:v1.32.1
  • helm.sh/chart: cluster-autoscaler-9.52.1

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: v1.32.9-eks-113cf36
Kustomize Version: v5.5.0
Server Version: v1.32.9-eks-113cf36

What environment is this in?:

AWS EKS

What did you expect to happen?:

Reliable scale-out of the EKS managed worker node group from 0 every time a pod that depends on it is scheduled.

What happened instead?:

This could be a follow-up to #4893.
CA scales out the EKS managed worker group from 0 only the first time. After it scales back in to 0, it never scales out again.

Predicate "NodeResourcesFit" fails due to predicateReasons=[Insufficient nvidia.com/gpu].

How to reproduce it (as minimally and precisely as possible):

  • have the CA deployment scaled to 0 (verbosity 4 is enough; I tried up to 9 and there is no extra detail that would explain the behavior)
  • have an EKS worker node group with min 0 and max 1 (we use a launch template to tag EBS volumes, NICs and instances, not the GPU; this should not change anything, because CA does not read launch templates for some reason even though the IAM policy allows it), and select only one subnet for this group - pretend that you have to use a PV claim with a specific EBS ARN; you don't actually need that to reproduce the issue, it only explains why a single subnet is set
  • meet all tag and label requirements (see details below)
  • scale the CA deployment out to 1 and let it start
  • kubectl apply -f pod-gpu.yaml
cat pod-gpu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  nodeSelector:
    nodetype: gpu
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04
    command: ['/bin/sh', '-c']
    args: ['tail -f /dev/null']
    resources:
      limits:
        nvidia.com/gpu: "1"
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
  • kubectl delete -f pod-gpu.yaml
  • wait until CA scales the GPU node down, or help it along by setting the ASG desired count back to 0 yourself (see the AWS CLI sketch after this list)
  • kubectl apply -f pod-gpu.yaml again
  • observe the problem:
kubectl -n kube-system logs --tail=20 -f deployment/cluster-autoscaler-aws-cluster-autoscaler | jq . | egrep "gpu|nvidia"
"msg": "failed to find place for xxx/nvidia-smi: can't schedule pod xxx/nvidia-smi: couldn't find a matching Node with passing predicates",
"msg": "Pod xxx/nvidia-smi is unschedulable",
"msg": "Pod xxx/nvidia-smi can't be scheduled on eks-gpu-3e3b-44cd2a5d-a57a-9950-1e9b-dbfccb896682, predicate checking error: can't schedule pod xxx/nvidia-smi: predicate \"NodeResourcesFit\" didn't pass (predicateReasons=[Insufficient nvidia.com/gpu]; debugInfo=nodeName: \"template-node-for-eks-gpu-3e3b-44cd2a5d-a57a-9950-1e9b-dbfccb896682-6649836023768235924\")",
  • run kubectl -n kube-system rollout restart deployment/cluster-autoscaler-aws-cluster-autoscaler and observe that scale-out works again
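
A minimal sketch of the manual scale-in mentioned above, assuming the AWS CLI is configured for the account and using the ASG name from the logs (adjust both to your environment):

# Force the group back to 0 instances instead of waiting for CA to scale it down.
ASG_NAME="eks-gpu-3e3b-44cd2a5d-a57a-9950-1e9b-dbfccb896682"
aws autoscaling set-desired-capacity --auto-scaling-group-name "$ASG_NAME" --desired-capacity 0
# Re-apply the test pod and watch the CA logs for the failing predicate.
kubectl apply -f pod-gpu.yaml
kubectl -n kube-system logs -f deployment/cluster-autoscaler-aws-cluster-autoscaler | egrep "gpu|nvidia"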

Labels and tags:

  • k8s labels:

    • nodetype = "gpu"
    • "nvidia.com/gpu" = "true"
    • "nvidia.com/gpu.present" = "true"
    • "topology.ebs.csi.aws.com/zone" = "eu-central-1b"
    • "az" = "az1"
  • worker node group and ASG tags (a CLI sketch for applying these follows the list):

    • "k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone" = "eu-central-1b"
    • "k8s.io/cluster-autoscaler/node-template/label/az" = "az1"
    • "k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu" = "true"
    • "k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu.present" = "true"
    • "k8s.io/cluster-autoscaler/node-template/taint/nvidia.com/gpu" = "gpu:NoSchedule"
    • "k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu" = "1"
    • "k8s.io/cluster-autoscaler/node-template/label/nodetype" = "gpu"

Anything else we need to know?:

  • This setup used to work; it just isn't clear when exactly it stopped working, because our recent platform upgrade depended on plenty of things.
    Instead of going backwards, I decided to make sure we have everything in place for CA to scale out from zero. I believe I've met the requirements, and CA does scale out the right worker group, but it does so only once. It isn't clear to me what the problem is when it was already capable of scaling out the first time.
  • EC2 capacity is available (the request is for on-demand); I can manually set the desired size, which lets Kubernetes schedule the pod as expected. CA can then scale back in, but scale-out still does not work afterwards.
  • I believe all requirements described in the documentation are met.
    • The ASG is mentioned in the CA log and its cache is regularly updated
See the log below (captured after I manually scaled the ASG out) as proof:
  "msg": "Node ip-X-0-X-X.eu-central-1.compute.internal unremovable: nvidia.com/gpu requested (100% of allocatable) is above the scale-down utilization threshold",
  "msg": "Node ip-X-0-X-X.eu-central-1.compute.internal unremovable: nvidia.com/gpu requested (100% of allocatable) is above the scale-down utilization threshold",
  "msg": "Node ip-X-0-X-X.eu-central-1.compute.internal cannot be removed: xxx/nvidia-smi is not replicated",
  "msg": "Updated ASG cache for eks-gpu-3e3b-44cd2a5d-a57a-9950-1e9b-dbfccb896682. min/max/current is 0/1/1",
  "msg": "Updated ASG cache for eks-gpu-3e3b-44cd2a5d-a57a-9950-1e9b-dbfccb896682. min/max/current is 0/1/1",
  • These tags and labels are clearly set; see https://github.com/kubernetes/autoscaler/issues/3869#issuecomment-825512767 (a verification sketch follows below)
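
To double-check from the AWS side, a quick sketch for listing the node-template tags the ASG actually carries (ASG name as in the logs; the JMESPath query is just one way to filter the output):

aws autoscaling describe-tags \
  --filters "Name=auto-scaling-group,Values=eks-gpu-3e3b-44cd2a5d-a57a-9950-1e9b-dbfccb896682" \
  --query "Tags[?starts_with(Key, 'k8s.io/cluster-autoscaler/node-template/')].[Key,Value]" \
  --output table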
