Description
Which component are you using?:
/area cluster-autoscaler
What version of the component are you using?:
1.32.1 (tested also on 1.32.4 and 1.34.1)
Component version:
registry.k8s.io/autoscaling/cluster-autoscaler:v1.32.1
helm.sh/chart: cluster-autoscaler-9.52.1
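As a quick sanity check, the image above can be read back from the running deployment (the deployment name is the one that also appears in the log commands further below):
# prints the container image of the CA deployment
kubectl -n kube-system get deployment cluster-autoscaler-aws-cluster-autoscaler \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'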
What k8s version are you using (kubectl version)?:
kubectl version Output
$ kubectl version
Client Version: v1.32.9-eks-113cf36
Kustomize Version: v5.5.0
Server Version: v1.32.9-eks-113cf36
What environment is this in?:
AWS EKS
What did you expect to happen?:
Reliable scale-out of the EKS managed worker node group from 0 every time a dependent pod is scheduled.
What happened instead?:
This could be a follow-up to #4893.
CA scales the EKS managed worker group out from 0 only the first time. After it scales back in to 0, it never scales out again.
Predicate "NodeResourcesFit" fails due to predicateReasons=[Insufficient nvidia.com/gpu].
How to reproduce it (as minimally and precisely as possible):
- have the CA deployment scaled to 0 (verbosity 4 is enough; I tried it up to 9 and there is no extra detail that would explain the behavior)
- have an EKS worker node group with min 0 and max 1 (we use a launch template in order to tag EBS volumes, NICs and instances, not the GPU; this should not change anything because CA does not read launch templates for some reason, although the IAM policy allows it), and select only one subnet for this group - pretend that you have to use a PV claim with a specific EBS ARN. You don't actually need that to reproduce the issue; it just explains why only one subnet is set.
- meet all tag and label requirements (see details below)
- scale the CA deployment out to 1 and let it start
- kubectl apply -f pod-gpu.yaml
cat pod-gpu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  nodeSelector:
    nodetype: gpu
  restartPolicy: OnFailure
  containers:
    - name: nvidia-smi
      image: nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04
      command: ['/bin/sh', '-c']
      args: ['tail -f /dev/null']
      resources:
        limits:
          nvidia.com/gpu: "1"
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
- kubectl delete -f pod-gpu.yaml
- wait until CA scales the GPU node down, or help it yourself by setting the desired count back to 0
- kubectl apply -f pod-gpu.yaml again
- observe problems:
kubectl -n kube-system logs --tail=20 -f deployment/cluster-autoscaler-aws-cluster-autoscaler | jq . | egrep "gpu|nvidia"
"msg": "failed to find place for xxx/nvidia-smi: can't schedule pod xxx/nvidia-smi: couldn't find a matching Node with passing predicates", "msg": "Pod xxx/nvidia-smi is unschedulable", "msg": "Pod xxx/nvidia-smi can't be scheduled on eks-gpu-3e3b-44cd2a5d-a57a-9950-1e9b-dbfccb896682, predicate checking error: can't schedule pod xxx/nvidia-smi: predicate \"NodeResourcesFit\" didn't pass (predicateReasons=[Insufficient nvidia.com/gpu]; debugInfo=nodeName: \"template-node-for-eks-gpu-3e3b-44cd2a5d-a57a-9950-1e9b-dbfccb896682-6649836023768235924\")",
- run kubectl -n kube-system rollout restart deployment/cluster-autoscaler-aws-cluster-autoscaler and observe it works again (see also the optional ASG check after these steps)
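Optionally, to confirm that no scale-out request reaches the ASG at all on the second attempt, the ASG activity history can be watched in parallel with the steps above. The ASG name is the one reported in the CA logs; the region is an assumption based on the eu-central-1b zone used here:
# expect no new scale-out activity to appear while the second attempt fails
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name eks-gpu-3e3b-44cd2a5d-a57a-9950-1e9b-dbfccb896682 \
  --region eu-central-1 --max-items 5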
Labels and tags:
- k8s labels:
- nodetype = "gpu"
- "nvidia.com/gpu" = "true"
- "nvidia.com/gpu.present" = "true"
- "topology.ebs.csi.aws.com/zone" = "eu-central-1b"
- "az" = "az1"
- worker node group and ASG tags (a verification sketch follows this list):
- "k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone" = "eu-central-1b"
- "k8s.io/cluster-autoscaler/node-template/label/az" = "az1"
- "k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu" = "true"
- "k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu.present" = "true"
- "k8s.io/cluster-autoscaler/node-template/taint/nvidia.com/gpu" = "gpu:NoSchedule"
- "k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu" = "1"
- "k8s.io/cluster-autoscaler/node-template/label/nodetype" = "gpu"
Anything else we need to know?:
- This setup used to work; it just isn't clear exactly when it stopped working, because our recent platform upgrade changed plenty of things at once.
Instead of going backwards, I decided to make sure we have everything in place for CA to scale out from zero properly. I believe I have met the requirements, and indeed CA scales out the right worker group - it just does so only once. It isn't clear to me what the problem is if it was already capable of scaling out the first time.
- EC2 capacity is available (the request is for on-demand); I can manually set the desired size (see the CLI sketch at the end of this section), which allows Kubernetes to schedule the pod as expected. CA can then scale back in, but scale-out still does not work.
- I believe all requirements described in the documentation are met.
- The ASG is mentioned in the CA log and its cache is regularly updated.
See the following log as proof from when I manually scale the ASG out:
"msg": "Node ip-X-0-X-X.eu-central-1.compute.internal unremovable: nvidia.com/gpu requested (100% of allocatable) is above the scale-down utilization threshold", "msg": "Node ip-X-0-X-X.eu-central-1.compute.internal unremovable: nvidia.com/gpu requested (100% of allocatable) is above the scale-down utilization threshold", "msg": "Node ip-X-0-X-X.eu-central-1.compute.internal cannot be removed: xxx/nvidia-smi is not replicated", "msg": "Updated ASG cache for eks-gpu-3e3b-44cd2a5d-a57a-9950-1e9b-dbfccb896682. min/max/current is 0/1/1", "msg": "Updated ASG cache for eks-gpu-3e3b-44cd2a5d-a57a-9950-1e9b-dbfccb896682. min/max/current is 0/1/1",