Description
Which component are you using?:
/area cluster-autoscaler
What version of the component are you using?:
1.32.1 (tested also on 1.32.4 and 1.34.1)
Component version:
registry.k8s.io/autoscaling/cluster-autoscaler:v1.32.1
helm.sh/chart: cluster-autoscaler-9.52.1
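As a quick sanity check, the image above can be read back from the running deployment (the deployment name is the one that also appears in the log commands further below):
# prints the container image of the CA deployment
kubectl -n kube-system get deployment cluster-autoscaler-aws-cluster-autoscaler \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'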
What k8s version are you using (kubectl version)?:
kubectl version Output
$ kubectl version
Client Version: v1.32.9-eks-113cf36
Kustomize Version: v5.5.0
Server Version: v1.32.9-eks-113cf36
What environment is this in?:
AWS EKS
What did you expect to happen?:
Reliable scale-out of the EKS managed worker node group from 0 every time a dependent pod is scheduled.
What happened instead?:
This could be a follow-up to #4893.
CA scales the EKS managed worker group out from 0 only the first time. After it scales back in to 0, it never scales out again.
Predicate "NodeResourcesFit" fails due to predicateReasons=[Insufficient nvidia.com/gpu].
How to reproduce it (as minimally and precisely as possible):
- have the CA deployment scaled to 0 (verbosity 4 is enough; I tried it up to 9 and there is no extra detail that would explain the behavior)
- have an EKS worker node group with min 0 and max 1 (we use a launch template in order to tag EBS volumes, NICs and instances, not the GPU; this should not change anything because CA does not read launch templates for some reason, although the IAM policy allows it), and select only one subnet for this group - pretend that you have to use a PV claim with a specific EBS ARN. You don't actually need that to reproduce the issue; it just explains why only one subnet is set.
- meet all tag and label requirements (see details below)
- scale the CA deployment out to 1 and let it start
- kubectl apply -f pod-gpu.yaml
cat pod-gpu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  nodeSelector:
    nodetype: gpu
  restartPolicy: OnFailure
  containers:
    - name: nvidia-smi
      image: nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04
      command: ['/bin/sh', '-c']
      args: ['tail -f /dev/null']
      resources:
        limits:
          nvidia.com/gpu: "1"
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
- kubectl delete -f pod-gpu.yaml
- wait until CA scales the GPU node down, or help it yourself by setting the desired count back to 0
- kubectl apply -f pod-gpu.yaml again
- observe problems:
kubectl -n kube-system logs --tail=20 -f deployment/cluster-autoscaler-aws-cluster-autoscaler | jq . | egrep "gpu|nvidia"
"msg": "failed to find place for xxx/nvidia-smi: can't schedule pod xxx/nvidia-smi: couldn't find a matching Node with passing predicates", "msg": "Pod xxx/nvidia-smi is unschedulable", "msg": "Pod xxx/nvidia-smi can't be scheduled on eks-gpu-3e3b-44cd2a5d-a57a-9950-1e9b-dbfccb896682, predicate checking error: can't schedule pod xxx/nvidia-smi: predicate \"NodeResourcesFit\" didn't pass (predicateReasons=[Insufficient nvidia.com/gpu]; debugInfo=nodeName: \"template-node-for-eks-gpu-3e3b-44cd2a5d-a57a-9950-1e9b-dbfccb896682-6649836023768235924\")",
- run kubectl -n kube-system rollout restart deployment/cluster-autoscaler-aws-cluster-autoscaler and observe it works again (see also the optional ASG check after these steps)
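Optionally, to confirm that no scale-out request reaches the ASG at all on the second attempt, the ASG activity history can be watched in parallel with the steps above. The ASG name is the one reported in the CA logs; the region is an assumption based on the eu-central-1b zone used here:
# expect no new scale-out activity to appear while the second attempt fails
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name eks-gpu-3e3b-44cd2a5d-a57a-9950-1e9b-dbfccb896682 \
  --region eu-central-1 --max-items 5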
Labels and tags:
- k8s labels:
- nodetype = "gpu"
- "nvidia.com/gpu" = "true"
- "nvidia.com/gpu.present" = "true"
- "topology.ebs.csi.aws.com/zone" = "eu-central-1b"
- "az" = "az1"
- worker node group and ASG tags (a verification sketch follows this list):
- "k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone" = "eu-central-1b"
- "k8s.io/cluster-autoscaler/node-template/label/az" = "az1"
- "k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu" = "true"
- "k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu.present" = "true"
- "k8s.io/cluster-autoscaler/node-template/taint/nvidia.com/gpu" = "gpu:NoSchedule"
- "k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu" = "1"
- "k8s.io/cluster-autoscaler/node-template/label/nodetype" = "gpu"
Anything else we need to know?:
- This setup used to work; it just isn't clear exactly when it stopped working, because our recent platform upgrade changed plenty of things at once.
Instead of going backwards, I decided to make sure we have everything in place for CA to scale out from zero properly. I believe I have met the requirements, and indeed CA scales out the right worker group - it just does so only once. It isn't clear to me what the problem is if it was already capable of scaling out the first time.
- EC2 capacity is available (the request is for on-demand); I can manually set the desired size (see the CLI sketch at the end of this section), which allows Kubernetes to schedule the pod as expected. CA can then scale back in, but scale-out still does not work.
- I believe all requirements described in the documentation are met.
- The ASG is mentioned in the CA log and its cache is regularly updated.
See the following log as proof from when I manually scale the ASG out:
"msg": "Node ip-X-0-X-X.eu-central-1.compute.internal unremovable: nvidia.com/gpu requested (100% of allocatable) is above the scale-down utilization threshold", "msg": "Node ip-X-0-X-X.eu-central-1.compute.internal unremovable: nvidia.com/gpu requested (100% of allocatable) is above the scale-down utilization threshold", "msg": "Node ip-X-0-X-X.eu-central-1.compute.internal cannot be removed: xxx/nvidia-smi is not replicated", "msg": "Updated ASG cache for eks-gpu-3e3b-44cd2a5d-a57a-9950-1e9b-dbfccb896682. min/max/current is 0/1/1", "msg": "Updated ASG cache for eks-gpu-3e3b-44cd2a5d-a57a-9950-1e9b-dbfccb896682. min/max/current is 0/1/1",