Description
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
- Kernel Version: 5.15.0-1057-aws
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): EKS 1.27
- GPU Operator Version: v23.9.2
2. Issue or feature description
Briefly explain the issue in terms of expected behavior and current behavior.
We are using the GPU Operator to manage GPU drivers on EC2 instances managed by the Kubernetes cluster autoscaler. The autoscaler is configured to scale the node group from 0 to 8 instances.
The issue we are seeing is that starting a single GPU workload triggers multiple (~4) node scale-ups before the autoscaler marks all but one node as unneeded and scales them back down.
We also tried the suggestion from #140 (comment), without success.
Our current best guess is that the first new node is marked Ready before the GPU Operator has finished its setup, so no nvidia.com/gpu resource is advertised yet and the GPU workload Pod remains unschedulable.
The cluster autoscaler therefore sees a Ready node but a still-unschedulable workload Pod and triggers another scale-up. This repeats until the first node completes the GPU setup, advertises the requested GPU resource, and the workload Pod becomes schedulable.
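To make the suspected race concrete, here is a minimal sketch of how a scale-from-zero GPU node group can be tagged so the cluster autoscaler's node template already advertises the GPU resource and accelerator label. This is illustrative only and not necessarily the workaround referenced above; the eksctl config, cluster and node group names, region, instance type, and accelerator value are all placeholders, and the tag keys follow the cluster autoscaler's AWS node-template convention.

```yaml
# Hypothetical eksctl node group definition; all names and values are placeholders.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example-cluster        # placeholder
  region: us-east-1            # placeholder
managedNodeGroups:
  - name: gpu-nodes            # placeholder
    instanceType: g4dn.xlarge  # placeholder GPU instance type
    minSize: 0
    maxSize: 8
    # Label the node so the cluster autoscaler treats it as a GPU node and
    # tolerates the delay between Ready and the GPU resource appearing.
    labels:
      k8s.amazonaws.com/accelerator: nvidia-tesla-t4   # depends on the GPU type
    # ASG tags so the autoscaler's scale-from-zero node template already
    # includes the accelerator label and the nvidia.com/gpu capacity.
    tags:
      k8s.io/cluster-autoscaler/node-template/label/k8s.amazonaws.com/accelerator: nvidia-tesla-t4
      k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu: "1"
```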
3. Steps to reproduce the issue
Detailed steps to reproduce the issue.
- Configure a GPU node group with the cluster autoscaler and a minimum size of 0
- Start a workload requesting GPU resources (see the example manifest below) and observe how many nodes are scaled up
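For reference, a minimal example of such a workload; the Pod name and image tag are illustrative, and any Pod requesting nvidia.com/gpu reproduces the behavior for us.

```yaml
# Minimal GPU workload used to trigger the scale-up (names/image are examples only).
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```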