CA: Expose node provisioning failures as Kubernetes events #8817

@MenD32

Description

Which component are you using?:
/area cluster-autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

There is a critical visibility gap when a scale-up "succeeds" at the node group level, but "fails" at the individual node level.

Here is the common failure scenario:

  1. A Pod becomes Pending.
  2. Cluster Autoscaler (CA) correctly identifies a node group (e.g., a spot instance group) and successfully makes an API call to increase its replica count. The CA logs this as a success.
  3. The cloud provider accepts the API call, but then fails to provision the new node. This is common with spot instances (InsufficientSpotCapacity) or in zones with low resource availability (ZONE_RESOURCE_POOL_EXHAUSTED).

Because the initial API call was successful, the CA's core logic is not always aware of this provisioning failure.

This causes two major problems:

  1. Poor Operator Visibility: The Pod stays Pending. An operator checking the CA logs sees a successful scale-up. The actual reason for the failure (e.g., InsufficientSpotCapacity) is hidden in cloud provider audit logs and is not visible in kubectl.
  2. Ineffective Failover: The CA may not effectively fail over to a different node group. Since it didn't receive an immediate error from the API call, it may not place the failing node group into backoff. This can cause the CA to "wait" for the new node (which will never arrive) instead of immediately trying a different, lower-priority node group that might be available.

Describe the solution you'd like.:

Node provisioning failures should be visible both to the CA core logic and to operators via Pod events. This makes the scale-up process easier to monitor, and it lays the groundwork for a future feature where, if one node group fails to provision, a scale-up of a different node group can be scheduled instead.

Describe any alternative solutions you've considered.:

This feature request covers the foundational first step toward solving this issue: making the provisioning failure visible inside Kubernetes.

The Cluster Autoscaler should track the result of a scale-up request. It should have a timeout (e.g., max-node-provision-time) for how long it "waits" for a new node to join.
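A minimal sketch of what that tracking might look like (all types and names below are hypothetical illustrations, not existing Cluster Autoscaler code): each accepted scale-up request is recorded with a deadline derived from max-node-provision-time, and requests that pass their deadline without a node joining are surfaced as provisioning failures.

```go
package scaleup

import "time"

// pendingScaleUp is a hypothetical record of one scale-up request that the
// autoscaler is still waiting on.
type pendingScaleUp struct {
	NodeGroup string    // node group whose replica count was increased
	Requested time.Time // when the cloud provider accepted the API call
	Deadline  time.Time // Requested + max-node-provision-time
}

// tracker keeps the set of scale-ups that have not yet produced a node.
type tracker struct {
	maxProvisionTime time.Duration
	pending          []pendingScaleUp
}

// Record remembers a scale-up whose API call succeeded.
func (t *tracker) Record(nodeGroup string, now time.Time) {
	t.pending = append(t.pending, pendingScaleUp{
		NodeGroup: nodeGroup,
		Requested: now,
		Deadline:  now.Add(t.maxProvisionTime),
	})
}

// Expired returns scale-ups whose node never joined within the deadline;
// these are the candidates for a NodeProvisioningFailed event.
func (t *tracker) Expired(now time.Time) []pendingScaleUp {
	var expired, stillPending []pendingScaleUp
	for _, p := range t.pending {
		if now.After(p.Deadline) {
			expired = append(expired, p)
		} else {
			stillPending = append(stillPending, p)
		}
	}
	t.pending = stillPending
	return expired
}
```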

If a node from a specific group fails to join within this time (or if the cloud provider API reports a failure), the CA should generate a Kubernetes Warning Event.

Example Event:

Type: Warning
Reason: NodeProvisioningFailed
Message: Node for 'spot-gpu-pool-a' failed to provision: Cloud provider error 'InsufficientSpotCapacity'.

This event provides immediate, kubectl-native visibility for operators, directly showing them the root cause of their Pending pods.
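As a rough illustration only (not the CA's actual code path), such an event could be attached to the pending Pod with client-go's record.EventRecorder, so it shows up in kubectl describe pod and kubectl get events. emitProvisioningFailure is a hypothetical helper:

```go
package scaleup

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

// emitProvisioningFailure attaches a Warning event to the Pod that triggered
// the scale-up, naming the node group and the cloud provider error
// (e.g. "InsufficientSpotCapacity").
func emitProvisioningFailure(recorder record.EventRecorder, pod *corev1.Pod, nodeGroup, cloudErr string) {
	recorder.Eventf(pod, corev1.EventTypeWarning, "NodeProvisioningFailed",
		"Node for %q failed to provision: Cloud provider error %q.", nodeGroup, cloudErr)
}
```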

Additional context.:

This change provides the essential foundation for future enhancements. Once this failure is tracked and exposed as an Event, follow-up pull requests could:

  • Expose this failure as a Prometheus metric (a rough sketch follows after this list).
  • Use this failure signal to improve the CA's failover logic (e.g., by triggering a backoff).
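For the Prometheus follow-up in the first bullet, a counter labeled by node group and failure reason seems like the natural shape. The metric name and labels below are illustrative only, not an existing CA metric:

```go
package scaleup

import "github.com/prometheus/client_golang/prometheus"

// nodeProvisioningFailures is an illustrative counter, incremented once per
// detected provisioning failure, labeled by node group and cloud error.
var nodeProvisioningFailures = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "cluster_autoscaler",
		Name:      "node_provisioning_failures_total",
		Help:      "Number of scale-ups that failed to produce a node in time.",
	},
	[]string{"node_group", "reason"},
)

func init() {
	prometheus.MustRegister(nodeProvisioningFailures)
}
```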
