@TheOutdoorProgrammer
Summary

Adds defensive scaling safeguards to detect and recover from unreasonably high ASG desired capacity values that can occur during AWS service disruptions or external modifications.

Motivation

During AWS service disruptions, the Auto Scaling Group desired capacity can be incorrectly set to extremely high values (e.g., 79 instances when only 3 workers and 5 pending runs exist). This can cause massive unnecessary scaling, leading to significant cost overruns and operational issues. We need a mechanism to detect these anomalies and automatically reset to sane capacity values.

Changes

  • Sanity check logic in autoscaler: Before normal scaling operations, calculate the expected maximum capacity from valid workers, pending runs, and the scaling buffer. If the ASG desired capacity exceeds this value by at least the configured threshold, log an error and reset the desired capacity to valid workers plus pending runs, within the ASG min/max size.
  • Graceful handling of invalid worker metadata: Workers with missing or invalid metadata are now logged and skipped rather than causing fatal errors, preventing invalid workers from blocking autoscaling operations.
  • New ValidWorkerCount() method: Track only workers with valid metadata separately from total worker count, ensuring scaling decisions are based on actual functional workers.
  • Updated scaling decision logic: All scaling decisions now use validWorkerCount instead of total worker count to prevent stray/invalid workers from affecting capacity calculations.
  • Comprehensive test coverage: Added tests for sanity check scenarios including high capacity detection, min/max size boundary handling, and reasonable capacity validation.
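
The "skip invalid workers, count only valid ones" behaviour described above can be sketched roughly as follows. This is an illustration in Python, not the autoscaler's actual code: the `Worker` class, the `instance_id` and `asg_id` metadata keys, and the validity rule are all hypothetical stand-ins for whatever the real metadata check does.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("autoscaler")


@dataclass
class Worker:
    # Hypothetical shape; the real worker type is not shown in this PR description.
    worker_id: str
    metadata: Optional[dict]


def has_valid_metadata(worker: Worker) -> bool:
    # Assumption: metadata is "valid" when both keys are present and non-empty.
    md = worker.metadata or {}
    return bool(md.get("instance_id")) and bool(md.get("asg_id"))


def valid_worker_count(workers: list) -> int:
    """Count workers with valid metadata; log and skip the rest instead of failing."""
    count = 0
    for worker in workers:
        if not has_valid_metadata(worker):
            logger.warning(
                "skipping worker %s with missing or invalid metadata",
                worker.worker_id,
            )
            continue
        count += 1
    return count
```

The point of the design is that one stray worker lowers the count used for scaling decisions by one, rather than aborting the whole scaling cycle.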

Features

  • Configurable sanity threshold: New AUTOSCALING_CAPACITY_SANITY_CHECK environment variable (defaults to 10) controls the excess capacity threshold that triggers defensive reset.
  • Automatic capacity recovery: When suspicious capacity is detected, the autoscaler resets the ASG desired capacity to valid_workers + pending_runs, clamped to the ASG min/max size.
  • Non-fatal error handling: Invalid worker metadata no longer blocks autoscaling; workers are skipped with warning logs and operations continue.
  • Detailed logging: Error and warning logs include full context (current capacity, valid workers, pending runs, expected capacity, difference) for debugging and alerting.
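
As a rough sketch of the first two features, the threshold lookup and the clamped reset target might look like this in Python. Only the variable name `AUTOSCALING_CAPACITY_SANITY_CHECK` and its default of 10 come from this PR; the function names, the `asg_min_size` and `asg_max_size` parameters, and the handling of a non-numeric value (here it simply raises `ValueError`) are assumptions.

```python
import os

DEFAULT_SANITY_THRESHOLD = 10  # default stated in this PR


def sanity_threshold(env=None) -> int:
    """Read the threshold from the environment, falling back to the default."""
    env = os.environ if env is None else env
    raw = env.get("AUTOSCALING_CAPACITY_SANITY_CHECK", "")
    if raw.strip() == "":
        return DEFAULT_SANITY_THRESHOLD
    # Assumption: a non-numeric value is treated as a configuration error.
    return int(raw)


def reset_target(valid_workers: int, pending_runs: int,
                 asg_min_size: int, asg_max_size: int) -> int:
    """Capacity to reset to: valid_workers + pending_runs, kept within ASG bounds."""
    target = valid_workers + pending_runs
    return max(asg_min_size, min(target, asg_max_size))
```

Clamping matters at the boundaries: with 3 valid workers, 5 pending runs, and an ASG minimum of 10, the reset target is 10 rather than 8.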

Usage

The sanity check runs automatically on every scaling cycle. To customize the threshold:

# Set threshold to 20 instances instead of default 10
export AUTOSCALING_CAPACITY_SANITY_CHECK=20

The sanity check triggers when:

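excess_capacity = asg_desired_capacity - (valid_workers + pending_runs + max_create)
if excess_capacity >= sanity_threshold:
    # Reset capacity to valid_workers + pending_runs, kept within the ASG min/max size
    asg_desired_capacity = min(max(valid_workers + pending_runs, asg_min_size), asg_max_size)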
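
To make the trigger condition concrete, here is a small worked example in Python using the numbers from the Motivation section (desired capacity 79, 3 valid workers, 5 pending runs). The value `max_create=1` is an assumption chosen for illustration, since this PR does not state it, and the function names are hypothetical.

```python
def excess_capacity(asg_desired_capacity: int, valid_workers: int,
                    pending_runs: int, max_create: int) -> int:
    """Instances beyond what valid workers, pending runs, and max_create justify."""
    return asg_desired_capacity - (valid_workers + pending_runs + max_create)


def should_reset(asg_desired_capacity: int, valid_workers: int,
                 pending_runs: int, max_create: int,
                 sanity_threshold: int = 10) -> bool:
    """True when the excess reaches the threshold (note the inclusive >=)."""
    excess = excess_capacity(asg_desired_capacity, valid_workers,
                             pending_runs, max_create)
    return excess >= sanity_threshold


# Motivation scenario: 79 - (3 + 5 + 1) = 70, well above the default threshold.
print(excess_capacity(79, 3, 5, 1))  # 70
print(should_reset(79, 3, 5, 1))     # True

# Boundary behaviour with the default threshold of 10:
print(should_reset(18, 3, 5, 1))     # False (excess is 9)
print(should_reset(19, 3, 5, 1))     # True  (excess is exactly 10)
```

Because the comparison is inclusive, an excess exactly equal to the threshold already triggers a reset.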

When triggered, logs will show:

  • ERROR: "ASG desired capacity is suspiciously high" with full diagnostic context
  • WARN: "attempting to reset ASG desired capacity to sane value" with old/new values
  • INFO: "successfully reset ASG desired capacity" on successful recovery

@TheOutdoorProgrammer TheOutdoorProgrammer merged commit de8c964 into main Nov 20, 2025
5 checks passed
@TheOutdoorProgrammer TheOutdoorProgrammer deleted the defensive-scaling branch November 20, 2025 14:10