feat: defensive scaling #43

TheOutdoorProgrammer · 2025-11-19T18:04:17Z

Summary

Adds defensive scaling safeguards to detect and recover from unreasonably high ASG desired capacity values that can occur during AWS service disruptions or external modifications.

Motivation

During AWS service disruptions, the Auto Scaling Group desired capacity can be incorrectly set to extremely high values (e.g., 79 instances when only 3 workers and 5 pending runs exist). This can cause massive unnecessary scaling, leading to significant cost overruns and operational issues. We need a mechanism to detect these anomalies and automatically reset to sane capacity values.

Changes

Sanity check logic in autoscaler: Before normal scaling operations, calculate the expected maximum capacity from valid workers, pending runs, and the scaling buffer. If the ASG desired capacity exceeds this value by at least the configured threshold, log an error and reset the desired capacity to valid workers plus pending runs, within the ASG min/max bounds.
Graceful handling of invalid worker metadata: Workers with missing or invalid metadata are now logged and skipped rather than causing fatal errors, preventing invalid workers from blocking autoscaling operations.
New ValidWorkerCount() method: Track only workers with valid metadata separately from total worker count, ensuring scaling decisions are based on actual functional workers.
Updated scaling decision logic: All scaling decisions now use validWorkerCount instead of total worker count to prevent stray/invalid workers from affecting capacity calculations.
Comprehensive test coverage: Added tests for sanity check scenarios including high capacity detection, min/max size boundary handling, and reasonable capacity validation.
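
The skip-and-count behavior described above can be sketched as follows. This is an illustrative Python sketch, not the code from this PR: the `Worker` type, the `REQUIRED_KEYS` tuple, and the metadata field names are hypothetical, and only the idea (log and skip invalid workers, count the rest) comes from the description.

```python
import logging
from dataclasses import dataclass
from typing import Iterable, Optional

logger = logging.getLogger("autoscaler_sketch")

# Hypothetical metadata fields; the real autoscaler's requirements may differ.
REQUIRED_KEYS = ("instance_id", "asg_id")


@dataclass
class Worker:
    worker_id: str
    metadata: Optional[dict]  # None models a worker with no metadata at all


def has_valid_metadata(worker: Worker) -> bool:
    """A worker is valid only if every required field is present and non-empty."""
    if not worker.metadata:
        return False
    return all(worker.metadata.get(key) for key in REQUIRED_KEYS)


def valid_worker_count(workers: Iterable[Worker]) -> int:
    """Count workers with valid metadata; log and skip the rest instead of failing."""
    count = 0
    for worker in workers:
        if not has_valid_metadata(worker):
            logger.warning(
                "skipping worker %s: missing or invalid metadata", worker.worker_id
            )
            continue
        count += 1
    return count
```

Scaling decisions would then take `valid_worker_count(workers)` as input rather than `len(workers)`, so a stray worker cannot inflate the capacity calculation.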

Features

Configurable sanity threshold: New AUTOSCALING_CAPACITY_SANITY_CHECK environment variable (defaults to 10) controls the excess capacity threshold that triggers defensive reset.
Automatic capacity recovery: When suspicious capacity is detected, automatically resets to valid_workers + pending_runs (respecting ASG min/max constraints).
Non-fatal error handling: Invalid worker metadata no longer blocks autoscaling; workers are skipped with warning logs and operations continue.
Detailed logging: Error and warning logs include full context (current capacity, valid workers, pending runs, expected capacity, difference) for debugging and alerting.
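
As a rough illustration of the configurable threshold, the sketch below reads `AUTOSCALING_CAPACITY_SANITY_CHECK` and falls back to the documented default of 10. The function name `load_sanity_threshold` is hypothetical, and rejecting non-numeric or negative values is an assumption; the PR description does not say how the real implementation treats invalid input.

```python
import os
from typing import Mapping, Optional

DEFAULT_SANITY_THRESHOLD = 10  # default stated in the PR description
ENV_VAR = "AUTOSCALING_CAPACITY_SANITY_CHECK"


def load_sanity_threshold(env: Optional[Mapping[str, str]] = None) -> int:
    """Return the sanity threshold from the environment, or the default if unset."""
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR, "").strip()
    if not raw:
        return DEFAULT_SANITY_THRESHOLD
    value = int(raw)  # raises ValueError on non-numeric input (assumed behavior)
    if value < 0:
        raise ValueError(f"{ENV_VAR} must not be negative, got {value}")
    return value
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.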

Usage

The sanity check runs automatically on every scaling cycle. To customize the threshold:

# Set threshold to 20 instances instead of default 10
export AUTOSCALING_CAPACITY_SANITY_CHECK=20

The sanity check triggers when:

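excess_capacity = asg_desired_capacity - (valid_workers + pending_runs + max_create)
if excess_capacity >= sanity_threshold:
    # Reset capacity to valid_workers + pending_runs, clamped to the ASG min/max size
    asg_desired_capacity = min(max(valid_workers + pending_runs, asg_min_size), asg_max_size)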
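
To make the condition concrete, here is a small Python sketch that applies it to the numbers from the Motivation section (desired capacity 79, 3 valid workers, 5 pending runs). The function and field names are hypothetical, and `max_create=1`, `asg_min=1`, and `asg_max=100` are assumed values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SanityResult:
    triggered: bool
    excess_capacity: int
    new_desired: int  # unchanged desired capacity when the check does not trigger


def check_capacity(asg_desired, valid_workers, pending_runs, max_create,
                   sanity_threshold, asg_min, asg_max):
    """Apply the sanity check and compute the reset target if it triggers."""
    expected_max = valid_workers + pending_runs + max_create
    excess = asg_desired - expected_max
    if excess < sanity_threshold:
        return SanityResult(False, excess, asg_desired)
    # Reset target is valid_workers + pending_runs, kept within the ASG bounds.
    target = valid_workers + pending_runs
    clamped = max(asg_min, min(target, asg_max))
    return SanityResult(True, excess, clamped)


# Motivation scenario: 79 desired, 3 valid workers, 5 pending runs.
result = check_capacity(79, 3, 5, max_create=1, sanity_threshold=10,
                        asg_min=1, asg_max=100)
print(result)  # excess is 79 - (3 + 5 + 1) = 70, so the reset target is 8
```

With the same assumed settings, a desired capacity of 12 gives an excess of 3, which is below the threshold, so the capacity is left alone.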

When triggered, logs will show:

ERROR: "ASG desired capacity is suspiciously high" with full diagnostic context
WARN: "attempting to reset ASG desired capacity to sane value" with old/new values
INFO: "successfully reset ASG desired capacity" on successful recovery
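
The log sequence above could be produced along these lines. This is a sketch using Python's standard `logging` module, not the project's logger: the function name `reset_with_logging`, the `apply_reset` callback, and the context key names are hypothetical, while the three messages and the listed context fields are taken from this description.

```python
import logging
from typing import Callable

logger = logging.getLogger("autoscaler_sketch")


def reset_with_logging(current: int, valid_workers: int, pending_runs: int,
                       expected: int, new_desired: int,
                       apply_reset: Callable[[int], None]) -> None:
    """Emit the ERROR, WARN, and INFO records around a capacity reset."""
    logger.error(
        "ASG desired capacity is suspiciously high",
        extra={"context": {
            "current_capacity": current,
            "valid_workers": valid_workers,
            "pending_runs": pending_runs,
            "expected_capacity": expected,
            "difference": current - expected,
        }},
    )
    logger.warning(
        "attempting to reset ASG desired capacity to sane value",
        extra={"context": {"old": current, "new": new_desired}},
    )
    # If the reset raises, the exception propagates and no success record is logged.
    apply_reset(new_desired)
    logger.info(
        "successfully reset ASG desired capacity",
        extra={"context": {"new": new_desired}},
    )
```

Attaching the diagnostic values as structured fields rather than formatting them into the message keeps the message text stable, which makes it easier to alert on.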

feat: defensive scaling

78f5b6f

TheOutdoorProgrammer requested review from jubranNassar and peterdeme November 19, 2025 18:07

peterdeme approved these changes Nov 19, 2025

TheOutdoorProgrammer merged commit de8c964 into main Nov 20, 2025
5 checks passed

TheOutdoorProgrammer deleted the defensive-scaling branch November 20, 2025 14:10
