feat: defensive scaling #43
Merged
Summary
Adds defensive scaling safeguards to detect and recover from unreasonably high ASG desired capacity values that can occur during AWS service disruptions or external modifications.
Motivation
During AWS service disruptions, the Auto Scaling Group desired capacity can be incorrectly set to extremely high values (e.g., 79 instances when only 3 workers and 5 pending runs exist). This can cause massive unnecessary scaling, leading to significant cost overruns and operational issues. We need a mechanism to detect these anomalies and automatically reset to sane capacity values.
Changes
- ValidWorkerCount() method: Track only workers with valid metadata separately from total worker count, ensuring scaling decisions are based on actual functional workers.
- Use validWorkerCount instead of total worker count to prevent stray/invalid workers from affecting capacity calculations.
Features
- The AUTOSCALING_CAPACITY_SANITY_CHECK environment variable (defaults to 10) controls the excess capacity threshold that triggers defensive reset.
- On reset, desired capacity is set to valid_workers + pending_runs (respecting ASG min/max constraints).
Usage
The sanity check runs automatically on every scaling cycle. To customize the threshold, set the AUTOSCALING_CAPACITY_SANITY_CHECK environment variable.
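The sanity check triggers when the ASG desired capacity exceeds what the valid workers and pending runs justify by more than the configured threshold.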
When triggered, logs will show:
- ERROR: "ASG desired capacity is suspiciously high" with full diagnostic context
- WARN: "attempting to reset ASG desired capacity to sane value" with old/new values
- INFO: "successfully reset ASG desired capacity" on successful recovery
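The overall flow can be sketched in Go as below. This is an illustration under stated assumptions, not the code from this PR: the Worker type, the rule that a worker is valid when its instance ID is non-empty, the exact comparison used in isSuspicious, and the ASG bounds of 1 and 20 are all hypothetical. The numbers reuse the example from the Motivation section (desired capacity 79, 3 valid workers, 5 pending runs).

```go
package main

import "fmt"

// Worker is a hypothetical, simplified worker record.
type Worker struct {
	InstanceID string // empty when metadata is missing (assumed validity rule)
}

// validWorkerCount counts only workers that pass the assumed metadata check.
func validWorkerCount(workers []Worker) int {
	n := 0
	for _, w := range workers {
		if w.InstanceID != "" {
			n++
		}
	}
	return n
}

// isSuspicious reports whether desired capacity exceeds
// validWorkers+pendingRuns by more than threshold (assumed condition).
func isSuspicious(desired, validWorkers, pendingRuns, threshold int) bool {
	return desired-(validWorkers+pendingRuns) > threshold
}

// saneCapacity returns validWorkers+pendingRuns clamped to [minSize, maxSize].
func saneCapacity(validWorkers, pendingRuns, minSize, maxSize int) int {
	target := validWorkers + pendingRuns
	if target < minSize {
		return minSize
	}
	if target > maxSize {
		return maxSize
	}
	return target
}

func main() {
	workers := []Worker{
		{InstanceID: "i-a"},
		{InstanceID: "i-b"},
		{InstanceID: "i-c"},
		{InstanceID: ""}, // stray worker without metadata, not counted
	}
	valid := validWorkerCount(workers)
	desired, pending, threshold := 79, 5, 10
	minSize, maxSize := 1, 20 // hypothetical ASG bounds

	if isSuspicious(desired, valid, pending, threshold) {
		target := saneCapacity(valid, pending, minSize, maxSize)
		// Message text follows the log lines listed above; the field
		// layout is illustrative only.
		fmt.Printf("ERROR: ASG desired capacity is suspiciously high desired=%d valid_workers=%d pending_runs=%d\n",
			desired, valid, pending)
		fmt.Printf("WARN: attempting to reset ASG desired capacity to sane value old=%d new=%d\n",
			desired, target)
	}
}
```

Keeping the check and the target computation as pure functions makes them easy to unit test without touching the AWS API; the actual ASG update call is deliberately left out of the sketch.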