Search before asking
Description
This is the first implementation phase of AIP-5: Dynamic Resource Allocation for Optimizer. It lays the configuration foundation that the later scaling phases build on, with no runtime scaling behavior yet. Dynamic allocation is opt-in and disabled by default (dynamic-allocation.enabled = false), so existing groups are unaffected and this phase is safe to merge independently. The full dynamic-allocation.* property set is declared here so the configuration surface and its validation are settled in one place, rather than drip-fed per phase.
Scope:
- Configuration properties. Declare the dynamic-allocation.* properties as constants in OptimizerProperties, each annotated with @since:
  - dynamic-allocation.enabled (default false)
  - dynamic-allocation.min-parallelism (default 0)
  - dynamic-allocation.max-parallelism (required when enabled; hard-capped at 1024)
  - dynamic-allocation.scheduler-backlog-timeout (default 1min)
  - dynamic-allocation.sustained-backlog-timeout (default 30s)
  - dynamic-allocation.executor-idle-timeout (default 5min, minimum 30s)
  - dynamic-allocation.scale-down-cooldown (default 1min)
  - dynamic-allocation.drain-timeout (default 15min)
- DynamicAllocationConfig parsing + validation. A new DynamicAllocationConfig.parse(ResourceGroup) / validate() that enforces:
  - max-parallelism is mandatory when enabled;
  - min-parallelism ≤ max-parallelism ≤ 1024;
  - executor-idle-timeout ≥ 30s;
  - all durations parse to positive values;
  - DRA is not enabled on an externally-registered optimizer (conservative check in this phase: reject the external container).
- Failure handling.
  - REST create/update: fail fast. OptimizerGroupController.createResourceGroup / updateResourceGroup reject an invalid DRA config with HTTP 400. (The update path previously had no property-value validation at all.)
  - Startup load: fail safe, never silent. A persisted group with an invalid DRA config no longer crashes AMS; it logs a warning and falls back to DRA-disabled behavior. (The optimizer_group_config_invalid gauge is wired in the observability phase.)
- min-parallelism deprecation. The flat min-parallelism property is deprecated in favor of dynamic-allocation.min-parallelism but still honored. Resolution order: namespaced → legacy → 0. A one-off deprecation warning is logged at config-entry points (startup load, REST), not on the keeper hot path.
- Auto-reset compatibility. The keeper already auto-resets a group's min-parallelism after optimizer-group.max-keeping-attempts consecutive failed optimizer creations. Two adjustments keep this consistent with the new property model:
  - Disabled when DRA is effectively enabled. Once a group opts into dynamic allocation (and its config is valid), DRA owns the group's scale decisions, so the keeper no longer erodes its min-parallelism floor; it keeps retrying instead. "Effectively enabled" means opted-in and valid; an invalid config is treated as disabled, matching the startup fail-safe, so such groups still auto-reset as before.
  - Writes the key the group actually uses. When auto-reset does apply, it now writes whichever min-parallelism key the group reads from (namespaced if present, else the legacy flat key). Previously it always wrote the flat key; for a group using the namespaced property that write was shadowed by resolution order, turning auto-reset into an endless no-op loop (repeated warning + DB update with no effect).
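To make the scope items above concrete, here is a minimal Java sketch of the validation rules, the min-parallelism resolution order, and the "effectively enabled" check. It is an illustration under stated assumptions, not the proposed implementation: the class DraConfigSketch and all of its method names are hypothetical, it works on a plain property map rather than a ResourceGroup, it assumes durations are written as "<n>s" or "<n>min", it assumes validation is skipped when the group has not opted in, and it assumes the min ≤ max check uses the resolved (namespaced or legacy) min-parallelism.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the rules in this issue; not the actual DynamicAllocationConfig. */
final class DraConfigSketch {
  static final String PREFIX = "dynamic-allocation.";
  static final String LEGACY_MIN = "min-parallelism";
  static final String NAMESPACED_MIN = PREFIX + "min-parallelism";
  static final int MAX_PARALLELISM_CAP = 1024;
  static final Duration MIN_IDLE_TIMEOUT = Duration.ofSeconds(30);
  // Duration properties and their defaults, as listed under "Configuration properties".
  static final String[][] DURATIONS = {
    {"scheduler-backlog-timeout", "1min"},
    {"sustained-backlog-timeout", "30s"},
    {"executor-idle-timeout", "5min"},
    {"scale-down-cooldown", "1min"},
    {"drain-timeout", "15min"},
  };

  private DraConfigSketch() {}

  /** Key the group reads min-parallelism from: namespaced if present, else the legacy flat key. */
  static String minParallelismKey(Map<String, String> props) {
    return props.containsKey(NAMESPACED_MIN) ? NAMESPACED_MIN : LEGACY_MIN;
  }

  /** Resolution order: namespaced, then legacy, then 0. */
  static int resolveMinParallelism(Map<String, String> props) {
    String raw = props.get(minParallelismKey(props));
    return raw == null ? 0 : Integer.parseInt(raw.trim());
  }

  /** Assumed duration syntax: "<n>min" or "<n>s". Anything else is rejected. */
  static Duration parseDuration(String raw) {
    String s = raw.trim();
    if (s.endsWith("min")) {
      return Duration.ofMinutes(Long.parseLong(s.substring(0, s.length() - 3).trim()));
    }
    if (s.endsWith("s")) {
      return Duration.ofSeconds(Long.parseLong(s.substring(0, s.length() - 1).trim()));
    }
    throw new IllegalArgumentException("unsupported duration: " + raw);
  }

  static boolean optedIn(Map<String, String> props) {
    return Boolean.parseBoolean(props.getOrDefault(PREFIX + "enabled", "false"));
  }

  /** Returns the list of violations; an empty list means the config is valid. */
  static List<String> validate(Map<String, String> props, String container) {
    List<String> errors = new ArrayList<>();
    if (!optedIn(props)) {
      return errors; // assumption: nothing is checked for groups that have not opted in
    }
    String maxRaw = props.get(PREFIX + "max-parallelism");
    if (maxRaw == null) {
      errors.add("max-parallelism is required when dynamic allocation is enabled");
    } else {
      int max = Integer.parseInt(maxRaw.trim());
      if (resolveMinParallelism(props) > max) {
        errors.add("min-parallelism must be <= max-parallelism");
      }
      if (max > MAX_PARALLELISM_CAP) {
        errors.add("max-parallelism must be <= " + MAX_PARALLELISM_CAP);
      }
    }
    for (String[] entry : DURATIONS) {
      Duration d = parseDuration(props.getOrDefault(PREFIX + entry[0], entry[1]));
      if (d.isZero() || d.isNegative()) {
        errors.add(entry[0] + " must be positive");
      }
    }
    Duration idle = parseDuration(props.getOrDefault(PREFIX + "executor-idle-timeout", "5min"));
    if (idle.compareTo(MIN_IDLE_TIMEOUT) < 0) {
      errors.add("executor-idle-timeout must be >= 30s");
    }
    if ("external".equals(container)) {
      errors.add("dynamic allocation is not supported on the external container");
    }
    return errors;
  }

  /** "Effectively enabled" = opted in AND valid; an invalid config is treated as disabled. */
  static boolean effectivelyEnabled(Map<String, String> props, String container) {
    return optedIn(props) && validate(props, container).isEmpty();
  }
}
```

In this sketch, a keeper would skip auto-reset when effectivelyEnabled(...) returns true, and otherwise write its reset value under minParallelismKey(...), which avoids the shadowed-write loop described above. Malformed integers or durations surface as exceptions here; a real implementation would presumably turn them into validation errors instead.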
Out of scope for this phase (later phases): demand-driven scale-up, idle tracking, scale-down + graceful drain, and the new metrics. The config_invalid gauge and scale_*_total counters land in the observability phase.
Parent issue
AIP-5: #4191
Are you willing to submit PR?
Code of Conduct