[Subtask]: AIP-5 Phase 1 — Dynamic allocation configuration, validation, and min-parallelism deprecation #4253

@j1wonpark

Search before asking

  • I have searched in the issues and found no similar issues.

Description

This is the first implementation phase of AIP-5: Dynamic Resource Allocation for Optimizer. It lays the configuration foundation that the later scaling phases build on, with no runtime scaling behavior yet. Dynamic allocation is opt-in and disabled by default (dynamic-allocation.enabled = false), so existing groups are unaffected and this phase is safe to merge independently. The full dynamic-allocation.* property set is declared here so the configuration surface and its validation are settled in one place, rather than drip-fed per phase.

Scope:

  1. Configuration properties. Declare the dynamic-allocation.* properties as constants in OptimizerProperties, each annotated with @since:

    • dynamic-allocation.enabled (default false)
    • dynamic-allocation.min-parallelism (default 0)
    • dynamic-allocation.max-parallelism (required when enabled; hard-capped at 1024)
    • dynamic-allocation.scheduler-backlog-timeout (default 1min)
    • dynamic-allocation.sustained-backlog-timeout (default 30s)
    • dynamic-allocation.executor-idle-timeout (default 5min, minimum 30s)
    • dynamic-allocation.scale-down-cooldown (default 1min)
    • dynamic-allocation.drain-timeout (default 15min)
  2. DynamicAllocationConfig parsing + validation. A new DynamicAllocationConfig.parse(ResourceGroup) / validate() that enforces:

    • max-parallelism is mandatory when enabled;
    • min-parallelism ≤ max-parallelism ≤ 1024;
    • executor-idle-timeout ≥ 30s;
    • all durations parse to positive values;
    • dynamic resource allocation (DRA) is not enabled on an externally-registered optimizer (conservative check in this phase: reject any group that uses the external container).
  3. Failure handling.

    • REST create/update — fail fast. OptimizerGroupController.createResourceGroup / updateResourceGroup reject an invalid DRA config with HTTP 400. (The update path previously had no property-value validation at all.)
    • Startup load — fail safe, never silent. A persisted group with an invalid DRA config no longer crashes AMS; it logs a warning and falls back to DRA-disabled behavior. (The optimizer_group_config_invalid gauge is wired in the observability phase.)
  4. min-parallelism deprecation. The flat min-parallelism property is deprecated in favor of dynamic-allocation.min-parallelism but still honored. Resolution order: namespaced → legacy → 0. A one-off deprecation warning is logged at config-entry points (startup load, REST), not on the keeper hot path.

  5. Auto-reset compatibility. The keeper already auto-resets a group's min-parallelism after optimizer-group.max-keeping-attempts consecutive failed optimizer creations. Two adjustments keep this consistent with the new property model:

    • Disabled when DRA is effectively enabled. Once a group opts into dynamic allocation (and its config is valid), DRA owns the group's scale decisions, so the keeper no longer erodes its min-parallelism floor — it keeps retrying instead. "Effectively enabled" means opted-in and valid; an invalid config is treated as disabled, matching the startup fail-safe, so such groups still auto-reset as before.
    • Writes the key the group actually uses. When auto-reset does apply, it now writes whichever min-parallelism key the group reads from (namespaced if present, else the legacy flat key). Previously it always wrote the flat key; for a group using the namespaced property that write was shadowed by resolution order, turning auto-reset into an endless no-op loop (repeated warning + DB update with no effect).
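The resolution order in item 4 and the key selection in item 5 can be sketched as follows. This is only an illustration under stated assumptions, not the actual keeper code: the class and method names (MinParallelismSketch, keyInUse, resolve, autoReset) are hypothetical, group properties are modelled as a plain Map<String, String>, and the value that auto-reset writes is passed in as a parameter because the issue does not specify it.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch only; not the actual Amoro keeper code. */
final class MinParallelismSketch {
  static final String LEGACY_KEY = "min-parallelism";
  static final String NAMESPACED_KEY = "dynamic-allocation.min-parallelism";

  /** The key the group actually reads from: namespaced if present, else the legacy flat key. */
  static String keyInUse(Map<String, String> props) {
    return props.containsKey(NAMESPACED_KEY) ? NAMESPACED_KEY : LEGACY_KEY;
  }

  /** Resolution order: namespaced, then legacy, then 0. */
  static int resolve(Map<String, String> props) {
    String raw = props.get(keyInUse(props));
    return raw == null ? 0 : Integer.parseInt(raw.trim());
  }

  /** True when only the deprecated flat key is set, i.e. a deprecation warning would be due. */
  static boolean usesDeprecatedKey(Map<String, String> props) {
    return !props.containsKey(NAMESPACED_KEY) && props.containsKey(LEGACY_KEY);
  }

  /**
   * Returns a copy of the properties after an auto-reset attempt.
   * The reset is skipped when DRA is effectively enabled; otherwise it
   * writes the key in use, so the write is never shadowed by resolution order.
   */
  static Map<String, String> autoReset(
      Map<String, String> props, int newValue, boolean draEffectivelyEnabled) {
    Map<String, String> copy = new HashMap<>(props);
    if (!draEffectivelyEnabled) {
      copy.put(keyInUse(props), Integer.toString(newValue));
    }
    return copy;
  }
}
```

With both keys present, resolve reads the namespaced value, and autoReset writes the namespaced key, so the next resolve call sees the new value. Writing the flat key in that situation would leave the resolved value unchanged, which is the no-op loop described above.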

Out of scope for this phase (later phases): demand-driven scale-up, idle tracking, scale-down + graceful drain, and the new metrics. The optimizer_group_config_invalid gauge and scale_*_total counters land in the observability phase.
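For reference, the validation rules in scope item 2 and the "effectively enabled" notion from items 3 and 5 might look roughly like the sketch below. It is not the proposed DynamicAllocationConfig. The class name, the Map-based input, the "external" container name, the supported duration suffixes (s, min), and the choice to skip every check when DRA is disabled are all assumptions made for illustration.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; not the actual Amoro DynamicAllocationConfig. */
final class DynamicAllocationValidationSketch {
  static final String PREFIX = "dynamic-allocation.";
  static final int MAX_PARALLELISM_CAP = 1024;
  static final Duration MIN_IDLE_TIMEOUT = Duration.ofSeconds(30);
  // Assumed name of the container used by externally-registered optimizers.
  static final String EXTERNAL_CONTAINER = "external";

  // Duration properties with the defaults listed in the issue.
  static final String[][] DURATION_DEFAULTS = {
    {"scheduler-backlog-timeout", "1min"},
    {"sustained-backlog-timeout", "30s"},
    {"executor-idle-timeout", "5min"},
    {"scale-down-cooldown", "1min"},
    {"drain-timeout", "15min"},
  };

  /** Parses values such as "30s" or "5min"; the accepted suffixes are an assumption. */
  static Duration parseDuration(String raw) {
    String v = raw.trim().toLowerCase();
    if (v.endsWith("min")) {
      return Duration.ofMinutes(Long.parseLong(v.substring(0, v.length() - 3).trim()));
    }
    if (v.endsWith("s")) {
      return Duration.ofSeconds(Long.parseLong(v.substring(0, v.length() - 1).trim()));
    }
    throw new IllegalArgumentException("unsupported duration: " + raw);
  }

  static boolean isEnabled(Map<String, String> props) {
    return Boolean.parseBoolean(props.getOrDefault(PREFIX + "enabled", "false"));
  }

  /** Returns the list of violations; an empty list means the config is valid. */
  static List<String> validate(Map<String, String> props, String container) {
    List<String> errors = new ArrayList<>();
    if (!isEnabled(props)) {
      return errors; // assumption: nothing is enforced while DRA is disabled
    }
    if (EXTERNAL_CONTAINER.equals(container)) {
      errors.add("dynamic allocation is not supported on the external container");
    }
    try {
      String max = props.get(PREFIX + "max-parallelism");
      if (max == null) {
        errors.add("max-parallelism is required when dynamic allocation is enabled");
      } else {
        int maxP = Integer.parseInt(max.trim());
        int minP = Integer.parseInt(props.getOrDefault(PREFIX + "min-parallelism", "0").trim());
        if (minP > maxP || maxP > MAX_PARALLELISM_CAP) {
          errors.add("require min-parallelism <= max-parallelism <= " + MAX_PARALLELISM_CAP);
        }
      }
    } catch (NumberFormatException e) {
      errors.add("parallelism values must be integers");
    }
    for (String[] d : DURATION_DEFAULTS) {
      try {
        Duration value = parseDuration(props.getOrDefault(PREFIX + d[0], d[1]));
        if (value.isZero() || value.isNegative()) {
          errors.add(d[0] + " must be positive");
        } else if (d[0].equals("executor-idle-timeout") && value.compareTo(MIN_IDLE_TIMEOUT) < 0) {
          errors.add("executor-idle-timeout must be at least 30s");
        }
      } catch (IllegalArgumentException e) { // also covers NumberFormatException
        errors.add(d[0] + " is not a valid duration");
      }
    }
    return errors;
  }

  /** Opted in and valid; an invalid config counts as disabled (startup fail-safe). */
  static boolean effectivelyEnabled(Map<String, String> props, String container) {
    return isEnabled(props) && validate(props, container).isEmpty();
  }
}
```

In this sketch a REST handler would turn a non-empty result from validate into an HTTP 400, while the startup loader would log the violations and proceed as if effectivelyEnabled returned false.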

Parent issue

AIP-5: #4191

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
