Add retry policy engine and matching primitives#4804

Open
dejanzele wants to merge 1 commit into armadaproject:master from dejanzele:retry-engine-config

@dejanzele
Member

What type of PR is this?

Feature (retry policy PR 1 of 4)

What this PR does / why we need it

Adds the retry policy evaluation engine as self-contained, dead code. Nothing calls it yet - this PR provides the building blocks that the scheduler wiring PR connects.

  • Adds internal/scheduler/retry/ package with the core engine:
    • types.go - Policy, Rule, Result, Action types with CompileRules() for pre-compiling regex patterns
    • extract.go - Four extraction functions that pull condition, exit code, termination message, and categories from *armadaevents.Error, preferring FailureInfo with fallback to ContainerError
    • matcher.go - matchRule (AND logic across all non-empty fields) and matchRules (first-match-wins iteration)
    • engine.go - Engine.Evaluate() with global cap check, rule matching, default action fallback, and per-policy retry limit
  • Adds RetryPolicyConfig to scheduler configuration (feature flag + global max retries, disabled by default)
  • Reuses errormatch.ExitCodeMatcher, errormatch.RegexMatcher, and errormatch.MatchExitCode/MatchPattern from internal/common/errormatch/
  • Condition extraction derives failure conditions directly from the Error proto oneof + KubernetesReason, no proto changes needed
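
The "prefer FailureInfo, fall back to ContainerError" behaviour described for extract.go can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `runError`, `failureInfo`, `containerError` and their fields are hypothetical stand-ins for the relevant parts of `*armadaevents.Error`, and only the exit-code extractor is shown.

```go
package main

import "fmt"

// failureInfo and containerError are hypothetical stand-ins for the
// corresponding parts of armadaevents.Error; the field names are assumptions.
type failureInfo struct {
	ExitCode int32
}

type containerError struct {
	ExitCode int32
}

type runError struct {
	FailureInfo    *failureInfo
	ContainerError *containerError
}

// extractExitCode returns the exit code from FailureInfo when it is set,
// otherwise from ContainerError. Zero means "not set" (the proto3 default).
func extractExitCode(e *runError) int32 {
	if e == nil {
		return 0
	}
	if e.FailureInfo != nil && e.FailureInfo.ExitCode != 0 {
		return e.FailureInfo.ExitCode
	}
	if e.ContainerError != nil {
		return e.ContainerError.ExitCode
	}
	return 0
}

func main() {
	both := &runError{
		FailureInfo:    &failureInfo{ExitCode: 137},
		ContainerError: &containerError{ExitCode: 1},
	}
	fallback := &runError{ContainerError: &containerError{ExitCode: 2}}

	fmt.Println(extractExitCode(both))     // FailureInfo wins
	fmt.Println(extractExitCode(fallback)) // falls back to ContainerError
	fmt.Println(extractExitCode(nil))      // nil error yields "not set"
}
```

The same shape would presumably apply to the termination message and categories extractors, each with its own zero value standing for "not set".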

Which issue(s) this PR fixes

Part of #4683 (Retry Policy)

Special notes for your reviewer

  • This is retry policy PR 1 of 4: Engine + config (this) and CRUD + armadactl (independent) -> Scheduler wiring -> Backoff + pod naming
  • Depends on #4741, "Add error categorization proto schema and executor classifier" (error categorization proto + errormatch package)
  • This is entirely dead code with no behavior change. Focus review on matching correctness and edge case handling.
  • Exit code 0 never matches: 0 is the proto3 default, so it is treated as "not set"
  • RetryLimit: 0 means unlimited retries (subject to global cap)
  • Rule matching uses AND logic: all non-empty fields must match. Rules are evaluated in order, first match wins.
  • 41 test cases covering all match types, edge cases, nil handling, and regex compilation

@greptile-apps
Contributor

greptile-apps bot commented Mar 30, 2026

Greptile Summary

This PR adds the retry policy evaluation engine as dead code — internal/scheduler/retry/ package with engine, matcher, extractor, and type definitions, plus RetryPolicyConfig wired into the scheduler configuration. The logic is sound and test coverage is thorough (41 cases), but there is one build-blocking issue: errormatch.ConditionAppError is referenced in extract.go and the tests but is not defined in internal/common/errormatch/types.go, which will prevent the package from compiling.

Confidence Score: 4/5

Do not merge until the missing errormatch.ConditionAppError constant is added — the package will not compile.

A single P1 compile error prevents the package from building. All other logic (matching, extraction, limit semantics) is correct and well-tested. Once the missing constant is added to the errormatch package, the remaining feedback is P2 hardening (Rule.Action validation in validateRule).

internal/scheduler/retry/extract.go and internal/common/errormatch/types.go — the constant ConditionAppError must be added to the errormatch package.

Important Files Changed

| Filename | Overview |
| --- | --- |
| internal/scheduler/retry/extract.go | References undefined errormatch.ConditionAppError (compile error); all other extraction logic is correct with appropriate nil and zero-value guards |
| internal/scheduler/retry/engine.go | Evaluation engine with correct globalMaxRetries=0 (unlimited) and RetryLimit=0 (unlimited) guards; action dispatch is safe given ValidatePolicy enforces DefaultAction |
| internal/scheduler/retry/types.go | Policy/Rule types with CompileRules() and validateRule(); validates match fields and operators but does not validate Rule.Action |
| internal/scheduler/retry/matcher.go | AND-logic matchRule guarding on compiledPattern != nil rather than OnTerminationMessage != nil; correct after CompileRules is called |
| internal/scheduler/retry/engine_test.go | 41 test cases covering all main evaluation paths, limit semantics, AND logic, first-match-wins, nil error, and globalMaxRetries=0; also exercises undefined ConditionAppError |
| internal/scheduler/retry/extract_test.go | Thorough extraction tests for all five KubernetesReasons and fallback paths; references undefined errormatch.ConditionAppError |
| internal/scheduler/configuration/retry.go | New RetryPolicyConfig struct with Enabled flag and GlobalMaxRetries; clean and minimal |
| internal/scheduler/configuration/configuration.go | Adds RetryPolicyConfig field to SchedulingConfig; straightforward additive change |
| config/scheduler/config.yaml | Adds retryPolicy block with enabled=false and globalMaxRetries=5 as safe opt-in defaults |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[engine.Evaluate called] --> B{runError == nil?}
    B -->|yes| C[Return: no error information]
    B -->|no| D{globalMaxRetries > 0 AND totalRuns >= globalMaxRetries?}
    D -->|yes| E[Return: global max retries exceeded]
    D -->|no| F[Extract: condition, exitCode, terminationMessage, categories]
    F --> G[matchRules: iterate rules in order]
    G --> H{Rule matched?}
    H -->|yes| I[action = rule.Action]
    H -->|no| J[action = policy.DefaultAction]
    I --> K{action == ActionFail?}
    J --> K
    K -->|yes| L[Return: ShouldRetry=false]
    K -->|no| M{RetryLimit > 0 AND failureCount >= RetryLimit?}
    M -->|yes| N[Return: policy retry limit exceeded]
    M -->|no| O[Return: ShouldRetry=true]

    subgraph matchRule [matchRule — AND logic]
        P[OnConditions set?] -->|yes| Q{condition in list?}
        Q -->|no| R[no match]
        P -->|no| S[OnExitCodes set?]
        Q -->|yes| S
        S -->|yes| T{MatchExitCode?}
        T -->|no| R
        S -->|no| U[compiledPattern set?]
        T -->|yes| U
        U -->|yes| V{MatchPattern?}
        V -->|no| R
        U -->|no| W[OnCategories set?]
        V -->|yes| W
        W -->|yes| X{any category overlap?}
        X -->|no| R
        X -->|yes| Y[match]
        W -->|no| Y
    end
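
The top half of the flowchart translates fairly directly into code. The sketch below is an assumption-laden illustration, not the engine from the PR: `policy`, `Result`, the `Action` string values, the `evaluate` signature, and the reason strings for the Fail and success branches are all hypothetical, and rule matching is reduced to a pre-computed `matched` action.

```go
package main

import "fmt"

type Action string

const (
	ActionRetry Action = "Retry" // string values are assumptions
	ActionFail  Action = "Fail"
)

// Result mirrors the idea of the PR's Result type; the field set is assumed.
type Result struct {
	ShouldRetry bool
	Reason      string
}

// policy is a hypothetical, trimmed-down Policy.
type policy struct {
	RetryLimit    uint
	DefaultAction Action
}

// evaluate follows the flowchart. matched is the action of the first matching
// rule, or "" when no rule matched.
func evaluate(p policy, hasError bool, matched Action, totalRuns, failureCount, globalMaxRetries uint) Result {
	if !hasError {
		// Assumed here to mean "do not retry".
		return Result{Reason: "no error information"}
	}
	// A global cap of 0 disables the check.
	if globalMaxRetries > 0 && totalRuns >= globalMaxRetries {
		return Result{Reason: "global max retries exceeded"}
	}
	action := p.DefaultAction
	if matched != "" {
		action = matched
	}
	if action == ActionFail {
		return Result{Reason: "action is Fail"}
	}
	// A RetryLimit of 0 means unlimited, subject to the global cap above.
	if p.RetryLimit > 0 && failureCount >= p.RetryLimit {
		return Result{Reason: "policy retry limit exceeded"}
	}
	return Result{ShouldRetry: true, Reason: "retry allowed"}
}

func main() {
	p := policy{RetryLimit: 3, DefaultAction: ActionFail}
	fmt.Printf("%+v\n", evaluate(p, true, ActionRetry, 1, 1, 5)) // rule says retry, under both limits
	fmt.Printf("%+v\n", evaluate(p, true, ActionRetry, 5, 1, 5)) // global cap reached
	fmt.Printf("%+v\n", evaluate(p, true, "", 1, 1, 5))          // no rule matched, default action applies
	fmt.Printf("%+v\n", evaluate(p, true, ActionRetry, 1, 3, 5)) // per-policy limit reached
}
```

Checking the global cap before any extraction or matching keeps the cheapest rejection first, which matches the ordering in the diagram.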


Reviews (10): Last reviewed commit: "Add retry policy engine and matching pri..."

dejanzele force-pushed the retry-engine-config branch from bca4896 to 1ed96c9 on March 30, 2026 13:27
dejanzele force-pushed the retry-engine-config branch from 1ed96c9 to ec73ad8 on March 30, 2026 14:21
dejanzele force-pushed the retry-engine-config branch from ec73ad8 to a3d9e04 on March 30, 2026 14:48
@dejanzele
Member Author

@greptileai

dejanzele force-pushed the retry-engine-config branch 2 times, most recently from e39ec2f to 33b0802, on March 30, 2026 16:14
@dejanzele
Member Author

@greptileai

dejanzele force-pushed the retry-engine-config branch 2 times, most recently from 995056d to cd5ea06, on March 31, 2026 10:41
Introduces a policy-based retry engine that evaluates Error protos against
configurable rules to decide whether a failed job run should be retried.
This is dead code - nothing calls the engine yet.

- types.go: Policy, Rule, Result, Action types with regex pre-compilation
- extract.go: extract condition, exit code, termination message, categories
  from armadaevents.Error (FailureInfo preferred, ContainerError fallback)
- matcher.go: AND-logic rule matching, first-match-wins rule list evaluation
- engine.go: Evaluate() with global cap, policy retry limit, and rule matching
- configuration: RetryPolicyConfig type, wired into SchedulingConfig
- config.yaml: default retryPolicy (disabled, globalMaxRetries=20)

Signed-off-by: Dejan Zele Pejchev <dejan.pejchev@gmail.com>
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
dejanzele force-pushed the retry-engine-config branch from cd5ea06 to b8aecc4 on April 7, 2026 13:11
case armadaevents.KubernetesReason_DeadlineExceeded:
    return errormatch.ConditionDeadlineExceeded
case armadaevents.KubernetesReason_AppError:
    return errormatch.ConditionAppError
Contributor

P1 Missing errormatch.ConditionAppError constant — compile error

errormatch.ConditionAppError is referenced here but is not defined anywhere in internal/common/errormatch/types.go, which defines the other condition constants (ConditionOOMKilled, ConditionEvicted, ConditionDeadlineExceeded, ConditionPreempted, ConditionLeaseReturned). This is a build failure. The same undefined symbol appears in extract_test.go:51 and engine_test.go:151, 271, 272, 386, 405. Add the missing constant to internal/common/errormatch/types.go:

const (
    ConditionAppError = "AppError"
)
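
For context, the switch in the excerpt only compiles once every condition constant it returns exists. A self-contained sketch of that mapping follows; `kubernetesReason` and its members are hypothetical stand-ins for `armadaevents.KubernetesReason`, the constants are declared locally instead of in the errormatch package, and the `"DeadlineExceeded"` string value is an assumption (only `"AppError"` appears in the suggestion above).

```go
package main

import "fmt"

// kubernetesReason is a hypothetical stand-in for armadaevents.KubernetesReason.
// Only the two members visible in the diff excerpt are modelled.
type kubernetesReason int

const (
	reasonDeadlineExceeded kubernetesReason = iota
	reasonAppError
	reasonUnknown
)

const (
	ConditionDeadlineExceeded = "DeadlineExceeded" // assumed value
	ConditionAppError         = "AppError"         // value proposed in the review comment
)

// conditionFor maps a Kubernetes reason to a condition string, returning ""
// when the reason has no corresponding condition.
func conditionFor(r kubernetesReason) string {
	switch r {
	case reasonDeadlineExceeded:
		return ConditionDeadlineExceeded
	case reasonAppError:
		return ConditionAppError
	default:
		return ""
	}
}

func main() {
	fmt.Println(conditionFor(reasonDeadlineExceeded))
	fmt.Println(conditionFor(reasonAppError))
	fmt.Printf("%q\n", conditionFor(reasonUnknown))
}
```

Removing the `ConditionAppError` declaration from this sketch reproduces the kind of "undefined" build failure the comment describes.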
