Add retry policy engine and matching primitives #4804
dejanzele wants to merge 1 commit into armadaproject:master
Greptile Summary
This PR adds the retry policy evaluation engine as dead code.
Confidence Score: 4/5
Do not merge until the missing errormatch.ConditionAppError constant is added; the package will not compile.
A single P1 compile error prevents the package from building. All other logic (matching, extraction, limit semantics) is correct and well-tested. Once the missing constant is added to the errormatch package, the remaining feedback is P2 hardening (Rule.Action validation in validateRule).
Files needing attention: internal/scheduler/retry/extract.go and internal/common/errormatch/types.go. The constant ConditionAppError must be added to the errormatch package.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[engine.Evaluate called] --> B{runError == nil?}
B -->|yes| C[Return: no error information]
B -->|no| D{globalMaxRetries > 0 AND totalRuns >= globalMaxRetries?}
D -->|yes| E[Return: global max retries exceeded]
D -->|no| F[Extract: condition, exitCode, terminationMessage, categories]
F --> G[matchRules: iterate rules in order]
G --> H{Rule matched?}
H -->|yes| I[action = rule.Action]
H -->|no| J[action = policy.DefaultAction]
I --> K{action == ActionFail?}
J --> K
K -->|yes| L[Return: ShouldRetry=false]
K -->|no| M{RetryLimit > 0 AND failureCount >= RetryLimit?}
M -->|yes| N[Return: policy retry limit exceeded]
M -->|no| O[Return: ShouldRetry=true]
subgraph matchRule [matchRule — AND logic]
P[OnConditions set?] -->|yes| Q{condition in list?}
Q -->|no| R[no match]
P -->|no| S[OnExitCodes set?]
Q -->|yes| S
S -->|yes| T{MatchExitCode?}
T -->|no| R
S -->|no| U[compiledPattern set?]
T -->|yes| U
U -->|yes| V{MatchPattern?}
V -->|no| R
U -->|no| W[OnCategories set?]
V -->|yes| W
W -->|yes| X{any category overlap?}
X -->|no| R
X -->|yes| Y[match]
W -->|no| Y
end
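The decision flow in the flowchart above can be sketched in Go roughly as follows. This is a minimal illustration, not the PR's code: the `Input` struct, the field names, and the string reasons are assumptions made for the example, and rule matching is reduced to an optional `RuleAction` (nil meaning no rule matched). Treating "no error information" as a non-retry outcome is also an assumption.

```go
package main

// Action is the outcome a policy or rule selects. Names are illustrative.
type Action string

const (
	ActionFail  Action = "Fail"
	ActionRetry Action = "Retry"
)

// Policy is a simplified stand-in for the PR's Policy type.
type Policy struct {
	RetryLimit    uint   // 0 means unlimited, still subject to the global cap
	DefaultAction Action // applied when no rule matches
}

// Input bundles what the sketch needs; it is hypothetical, not the real signature.
type Input struct {
	HasError         bool    // false models runError == nil
	GlobalMaxRetries uint    // 0 disables the global cap
	TotalRuns        uint    // runs so far, compared against the global cap
	FailureCount     uint    // failures so far, compared against RetryLimit
	RuleAction       *Action // action of the first matching rule, nil if none matched
}

// Result mirrors the flowchart's terminal nodes.
type Result struct {
	ShouldRetry bool
	Reason      string
}

// evaluate walks the flowchart top to bottom.
func evaluate(p Policy, in Input) Result {
	if !in.HasError {
		// Assumption: without error information the run is not retried.
		return Result{Reason: "no error information"}
	}
	if in.GlobalMaxRetries > 0 && in.TotalRuns >= in.GlobalMaxRetries {
		return Result{Reason: "global max retries exceeded"}
	}
	// A matched rule overrides the policy default.
	action := p.DefaultAction
	if in.RuleAction != nil {
		action = *in.RuleAction
	}
	if action == ActionFail {
		return Result{Reason: "action is Fail"}
	}
	if p.RetryLimit > 0 && in.FailureCount >= p.RetryLimit {
		return Result{Reason: "policy retry limit exceeded"}
	}
	return Result{ShouldRetry: true, Reason: "retry"}
}
```

Note the ordering: the global cap is checked before any rule is consulted, so a matching retry rule cannot override it, while the per-policy limit only applies once the chosen action is a retry.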
Introduces a policy-based retry engine that evaluates Error protos against configurable rules to decide whether a failed job run should be retried. This is dead code - nothing calls the engine yet.

- types.go: Policy, Rule, Result, Action types with regex pre-compilation
- extract.go: extract condition, exit code, termination message, categories from armadaevents.Error (FailureInfo preferred, ContainerError fallback)
- matcher.go: AND-logic rule matching, first-match-wins rule list evaluation
- engine.go: Evaluate() with global cap, policy retry limit, and rule matching
- configuration: RetryPolicyConfig type, wired into SchedulingConfig
- config.yaml: default retryPolicy (disabled, globalMaxRetries=20)

Signed-off-by: Dejan Zele Pejchev <dejan.pejchev@gmail.com>
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
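The matcher.go behaviour described in the commit message (AND logic within a rule, first match wins across the rule list) could look something like the sketch below. The struct shapes are assumptions for illustration: the real PR uses errormatch.ExitCodeMatcher and errormatch.RegexMatcher, whereas this sketch substitutes a plain exit-code list and a pre-compiled `*regexp.Regexp`. `exampleRules` and the `failure` type are hypothetical.

```go
package main

import "regexp"

// Rule is a simplified stand-in; unset (empty or nil) fields are ignored when matching.
type Rule struct {
	OnConditions []string
	OnExitCodes  []int32        // simplification of the PR's exit code matcher
	Pattern      *regexp.Regexp // pre-compiled termination message pattern
	OnCategories []string
	Action       string
}

// failure holds the values extracted from a failed run (hypothetical type).
type failure struct {
	Condition          string
	ExitCode           int32
	TerminationMessage string
	Categories         []string
}

func containsString(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

func containsInt32(xs []int32, x int32) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

// matchRule requires every field that is set to match (AND logic).
// A rule with no fields set therefore matches any failure.
func matchRule(r Rule, f failure) bool {
	if len(r.OnConditions) > 0 && !containsString(r.OnConditions, f.Condition) {
		return false
	}
	if len(r.OnExitCodes) > 0 && !containsInt32(r.OnExitCodes, f.ExitCode) {
		return false
	}
	if r.Pattern != nil && !r.Pattern.MatchString(f.TerminationMessage) {
		return false
	}
	if len(r.OnCategories) > 0 {
		overlap := false
		for _, c := range f.Categories {
			if containsString(r.OnCategories, c) {
				overlap = true
				break
			}
		}
		if !overlap {
			return false
		}
	}
	return true
}

// matchRules returns the first rule that matches, in list order.
func matchRules(rules []Rule, f failure) (Rule, bool) {
	for _, r := range rules {
		if matchRule(r, f) {
			return r, true
		}
	}
	return Rule{}, false
}

// exampleRules builds a small rule list for demonstration.
func exampleRules() []Rule {
	return []Rule{
		{OnConditions: []string{"OOMKilled"}, Action: "Fail"},
		{OnExitCodes: []int32{137}, Pattern: regexp.MustCompile("timeout"), Action: "Retry"},
		{Action: "Retry"}, // no fields set: acts as a catch-all
	}
}
```

Because evaluation stops at the first match, rule order matters: in `exampleRules` an OOMKilled failure with exit code 137 hits the first rule and never reaches the second.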
case armadaevents.KubernetesReason_DeadlineExceeded:
	return errormatch.ConditionDeadlineExceeded
case armadaevents.KubernetesReason_AppError:
	return errormatch.ConditionAppError
Missing errormatch.ConditionAppError constant: compile error
errormatch.ConditionAppError is referenced here but is not defined anywhere in internal/common/errormatch/types.go, which defines the other condition constants (ConditionOOMKilled, ConditionEvicted, ConditionDeadlineExceeded, ConditionPreempted, ConditionLeaseReturned). This is a build failure. The same undefined symbol appears in extract_test.go:51 and engine_test.go:151, 271, 272, 386, 405. Add the missing constant to internal/common/errormatch/types.go:
const (
	ConditionAppError = "AppError"
)
What type of PR is this?
Feature (retry policy PR 1 of 4)
What this PR does / why we need it
Adds the retry policy evaluation engine as self-contained, dead code. Nothing calls it yet - this PR provides the building blocks that the scheduler wiring PR connects.
- internal/scheduler/retry/ package with the core engine:
  - types.go - Policy, Rule, Result, Action types with CompileRules() for pre-compiling regex patterns
  - extract.go - Four extraction functions that pull condition, exit code, termination message, and categories from *armadaevents.Error, preferring FailureInfo with fallback to ContainerError
  - matcher.go - matchRule (AND logic across all non-empty fields) and matchRules (first-match-wins iteration)
  - engine.go - Engine.Evaluate() with global cap check, rule matching, default action fallback, and per-policy retry limit
- RetryPolicyConfig added to scheduler configuration (feature flag + global max retries, disabled by default)
- Uses errormatch.ExitCodeMatcher, errormatch.RegexMatcher, and errormatch.MatchExitCode / MatchPattern from internal/common/errormatch/
Which issue(s) this PR fixes
Part of #4683 (Retry Policy)
Special notes for your reviewer
RetryLimit: 0 means unlimited retries (subject to global cap)
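A small sketch of how the two limits interact, assuming (as the flowchart's conditions suggest) that a value of 0 disables the corresponding check. The helper names `limitReached` and `retryBlocked` are hypothetical and do not appear in the PR.

```go
package main

// limitReached reports whether count has hit limit.
// A limit of 0 is treated as "no limit" and never reports true.
func limitReached(limit, count uint) bool {
	return limit > 0 && count >= limit
}

// retryBlocked reports whether either cap prevents another retry.
// The global cap is compared against total runs, the policy limit against failures.
func retryBlocked(globalMaxRetries, totalRuns, retryLimit, failureCount uint) bool {
	return limitReached(globalMaxRetries, totalRuns) || limitReached(retryLimit, failureCount)
}
```

With the default globalMaxRetries=20, a policy with RetryLimit: 0 keeps retrying until the global cap is reached, for example `retryBlocked(20, 5, 0, 5)` is false while `retryBlocked(20, 20, 0, 20)` is true.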