Fix scheduler recording rules: typo, dangling reference, increase windows by dejanzele · Pull Request #4983 · armadaproject/armada

dejanzele · 2026-06-26T13:04:01Z

The scheduler scheduler-prometheusrule.yaml has a few pre-existing bugs that make some recording rules record nothing or compute the wrong window. These are independent of any metric change, so they are split out here. All recorded-series names are left unchanged, so this is backwards compatible for existing dashboards and alerts.

Fixes, per rule:

cluster_category_subCategory:armada_scheduler_failed_jobs read armada_scheduler_error_classification_by_node (missing job_), a metric that does not exist, so it recorded nothing. Now reads armada_scheduler_job_error_classification_by_node.
node:armada_scheduler_failed_jobs:increase{1m,10m,1h} referenced node:armada_scheduler_job_state_counter_by_queue, a series no rule records, so node failed-increase and the node failure-rate built on it recorded nothing. Now reference the base node:armada_scheduler_failed_jobs record.
node:armada_scheduler_failed_jobs:increase1h and queue_category_subCategory:armada_scheduler_succeeded_jobs:increase{10m,1h} used a shorter subquery window than their record name (for example [1m:] under an increase10m record). Windows now match the record name.

The subCategory casing in the recorded-series names is intentionally left as is to keep the series backwards compatible.

…dows Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>

greptile-apps · 2026-06-26T13:05:49Z

Greptile Summary

This PR corrects five pre-existing bugs in the scheduler's PrometheusRule recording rules — a metric-name typo, dangling series references, and mismatched subquery windows — without renaming any recorded series, so it is fully backwards-compatible.

Typo fix: armada_scheduler_error_classification_by_node → armada_scheduler_job_error_classification_by_node, restoring data to cluster_category_subCategory:armada_scheduler_failed_jobs.
Dangling-reference fix: the three node:armada_scheduler_failed_jobs:increase* rules previously scraped a non-existent series (node:armada_scheduler_job_state_counter_by_queue); they now correctly reference the node:armada_scheduler_failed_jobs recording rule, which also fixes the failure-rate rules downstream.
Window fixes: node:armada_scheduler_failed_jobs:increase1h now uses [1h:] (was [10m:]), and the two queue_category_subCategory:armada_scheduler_succeeded_jobs:increase{10m,1h} rules now use their correct windows instead of [1m:].

Confidence Score: 5/5

Safe to merge — all changes are targeted corrections to broken PromQL expressions with no side effects on other rules or dashboards.

Every change maps directly to a clearly stated bug: a typo in a metric name, three references to a series that was never recorded, and two subquery windows shorter than their record names imply. The fixes are exact and mechanical — no logic is restructured, no series are renamed, and the downstream rate rules (which depend on the fixed increase rules) will automatically benefit. The rest of the file is unchanged and correct.

No files require special attention.

Important Files Changed

Filename	Overview
deployment/scheduler/templates/scheduler-prometheusrule.yaml	Fixes five bugs: metric-name typo, three dangling series references in node failure-increase rules, and two mismatched subquery windows in queue succeeded-increase and node failure-increase rules. All changes are mechanically correct and no recorded-series names are altered.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[armada_scheduler_job_state_counter_by_node] -->|state=failed| B[node:armada_scheduler_failed_jobs]
    C[armada_scheduler_job_error_classification_by_node] -->|FIXED: was missing job_| D[cluster_category_subCategory:\narmada_scheduler_failed_jobs]
    E[armada_scheduler_job_error_classification_by_queue] --> F[queue_category_subCategory:\narmada_scheduler_failed_jobs]

    B -->|FIXED: was dangling ref\nto non-existent series| G[node:armada_scheduler_failed_jobs:increase1m]
    B -->|FIXED: was dangling ref| H[node:armada_scheduler_failed_jobs:increase10m]
    B -->|FIXED: was dangling ref\n+ wrong window 10m->1h| I[node:armada_scheduler_failed_jobs:increase1h]

    G --> J[node:armada_scheduler_failed_rate_jobs:increase1m]
    H --> K[node:armada_scheduler_failed_rate_jobs:increase10m]
    I --> L[node:armada_scheduler_failed_rate_jobs:increase1h]

    M[armada_scheduler_job_state_counter_by_queue] -->|state=succeeded| N[queue_category_subCategory:\narmada_scheduler_succeeded_jobs]
    N --> O[...:increase1m]
    N -->|FIXED: window 1m->10m| P[...:increase10m]
    N -->|FIXED: window 1m->1h| Q[...:increase1h]

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[armada_scheduler_job_state_counter_by_node] -->|state=failed| B[node:armada_scheduler_failed_jobs]
    C[armada_scheduler_job_error_classification_by_node] -->|FIXED: was missing job_| D[cluster_category_subCategory:\narmada_scheduler_failed_jobs]
    E[armada_scheduler_job_error_classification_by_queue] --> F[queue_category_subCategory:\narmada_scheduler_failed_jobs]

    B -->|FIXED: was dangling ref\nto non-existent series| G[node:armada_scheduler_failed_jobs:increase1m]
    B -->|FIXED: was dangling ref| H[node:armada_scheduler_failed_jobs:increase10m]
    B -->|FIXED: was dangling ref\n+ wrong window 10m->1h| I[node:armada_scheduler_failed_jobs:increase1h]

    G --> J[node:armada_scheduler_failed_rate_jobs:increase1m]
    H --> K[node:armada_scheduler_failed_rate_jobs:increase10m]
    I --> L[node:armada_scheduler_failed_rate_jobs:increase1h]

    M[armada_scheduler_job_state_counter_by_queue] -->|state=succeeded| N[queue_category_subCategory:\narmada_scheduler_succeeded_jobs]
    N --> O[...:increase1m]
    N -->|FIXED: window 1m->10m| P[...:increase10m]
    N -->|FIXED: window 1m->1h| Q[...:increase1h]

_{Reviews (1): Last reviewed commit: "Fix scheduler recording rules: typo, dan..." | Re-trigger Greptile}

…ove TrackedErrorRegexes (#4980) ## What Armada exposes now `armada_scheduler_job_error_classification_by_queue` and `_by_node` now label failures with the semantic category from error categorization, read off the `Error` proto (`FailureCategory`/`FailureSubcategory`) instead of a regex match against the message. The metric names and label sets are unchanged. Only the label values change. ``` armada_scheduler_job_error_classification_by_queue{queue="analytics", category="user_error", subcategory=""} 2 armada_scheduler_job_error_classification_by_queue{queue="analytics", category="internal", subcategory="lease-expired"} 1 armada_scheduler_job_error_classification_by_node{node="worker-1", cluster="c1", category="user_error", subcategory=""} 2 ``` `category` was the `Error.Reason` type (`podError`, `leaseExpired`, ...) and is now the semantic category, so dashboards filtering on the old values need updating. `subcategory` was the first matching regex (empty in practice) and is now `FailureSubcategory`. This replaces `trackedErrorRegexes`, which is removed along with it: the `scheduler.metrics.trackedErrorRegexes` config and the `errorTypeAndMessageFromError` / `errorRegexes` plumbing in `metrics.New`. It was never set in-repo, so a deployment still setting it just logs an "unused key" warning. ## Validation End to end on a Helm-deployed stack: failing jobs classified by the executor (`user_error`, `oom`) and a killed-executor run (`internal`/`lease-expired`) all landed on the metric with the expected labels and counts. `max-runs-exceeded` and `job-rejected` are job-level rather than run-level errors, so they do not surface in this metric. The metric names and label sets are unchanged, so the existing PrometheusRule keeps evaluating as before. Pre-existing bugs in that file are fixed separately in #4983. Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>

Fix scheduler recording rules: typo, dangling reference, increase win…

6f4987f

…dows Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>

dejanzele mentioned this pull request Jun 26, 2026

Source scheduler failure metric labels from error categorization, remove TrackedErrorRegexes #4980

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix scheduler recording rules: typo, dangling reference, increase windows#4983

Fix scheduler recording rules: typo, dangling reference, increase windows#4983
dejanzele wants to merge 1 commit into
armadaproject:masterfrom
dejanzele:fix-scheduler-recording-rules

dejanzele commented Jun 26, 2026

Uh oh!

greptile-apps Bot commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

dejanzele commented Jun 26, 2026

Uh oh!

greptile-apps Bot commented Jun 26, 2026

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant