Skip to content

Fix scheduler recording rules: typo, dangling reference, increase windows#4983

Open
dejanzele wants to merge 1 commit into
armadaproject:masterfrom
dejanzele:fix-scheduler-recording-rules
Open

Fix scheduler recording rules: typo, dangling reference, increase windows#4983
dejanzele wants to merge 1 commit into
armadaproject:masterfrom
dejanzele:fix-scheduler-recording-rules

Conversation

@dejanzele

Copy link
Copy Markdown
Member

The scheduler scheduler-prometheusrule.yaml has a few pre-existing bugs that make some recording rules record nothing or compute the wrong window. These are independent of any metric change, so they are split out here. All recorded-series names are left unchanged, so this is backwards compatible for existing dashboards and alerts.

Fixes, per rule:

  • cluster_category_subCategory:armada_scheduler_failed_jobs read armada_scheduler_error_classification_by_node (missing job_), a metric that does not exist, so it recorded nothing. Now reads armada_scheduler_job_error_classification_by_node.
  • node:armada_scheduler_failed_jobs:increase{1m,10m,1h} referenced node:armada_scheduler_job_state_counter_by_queue, a series no rule records, so node failed-increase and the node failure-rate built on it recorded nothing. Now reference the base node:armada_scheduler_failed_jobs record.
  • node:armada_scheduler_failed_jobs:increase1h and queue_category_subCategory:armada_scheduler_succeeded_jobs:increase{10m,1h} used a shorter subquery window than their record name (for example [1m:] under an increase10m record). Windows now match the record name.

The subCategory casing in the recorded-series names is intentionally left as is to keep the series backwards compatible.

…dows

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@greptile-apps

greptile-apps Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR corrects five pre-existing bugs in the scheduler's PrometheusRule recording rules — a metric-name typo, dangling series references, and mismatched subquery windows — without renaming any recorded series, so it is fully backwards-compatible.

  • Typo fix: armada_scheduler_error_classification_by_nodearmada_scheduler_job_error_classification_by_node, restoring data to cluster_category_subCategory:armada_scheduler_failed_jobs.
  • Dangling-reference fix: the three node:armada_scheduler_failed_jobs:increase* rules previously scraped a non-existent series (node:armada_scheduler_job_state_counter_by_queue); they now correctly reference the node:armada_scheduler_failed_jobs recording rule, which also fixes the failure-rate rules downstream.
  • Window fixes: node:armada_scheduler_failed_jobs:increase1h now uses [1h:] (was [10m:]), and the two queue_category_subCategory:armada_scheduler_succeeded_jobs:increase{10m,1h} rules now use their correct windows instead of [1m:].

Confidence Score: 5/5

Safe to merge — all changes are targeted corrections to broken PromQL expressions with no side effects on other rules or dashboards.

Every change maps directly to a clearly stated bug: a typo in a metric name, three references to a series that was never recorded, and two subquery windows shorter than their record names imply. The fixes are exact and mechanical — no logic is restructured, no series are renamed, and the downstream rate rules (which depend on the fixed increase rules) will automatically benefit. The rest of the file is unchanged and correct.

No files require special attention.

Important Files Changed

Filename Overview
deployment/scheduler/templates/scheduler-prometheusrule.yaml Fixes five bugs: metric-name typo, three dangling series references in node failure-increase rules, and two mismatched subquery windows in queue succeeded-increase and node failure-increase rules. All changes are mechanically correct and no recorded-series names are altered.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[armada_scheduler_job_state_counter_by_node] -->|state=failed| B[node:armada_scheduler_failed_jobs]
    C[armada_scheduler_job_error_classification_by_node] -->|FIXED: was missing job_| D[cluster_category_subCategory:\narmada_scheduler_failed_jobs]
    E[armada_scheduler_job_error_classification_by_queue] --> F[queue_category_subCategory:\narmada_scheduler_failed_jobs]

    B -->|FIXED: was dangling ref\nto non-existent series| G[node:armada_scheduler_failed_jobs:increase1m]
    B -->|FIXED: was dangling ref| H[node:armada_scheduler_failed_jobs:increase10m]
    B -->|FIXED: was dangling ref\n+ wrong window 10m->1h| I[node:armada_scheduler_failed_jobs:increase1h]

    G --> J[node:armada_scheduler_failed_rate_jobs:increase1m]
    H --> K[node:armada_scheduler_failed_rate_jobs:increase10m]
    I --> L[node:armada_scheduler_failed_rate_jobs:increase1h]

    M[armada_scheduler_job_state_counter_by_queue] -->|state=succeeded| N[queue_category_subCategory:\narmada_scheduler_succeeded_jobs]
    N --> O[...:increase1m]
    N -->|FIXED: window 1m->10m| P[...:increase10m]
    N -->|FIXED: window 1m->1h| Q[...:increase1h]
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[armada_scheduler_job_state_counter_by_node] -->|state=failed| B[node:armada_scheduler_failed_jobs]
    C[armada_scheduler_job_error_classification_by_node] -->|FIXED: was missing job_| D[cluster_category_subCategory:\narmada_scheduler_failed_jobs]
    E[armada_scheduler_job_error_classification_by_queue] --> F[queue_category_subCategory:\narmada_scheduler_failed_jobs]

    B -->|FIXED: was dangling ref\nto non-existent series| G[node:armada_scheduler_failed_jobs:increase1m]
    B -->|FIXED: was dangling ref| H[node:armada_scheduler_failed_jobs:increase10m]
    B -->|FIXED: was dangling ref\n+ wrong window 10m->1h| I[node:armada_scheduler_failed_jobs:increase1h]

    G --> J[node:armada_scheduler_failed_rate_jobs:increase1m]
    H --> K[node:armada_scheduler_failed_rate_jobs:increase10m]
    I --> L[node:armada_scheduler_failed_rate_jobs:increase1h]

    M[armada_scheduler_job_state_counter_by_queue] -->|state=succeeded| N[queue_category_subCategory:\narmada_scheduler_succeeded_jobs]
    N --> O[...:increase1m]
    N -->|FIXED: window 1m->10m| P[...:increase10m]
    N -->|FIXED: window 1m->1h| Q[...:increase1h]
Loading

Reviews (1): Last reviewed commit: "Fix scheduler recording rules: typo, dan..." | Re-trigger Greptile

dejanzele added a commit that referenced this pull request Jun 26, 2026
…ove TrackedErrorRegexes (#4980)

## What Armada exposes now

`armada_scheduler_job_error_classification_by_queue` and `_by_node` now
label failures with the semantic category from error categorization,
read off the `Error` proto (`FailureCategory`/`FailureSubcategory`)
instead of a regex match against the message. The metric names and label
sets are unchanged. Only the label values change.

```
armada_scheduler_job_error_classification_by_queue{queue="analytics", category="user_error", subcategory=""} 2
armada_scheduler_job_error_classification_by_queue{queue="analytics", category="internal",   subcategory="lease-expired"} 1
armada_scheduler_job_error_classification_by_node{node="worker-1", cluster="c1", category="user_error", subcategory=""} 2
```

`category` was the `Error.Reason` type (`podError`, `leaseExpired`, ...)
and is now the semantic category, so dashboards filtering on the old
values need updating. `subcategory` was the first matching regex (empty
in practice) and is now `FailureSubcategory`.

This replaces `trackedErrorRegexes`, which is removed along with it: the
`scheduler.metrics.trackedErrorRegexes` config and the
`errorTypeAndMessageFromError` / `errorRegexes` plumbing in
`metrics.New`. It was never set in-repo, so a deployment still setting
it just logs an "unused key" warning.

## Validation

End to end on a Helm-deployed stack: failing jobs classified by the
executor (`user_error`, `oom`) and a killed-executor run
(`internal`/`lease-expired`) all landed on the metric with the expected
labels and counts. `max-runs-exceeded` and `job-rejected` are job-level
rather than run-level errors, so they do not surface in this metric.

The metric names and label sets are unchanged, so the existing
PrometheusRule keeps evaluating as before. Pre-existing bugs in that
file are fixed separately in #4983.

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant