Fix scheduler recording rules: typo, dangling reference, increase windows #4983
Open
dejanzele wants to merge 1 commit into
…dows Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Contributor
dejanzele added a commit that referenced this pull request on Jun 26, 2026
…ove TrackedErrorRegexes (#4980)

## What Armada exposes now

`armada_scheduler_job_error_classification_by_queue` and `_by_node` now label failures with the semantic category from error categorization, read off the `Error` proto (`FailureCategory`/`FailureSubcategory`) instead of a regex match against the message. The metric names and label sets are unchanged. Only the label values change.

```
armada_scheduler_job_error_classification_by_queue{queue="analytics", category="user_error", subcategory=""} 2
armada_scheduler_job_error_classification_by_queue{queue="analytics", category="internal", subcategory="lease-expired"} 1
armada_scheduler_job_error_classification_by_node{node="worker-1", cluster="c1", category="user_error", subcategory=""} 2
```

`category` was the `Error.Reason` type (`podError`, `leaseExpired`, ...) and is now the semantic category, so dashboards filtering on the old values need updating. `subcategory` was the first matching regex (empty in practice) and is now `FailureSubcategory`.

This replaces `trackedErrorRegexes`, which is removed along with it: the `scheduler.metrics.trackedErrorRegexes` config and the `errorTypeAndMessageFromError` / `errorRegexes` plumbing in `metrics.New`. It was never set in-repo, so a deployment still setting it just logs an "unused key" warning.

## Validation

End to end on a Helm-deployed stack: failing jobs classified by the executor (`user_error`, `oom`) and a killed-executor run (`internal`/`lease-expired`) all landed on the metric with the expected labels and counts. `max-runs-exceeded` and `job-rejected` are job-level rather than run-level errors, so they do not surface in this metric.

The metric names and label sets are unchanged, so the existing PrometheusRule keeps evaluating as before. Pre-existing bugs in that file are fixed separately in #4983.

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
The scheduler `scheduler-prometheusrule.yaml` has a few pre-existing bugs that make some recording rules record nothing or compute the wrong window. These are independent of any metric change, so they are split out here. All recorded-series names are left unchanged, so this is backwards compatible for existing dashboards and alerts.

Fixes, per rule:

- `cluster_category_subCategory:armada_scheduler_failed_jobs` read `armada_scheduler_error_classification_by_node` (missing `job_`), a metric that does not exist, so it recorded nothing. Now reads `armada_scheduler_job_error_classification_by_node`.
- `node:armada_scheduler_failed_jobs:increase{1m,10m,1h}` referenced `node:armada_scheduler_job_state_counter_by_queue`, a series no rule records, so node failed-increase and the node failure-rate built on it recorded nothing. Now reference the base `node:armada_scheduler_failed_jobs` record.
- `node:armada_scheduler_failed_jobs:increase1h` and `queue_category_subCategory:armada_scheduler_succeeded_jobs:increase{10m,1h}` used a shorter subquery window than their record name (for example `[1m:]` under an `increase10m` record). Windows now match the record name.

The `subCategory` casing in the recorded-series names is intentionally left as is to keep the series backwards compatible.
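
For readers who have not opened the diff, the three kinds of fix can be sketched as recording rules. This is a minimal illustration only: the record names and the source metric name come from the description above, but the group name and every `expr` body (including the `sum by` aggregations) are assumptions and may not match what `scheduler-prometheusrule.yaml` contains.

```yaml
# Illustrative sketch. The expr bodies and the group name are assumed, not copied from the repo.
groups:
  - name: armada-scheduler-example   # hypothetical group name
    rules:
      # Typo fix: read the metric that exists (note the `job_` segment).
      - record: cluster_category_subCategory:armada_scheduler_failed_jobs
        expr: sum by (cluster, category, subcategory) (armada_scheduler_job_error_classification_by_node)

      # Base per-node record that the increase rules build on (assumed aggregation).
      - record: node:armada_scheduler_failed_jobs
        expr: sum by (node, cluster) (armada_scheduler_job_error_classification_by_node)

      # Dangling-reference fix: reference the base record above, not a series no rule records.
      # Window fix: the subquery range matches the suffix of the record name.
      - record: node:armada_scheduler_failed_jobs:increase10m
        expr: increase(node:armada_scheduler_failed_jobs[10m:])
      - record: node:armada_scheduler_failed_jobs:increase1h
        expr: increase(node:armada_scheduler_failed_jobs[1h:])
```

A rule whose `expr` names a series that nothing exports or records still evaluates without error and simply produces no samples, which is why bugs of this kind can go unnoticed until a dashboard panel is found empty.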