
Add histogram metric for error classification duration #4827

Open
dejanzele wants to merge 1 commit into armadaproject:master from dejanzele:feat/classification-histogram

Conversation

dejanzele (Member) commented Apr 9, 2026

Summary

  • Adds armada_executor_error_classification_duration_seconds Prometheus histogram to measure how long error categorization takes in the executor, labeled by queue
  • Instrumented inside Classifier.Classify() to cover both call sites (event reporter and pod issue handler) with a single observation point
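
The single-observation-point pattern described above can be sketched in plain Go. The names `timedClassify` and the stub `classify` (and its returned category) are illustrative stand-ins, not the PR's actual code; the real instrumentation lives inside `Classifier.Classify()` itself:

```go
package main

import (
	"fmt"
	"time"
)

// classify is a stand-in for the body of Classifier.Classify(), which
// matches a failed pod against the configured error categories.
func classify(pod string) []string {
	time.Sleep(time.Millisecond) // simulate condition/regex matching work
	return []string{"example-category"}
}

// timedClassify times the whole call once and reports the elapsed
// seconds through a single observe callback, so both call sites
// (event reporter and pod issue handler) are covered by one
// instrumentation point. In the PR, observe would be the histogram's
// WithLabelValues(queue).Observe.
func timedClassify(pod string, observe func(seconds float64)) []string {
	start := time.Now()
	categories := classify(pod)
	observe(time.Since(start).Seconds())
	return categories
}

func main() {
	var elapsed float64
	cats := timedClassify("failed-pod", func(s float64) { elapsed = s })
	fmt.Println(len(cats), elapsed > 0) // 1 true
}
```

Timing the full method rather than each matcher keeps the overhead of instrumentation itself to a single clock read per classification.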

Motivation

The error categorizer runs on every failed pod in the executor. At scale, we need to monitor the overhead this adds to the failure-reporting path, especially as termination message regex rules grow more complex over time.

The histogram enables:

  • Tracking p99 classification latency per queue via histogram_quantile(0.99, rate(..._bucket[5m]))
  • Monitoring aggregate CPU overhead via rate(..._sum[5m])
  • Alerting when classification exceeds acceptable thresholds (e.g., 50ms+)

Design decisions

  • Buckets: Hand-tuned for two observed regimes: condition/exit-code matching (sub-ms) and regex matching (low ms), with a 50-100ms tail for anomaly detection. 7 explicit buckets keep cardinality low.
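
As a rough illustration of how 7 explicit buckets cover the two regimes, the sketch below shows cumulative bucket selection the way a Prometheus histogram assigns observations. The boundary values are assumptions chosen to match the described regimes (sub-ms, low ms, 50-100 ms tail), not necessarily the PR's exact values:

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical upper bounds (seconds) for the 7 explicit buckets:
// sub-ms condition/exit-code matching, low-ms regex matching, and a
// 50-100ms tail for anomaly detection. Prometheus adds +Inf implicitly.
var buckets = []float64{0.0005, 0.001, 0.005, 0.01, 0.025, 0.05, 0.1}

// bucketIndex returns the index of the first bucket whose upper bound
// is >= the observation, or len(buckets) for the implicit +Inf bucket.
func bucketIndex(seconds float64) int {
	return sort.SearchFloat64s(buckets, seconds)
}

func main() {
	fmt.Println(bucketIndex(0.0003)) // sub-ms regime -> 0
	fmt.Println(bucketIndex(0.002))  // regex regime -> 2
	fmt.Println(bucketIndex(0.2))    // beyond the tail -> 7 (+Inf)
}
```

With 7 bounds plus +Inf, each queue label contributes only 8 `_bucket` series (plus `_sum` and `_count`), which is what keeps cardinality low.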

greptile-apps bot (Contributor) commented Apr 9, 2026

Greptile Summary

This PR instruments Classifier.Classify() with an armada_executor_error_classification_duration_seconds Prometheus histogram, labeled by queue, to track per-queue latency of pod error categorization. The implementation is straightforward but has two related placement issues: start := time.Now() and the subsequent Observe call both execute unconditionally, meaning pods that lack the armada_queue_id label produce a queue="" time series that shouldn't exist according to the stated design.

Confidence Score: 4/5

Safe to merge after addressing the unconditional Observe call for pods without a queue label.

One P1 finding remains: the histogram emits with queue="" for non-Armada pods, creating a spurious time series that contradicts the stated design intent. The time.Now() placement issue was already flagged in a prior thread. Both are low-risk in terms of crashes or data loss, but the incorrect metric label value is a real defect worth fixing before merging.

internal/executor/categorizer/classifier.go — specifically the guard around the Observe call on line 150.
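
A minimal sketch of the guard the review is asking for, in plain Go. The label key `armada_queue_id` matches the one shown in the sequence diagram below; the helper name `queueLabel` and the exact placement of the guard in the final fix are assumptions:

```go
package main

import "fmt"

// queueLabel returns the pod's Armada queue and whether the histogram
// should be observed at all. Skipping pods without the label avoids
// the spurious queue="" series flagged in this review.
func queueLabel(labels map[string]string) (string, bool) {
	q, ok := labels["armada_queue_id"]
	if !ok || q == "" {
		return "", false
	}
	return q, true
}

func main() {
	// Non-Armada pod: no queue label, so skip the Observe call.
	_, observe := queueLabel(map[string]string{"app": "not-armada"})
	fmt.Println(observe) // false

	// Armada pod: observe with the real queue label.
	q, observe := queueLabel(map[string]string{"armada_queue_id": "batch"})
	fmt.Println(q, observe) // batch true
}
```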

Vulnerabilities

No security concerns identified. The change adds a read-only Prometheus metric derived from an existing pod label; no user-controlled input flows into metric names or values beyond the queue label already used throughout the executor.

Important Files Changed

internal/executor/categorizer/classifier.go — Adds the error classification histogram metric; start is set unconditionally (already flagged) and the Observe call emits with queue="" for non-Armada pods, creating a spurious time series.

Sequence Diagram

sequenceDiagram
    participant ER as EventReporter / PodIssueHandler
    participant CL as Classifier.Classify(pod)
    participant PR as Prometheus Histogram

    ER->>CL: Classify(pod)
    CL->>CL: start = time.Now()
    CL->>CL: failedContainers(pod)
    CL->>CL: categoryMatches() for each category
    CL->>PR: WithLabelValues(pod.Labels["armada_queue_id"]).Observe(elapsed)
    Note over CL,PR: queue="" if label absent (spurious series)
    CL-->>ER: []string{matched categories}

Reviews (3): Last reviewed commit: "Add histogram metric for error classific..."

Comment thread internal/executor/categorizer/classifier.go Outdated
dejanzele force-pushed the feat/classification-histogram branch from e960aa2 to d8e0b15 on April 9, 2026 at 14:32
Expose `armada_executor_error_classification_duration_seconds` histogram
per queue to monitor the overhead of the error categorizer on the
executor's failure-reporting path.

The metric is observed inside `Classifier.Classify()`, covering both
call sites (event reporter and pod issue handler) with a single
instrumentation point. Pods without a queue label are skipped to avoid
unbounded cardinality.

Buckets are tuned for two observed regimes: simple condition/exit-code
matching (sub-millisecond) and regex-based termination message matching
(low milliseconds), with a tail up to 100 ms for anomaly detection.

Also adds example regex-based error categories to the local executor
config for development and testing.

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
dejanzele force-pushed the feat/classification-histogram branch from d8e0b15 to 3c8b0ce on April 9, 2026 at 14:36
