
Add histogram metric for error classification duration #4827

Open
dejanzele wants to merge 1 commit into armadaproject:master from dejanzele:feat/classification-histogram

Conversation

dejanzele (Member) commented Apr 9, 2026

Summary

  • Adds armada_executor_error_classification_duration_seconds Prometheus histogram to measure how long error categorization takes in the executor, labeled by queue
  • Instrumented inside Classifier.Classify() to cover both call sites (event reporter and pod issue handler) with a single observation point
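
The single-observation-point pattern described above can be sketched in plain Go. The names `timedClassify` and the stub `classify` (and its returned category) are illustrative stand-ins, not the PR's actual code; the real instrumentation lives inside `Classifier.Classify()` itself:

```go
package main

import (
	"fmt"
	"time"
)

// classify is a stand-in for the body of Classifier.Classify(), which
// matches a failed pod against the configured error categories.
func classify(pod string) []string {
	time.Sleep(time.Millisecond) // simulate condition/regex matching work
	return []string{"example-category"}
}

// timedClassify times the whole call once and reports the elapsed
// seconds through a single observe callback, so both call sites
// (event reporter and pod issue handler) are covered by one
// instrumentation point. In the PR, observe would be the histogram's
// WithLabelValues(queue).Observe.
func timedClassify(pod string, observe func(seconds float64)) []string {
	start := time.Now()
	categories := classify(pod)
	observe(time.Since(start).Seconds())
	return categories
}

func main() {
	var elapsed float64
	cats := timedClassify("failed-pod", func(s float64) { elapsed = s })
	fmt.Println(len(cats), elapsed > 0) // 1 true
}
```

Timing the full method rather than each matcher keeps the overhead of instrumentation itself to a single clock read per classification.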

Motivation

The error categorizer runs on every failed pod in the executor. At scale, we need to monitor the overhead this adds to the failure-reporting path, especially as termination message regex rules grow more complex over time.

The histogram enables:

  • Tracking p99 classification latency per queue via histogram_quantile(0.99, rate(..._bucket[5m]))
  • Monitoring aggregate CPU overhead via rate(..._sum[5m])
  • Alerting when classification exceeds acceptable thresholds (e.g., 50ms+)

Design decisions

  • Buckets: Hand-tuned for two observed regimes: condition/exit-code matching (sub-ms) and regex matching (low ms), with a 50-100ms tail for anomaly detection. 7 explicit buckets keep cardinality low.
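
As a rough illustration of how 7 explicit buckets cover the two regimes, the sketch below shows cumulative bucket selection the way a Prometheus histogram assigns observations. The boundary values are assumptions chosen to match the described regimes (sub-ms, low ms, 50-100 ms tail), not necessarily the PR's exact values:

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical upper bounds (seconds) for the 7 explicit buckets:
// sub-ms condition/exit-code matching, low-ms regex matching, and a
// 50-100ms tail for anomaly detection. Prometheus adds +Inf implicitly.
var buckets = []float64{0.0005, 0.001, 0.005, 0.01, 0.025, 0.05, 0.1}

// bucketIndex returns the index of the first bucket whose upper bound
// is >= the observation, or len(buckets) for the implicit +Inf bucket.
func bucketIndex(seconds float64) int {
	return sort.SearchFloat64s(buckets, seconds)
}

func main() {
	fmt.Println(bucketIndex(0.0003)) // sub-ms regime -> 0
	fmt.Println(bucketIndex(0.002))  // regex regime -> 2
	fmt.Println(bucketIndex(0.2))    // beyond the tail -> 7 (+Inf)
}
```

With 7 bounds plus +Inf, each queue label contributes only 8 `_bucket` series (plus `_sum` and `_count`), which is what keeps cardinality low.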

greptile-apps bot (Contributor) commented Apr 9, 2026

Greptile Summary

This PR instruments Classifier.Classify() with an armada_executor_error_classification_duration_seconds Prometheus histogram, labeled by queue, to track per-queue latency of pod error categorization. The implementation is straightforward but has two related placement issues: start := time.Now() and the subsequent Observe call both execute unconditionally, meaning pods that lack the armada_queue_id label produce a queue="" time series that shouldn't exist according to the stated design.

Confidence Score: 4/5

Safe to merge after addressing the unconditional Observe call for pods without a queue label.

One P1 finding remains: the histogram emits with queue="" for non-Armada pods, creating a spurious time series that contradicts the stated design intent. The time.Now() placement issue was already flagged in a prior thread. Both are low-risk in terms of crashes or data loss, but the incorrect metric label value is a real defect worth fixing before merging.

internal/executor/categorizer/classifier.go — specifically the guard around the Observe call on line 150.
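
A minimal sketch of the guard the review is asking for, in plain Go. The label key `armada_queue_id` matches the one shown in the sequence diagram below; the helper name `queueLabel` and the exact placement of the guard in the final fix are assumptions:

```go
package main

import "fmt"

// queueLabel returns the pod's Armada queue and whether the histogram
// should be observed at all. Skipping pods without the label avoids
// the spurious queue="" series flagged in this review.
func queueLabel(labels map[string]string) (string, bool) {
	q, ok := labels["armada_queue_id"]
	if !ok || q == "" {
		return "", false
	}
	return q, true
}

func main() {
	// Non-Armada pod: no queue label, so skip the Observe call.
	_, observe := queueLabel(map[string]string{"app": "not-armada"})
	fmt.Println(observe) // false

	// Armada pod: observe with the real queue label.
	q, observe := queueLabel(map[string]string{"armada_queue_id": "batch"})
	fmt.Println(q, observe) // batch true
}
```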

Vulnerabilities

No security concerns identified. The change adds a read-only Prometheus metric derived from an existing pod label; no user-controlled input flows into metric names or values beyond the queue label already used throughout the executor.

Important Files Changed

internal/executor/categorizer/classifier.go — Adds the error classification histogram metric; start is set unconditionally (already flagged) and the Observe call emits with queue="" for non-Armada pods, creating a spurious time series.

Sequence Diagram

sequenceDiagram
    participant ER as EventReporter / PodIssueHandler
    participant CL as Classifier.Classify(pod)
    participant PR as Prometheus Histogram

    ER->>CL: Classify(pod)
    CL->>CL: start = time.Now()
    CL->>CL: failedContainers(pod)
    CL->>CL: categoryMatches() for each category
    CL->>PR: WithLabelValues(pod.Labels["armada_queue_id"]).Observe(elapsed)
    Note over CL,PR: queue="" if label absent (spurious series)
    CL-->>ER: []string{matched categories}

Reviews (3): Last reviewed commit: "Add histogram metric for error classific..."

Comment thread internal/executor/categorizer/classifier.go Outdated
dejanzele force-pushed the feat/classification-histogram branch from e960aa2 to d8e0b15 on April 9, 2026 at 14:32
Expose `armada_executor_error_classification_duration_seconds` histogram
per queue to monitor the overhead of the error categorizer on the
executor's failure-reporting path.

The metric is observed inside `Classifier.Classify()`, covering both
call sites (event reporter and pod issue handler) with a single
instrumentation point. Pods without a queue label are skipped to avoid
unbounded cardinality.

Buckets are tuned for two observed regimes: simple condition/exit-code
matching (sub-millisecond) and regex-based termination message matching
(low milliseconds), with a tail up to 100 ms for anomaly detection.

Also adds example regex-based error categories to the local executor
config for development and testing.

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
dejanzele force-pushed the feat/classification-histogram branch from d8e0b15 to 3c8b0ce on April 9, 2026 at 14:36
