Add histogram metric for error classification duration#4827
dejanzele wants to merge 1 commit into armadaproject:master from
Conversation
Greptile Summary

This PR instruments the executor's error categorizer with a duration histogram.

Confidence Score: 4/5 — safe to merge after addressing the unconditional Observe call for pods without a queue label. One P1 finding remains: the histogram emits with queue="" for non-Armada pods, creating a spurious time series that contradicts the stated design intent. See internal/executor/categorizer/classifier.go, specifically the guard around the Observe call on line 150.
| Filename | Overview |
|---|---|
| internal/executor/categorizer/classifier.go | Adds error classification histogram metric; start is set unconditionally (already flagged) and the Observe call emits with queue="" for non-Armada pods, creating a spurious time series. |
Sequence Diagram

```mermaid
sequenceDiagram
    participant ER as EventReporter / PodIssueHandler
    participant CL as Classifier.Classify(pod)
    participant PR as Prometheus Histogram
    ER->>CL: Classify(pod)
    CL->>CL: start = time.Now()
    CL->>CL: failedContainers(pod)
    CL->>CL: categoryMatches() for each category
    CL->>PR: WithLabelValues(pod.Labels["armada_queue_id"]).Observe(elapsed)
    Note over CL,PR: queue="" if label absent (spurious series)
    CL-->>ER: []string{matched categories}
```
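The P1 finding above can be resolved by guarding the observation on the queue label. A minimal sketch of that guard follows; the function name `observeClassification` and the stand-in return values are illustrative, and the real code would call `classificationDuration.WithLabelValues(queue).Observe(...)` on the Prometheus histogram instead (an assumption about the PR's metric variable name):

```go
package main

import "fmt"

// observeClassification sketches the guarded metric emission suggested by the
// review: only pods carrying the armada_queue_id label produce an observation,
// so non-Armada pods never create a spurious queue="" time series.
// In the real classifier this would be (assumption):
//   classificationDuration.WithLabelValues(queue).Observe(seconds)
func observeClassification(podLabels map[string]string, seconds float64) string {
	queue, ok := podLabels["armada_queue_id"]
	if !ok || queue == "" {
		return "skipped" // no queue label: skip, don't emit queue=""
	}
	return fmt.Sprintf("observed queue=%s duration=%gs", queue, seconds)
}

func main() {
	fmt.Println(observeClassification(map[string]string{"armada_queue_id": "analytics"}, 0.002))
	fmt.Println(observeClassification(map[string]string{}, 0.002))
}
```

Skipping (rather than substituting a placeholder label value) keeps cardinality bounded by the set of real Armada queues, matching the stated design intent.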
Reviews (3): Last reviewed commit: "Add histogram metric for error classific..."
dejanzele force-pushed from e960aa2 to d8e0b15
Expose `armada_executor_error_classification_duration_seconds` histogram per queue to monitor the overhead of the error categorizer on the executor's failure-reporting path. The metric is observed inside `Classifier.Classify()`, covering both call sites (event reporter and pod issue handler) with a single instrumentation point. Pods without a queue label are skipped to avoid unbounded cardinality. Buckets are tuned for two observed regimes: simple condition/exit-code matching (sub-millisecond) and regex-based termination message matching (low milliseconds), with a tail up to 100 ms for anomaly detection.

Also adds example regex-based error categories to the local executor config for development and testing.

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
dejanzele force-pushed from d8e0b15 to 3c8b0ce
Summary
- Adds `armada_executor_error_classification_duration_seconds` Prometheus histogram to measure how long error categorization takes in the executor, labeled by `queue`
- Observed inside `Classifier.Classify()` to cover both call sites (event reporter and pod issue handler) with a single observation point

Motivation
The error categorizer runs on every failed pod in the executor. At scale, we need to monitor the overhead this adds to the failure-reporting path, especially as termination message regex rules grow more complex over time.
The histogram enables:
- `histogram_quantile(0.99, rate(..._bucket[5m]))` to track tail classification latency per queue
- `rate(..._sum[5m])` to track aggregate time spent on classification

Design decisions