OpenCost metrics interfere with OpenShift's "degraded control plane" detection? #249

Dear OpenCost maintainers,

Since last week, our OpenShift cluster has been showing a degradation warning claiming that only 50% of the API servers are responding.

This turned out to be related to metrics exposed by OpenCost: Prometheus scrapes them, and they are then returned by the query that drives this degradation check.

We explicitly disabled the emission of pod annotation, namespace annotation and KSM v1 metrics, and the warning disappeared:

```
  opencost:
    metrics:
      serviceMonitor:
        enabled: true
      kubeStateMetrics:
        emitPodAnnotations: false
        emitNamespaceAnnotations: false
        emitKsmV1Metrics: false
```
As a result, the following environment variables appeared in the OpenCost deployment:

```
        - name: EMIT_POD_ANNOTATIONS_METRIC
          value: 'false'
        - name: EMIT_NAMESPACE_ANNOTATIONS_METRIC
          value: 'false'
        - name: EMIT_KSM_V1_METRICS
          value: 'false'
```

I would like to see this added to the documentation that, IIRC, @mittal-ishaan was working on.

This is the query that returned the wrong result:

```
count(kube_pod_labels{label_app="openshift-kube-apiserver", label_apiserver="true", namespace="openshift-kube-apiserver" })
```

Before we applied the workaround described above, this query returned 6 pods, although only 3 were actually running. Hence the degradation warning: 3 out of 6 looks as if only 50% of the API servers are working.
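To check whether the extra results are a second set of `kube_pod_labels` series coming from another exporter, the matching series can be broken down by scrape job. The queries below are a sketch, assuming that the duplicate series differ only in target labels such as `job` or `instance` and that both sets carry the same `namespace` and `pod` labels; the actual job names depend on how kube-state-metrics and the OpenCost ServiceMonitor are set up. The second query is only an illustration of counting distinct pods instead of raw series; it is not the query OpenShift ships.

```
# Diagnostic: how many matching kube_pod_labels series each scrape job contributes
count by (job) (kube_pod_labels{label_app="openshift-kube-apiserver", label_apiserver="true", namespace="openshift-kube-apiserver"})

# Counts distinct (namespace, pod) pairs, so a second exporter of the same metric does not inflate the result
count(count by (namespace, pod) (kube_pod_labels{label_app="openshift-kube-apiserver", label_apiserver="true", namespace="openshift-kube-apiserver"}))
```

If the duplicates are the only cause, the first query should list more than one job, while the second should report the number of API server pods that are actually running.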

Kind Regards,
Johannes
