Prometheus for monitoring and alerting on a K8s cluster
We can use Prometheus as a complete monitoring and alerting system for our K8s cluster. You can see related K8s scripts at scripts/kubernetes/.
The image above should help you understand the different components that make up the alerting and monitoring system. Feel free to modify the configurations as you wish.
Alerts are defined as a set of rules, or conditions, which, when satisfied, cause Prometheus to fire an alert to the Alert Manager. Rules are written in a YAML configuration file that Prometheus reads. A number of clauses are defined in this configuration file to describe the alert triggers (a minimal example follows the list below):
- `groups` holds the different groups of alerts.
- `name` holds the name of the group.
- `rules` holds all the alerts belonging to the group.
- `alert` holds the name of the alert for each rule.
- `annotations` holds the alert description.
- `expr` holds a boolean expression; when it is satisfied, Prometheus fires the associated alert to the Alert Manager.
- `for` holds the duration the condition must remain true before Prometheus starts firing the alert.
- `labels` holds the various labels attached to every alert, which the Alert Manager uses for customized notification.
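As a rough sketch of how these clauses fit together, a rule file might look like the following (the group name, alert name, threshold, and label values here are purely illustrative and not taken from the repository's own configuration):

```yaml
groups:
  - name: example-group                # name of the group
    rules:
      - alert: InstanceDown            # name of the alert
        expr: up == 0                  # boolean expression Prometheus evaluates
        for: 5m                        # condition must hold this long before the alert fires
        labels:
          severity: critical           # labels used by the Alert Manager for routing/notification
        annotations:
          description: "{{ $labels.instance }} has been down for more than 5 minutes."
```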
expr holds the expression Prometheus evaluates to decide whether to fire an alert. We use metric names to fetch the various metric values, and the operations performed on them yield a boolean result.
For example, to fire an alert when a deployment has no running pods, we write an expression like the one below:
sum(kube_deployment_status_replicas{pod_template_hash=""}) by (deployment,namespace) < 1
Breaking down that expression: kube_deployment_status_replicas is a metric provided by the kube-state-metrics exporter that reports the number of pods currently running in a deployment. Its result looks something like the sample below:
kube_deployment_status_replicas{deployment="XXXX",instance="XXXX",job="XXXX",namespace="XXXX",scrape_endpoint="XXXX"} 1
We use the aggregation operator sum to collapse the data into a single value per group, use by (deployment, namespace) to keep only the labels we need, and then check whether the count is less than 1; if it is, the alert fires, since there are no running pods under that deployment.
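Putting that expression into a complete rule, a sketch of such an alert might look like the following (the group name, alert name, duration, labels, and annotation text are illustrative and not taken from the repository's own rule files):

```yaml
groups:
  - name: deployment-alerts
    rules:
      - alert: DeploymentHasNoRunningPods
        # fires when a deployment reports fewer than 1 running replica
        expr: sum(kube_deployment_status_replicas{pod_template_hash=""}) by (deployment,namespace) < 1
        for: 1m
        labels:
          severity: critical
        annotations:
          description: "Deployment {{ $labels.deployment }} in namespace {{ $labels.namespace }} has no running pods."
```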