Prometheus for monitoring and alerting on a K8s cluster
We can use Prometheus as a complete monitoring and alerting system for our K8s cluster. You can see related K8s scripts at scripts/kubernetes/.
The image above should help you understand the different components that make up the alerting and monitoring system. Feel free to modify the configurations as you wish.
Alerts are defined as a set of rules, or conditions, which, when satisfied, cause Prometheus to fire an alert to the Alert Manager. Rules are written in a YAML configuration file that Prometheus reads. A number of clauses are defined in this configuration file to describe the alert triggers (a minimal example follows the list below):
- `groups` holds the different groups of alerts.
- `name` holds the name of the group.
- `rules` holds all the alerts belonging to the group.
- `alert` holds the name of the alert for each rule.
- `annotations` holds the alert description.
- `expr` holds a boolean expression; when it is satisfied, Prometheus fires the associated alert to the Alert Manager.
- `for` holds the duration the condition must remain true before Prometheus starts firing the alert.
- `labels` holds the various labels attached to every alert, which the Alert Manager uses for customized notification.
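As a rough sketch of how these clauses fit together, a rule file might look like the following (the group name, alert name, threshold, and label values here are purely illustrative and not taken from the repository's own configuration):

```yaml
groups:
  - name: example-group                # name of the group
    rules:
      - alert: InstanceDown            # name of the alert
        expr: up == 0                  # boolean expression Prometheus evaluates
        for: 5m                        # condition must hold this long before the alert fires
        labels:
          severity: critical           # labels used by the Alert Manager for routing/notification
        annotations:
          description: "{{ $labels.instance }} has been down for more than 5 minutes."
```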
expr holds the expression Prometheus evaluates to decide whether to fire an alert. We use metric names to fetch the various metric values, and the operations performed on them yield a boolean result.
For example, to fire an alert when a deployment has no running pods, we write an expression like the one below:
sum(kube_deployment_status_replicas{pod_template_hash=""}) by (deployment,namespace) < 1
Breaking down that expression: kube_deployment_status_replicas is a metric provided by the kube-state-metrics exporter that reports the number of pods currently running in a deployment. Its result looks something like the sample below:
kube_deployment_status_replicas{deployment="XXXX",instance="XXXX",job="XXXX",namespace="XXXX",scrape_endpoint="XXXX"} 1
We use the aggregation operator sum to collapse the data into a single value per group, use by (deployment, namespace) to keep only the labels we need, and then check whether the count is less than 1; if it is, the alert fires, since there are no running pods under that deployment.
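Putting that expression into a complete rule, a sketch of such an alert might look like the following (the group name, alert name, duration, labels, and annotation text are illustrative and not taken from the repository's own rule files):

```yaml
groups:
  - name: deployment-alerts
    rules:
      - alert: DeploymentHasNoRunningPods
        # fires when a deployment reports fewer than 1 running replica
        expr: sum(kube_deployment_status_replicas{pod_template_hash=""}) by (deployment,namespace) < 1
        for: 1m
        labels:
          severity: critical
        annotations:
          description: "Deployment {{ $labels.deployment }} in namespace {{ $labels.namespace }} has no running pods."
```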