feat: add K8s SRE Incident Response dashboard#376

Open
infrawithshobhit wants to merge 1 commit into
SigNoz:mainfrom
infrawithshobhit:feat/k8s-sre-incident-response

@infrawithshobhit

Dashboard: K8s SRE Incident Response

A Kubernetes dashboard built for on-call triage, based on a real-life example.

Why this dashboard?
Most K8s dashboards show everything. During an active incident, that is noise.
This one surfaces 8 signals that answer two questions: what is broken right now, and why?

Panels included

  • Pod Restart Rate — early crash loop detection
  • Node CPU Pressure — scheduling risk threshold (warn: 80%, crit: 90%)
  • Node Memory Pressure — OOM risk (warn: 85%, crit: 95%)
  • Pending Pods — scheduling failure indicator
  • Container Restarts / OOMKills — memory leak signal
  • p99 Latency — primary SLI for SLO tracking
  • 5XX Error Rate — error budget burn rate
  • PVC Storage Usage — proactive storage incident prevention
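
To make the warn/crit thresholds on the node panels concrete, here is a minimal Python sketch of how a utilization percentage could map to a severity. The function name `classify`, the panel keys, and the choice to treat the thresholds as inclusive are assumptions for illustration; the dashboard JSON itself may encode them differently.

```python
from typing import Literal

Severity = Literal["ok", "warn", "crit"]

# (warn, crit) thresholds in percent, taken from the panel list above.
# The dictionary keys are hypothetical names, not identifiers from the dashboard.
THRESHOLDS: dict[str, tuple[float, float]] = {
    "node_cpu": (80.0, 90.0),
    "node_memory": (85.0, 95.0),
}


def classify(panel: str, percent: float) -> Severity:
    """Map a utilization percentage to a severity for the given panel.

    Assumption: a value exactly at a threshold already counts as breaching it.
    """
    warn, crit = THRESHOLDS[panel]
    if percent >= crit:
        return "crit"
    if percent >= warn:
        return "warn"
    return "ok"
```

For example, under these assumptions a node at 85% CPU would be `warn`, while the same 85% on memory only just reaches the memory warn threshold.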

Data source

OpenTelemetry Collector with Kubernetes receiver + kubelet metrics.
APM panels require OTel SDK instrumentation.

Variables

  • k8s_cluster_name — cluster selector
  • k8s_namespace_name — multi-select namespace filter
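
As a rough illustration of how the two variables narrow what the panels show, the sketch below filters a series by its labels. The helper `matches`, the use of the variable names as label keys, and the rule that an empty namespace selection means "all namespaces" are assumptions, not behavior confirmed by this PR.

```python
def matches(labels: dict[str, str], cluster: str, namespaces: list[str]) -> bool:
    """Return True if a series' labels pass the dashboard variable filters.

    Assumptions: label keys mirror the variable names, and an empty
    namespace selection is treated as "no namespace filter".
    """
    # Single-select: the cluster must match exactly.
    if labels.get("k8s_cluster_name") != cluster:
        return False
    # Multi-select: any of the chosen namespaces is accepted.
    if not namespaces:
        return True
    return labels.get("k8s_namespace_name") in namespaces
```

A series labeled with cluster `prod` and namespace `payments` would pass a filter of `cluster="prod"` and `namespaces=["payments", "checkout"]`, and would be dropped if the selected cluster were anything else.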
