Lightweight, event-driven Kubernetes incident response with AI-powered root cause analysis.
Event-driven by design: Watches Kubernetes events in real time. No manual scans, no external alerting required.
Minimal footprint: Tiny controller (~10-20MB) runs 24/7. Heavy analysis runs as on-demand Kubernetes Jobs.
Predictable costs
Controller: ~$0.50/month (always-on, minimal resources)
Analysis Jobs: $0.001 per incident (spawn → analyze → terminate)
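
As a rough worked example of the two figures above, monthly compute cost can be estimated as the fixed controller cost plus a per-incident charge. The function name and the 300-incident volume are illustrative assumptions, and LLM API charges are left out because the figures above do not say whether they include them.

```python
CONTROLLER_MONTHLY_USD = 0.50  # always-on controller, figure quoted above
COST_PER_INCIDENT_USD = 0.001  # one analysis Job per incident, figure quoted above


def estimate_monthly_cost(incidents_per_month: int) -> float:
    """Return the estimated monthly compute cost in USD."""
    if incidents_per_month < 0:
        raise ValueError("incidents_per_month must be non-negative")
    return CONTROLLER_MONTHLY_USD + incidents_per_month * COST_PER_INCIDENT_USD


# 300 incidents per month is a hypothetical volume, not a measured one.
print(f"${estimate_monthly_cost(300):.2f}")
```

At this scale the always-on controller dominates the bill; the per-incident Jobs only start to matter at thousands of incidents per month.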
Config-driven
events:
  crashLoopBackOff: true
  imagePullBackOff: true
  healthCheckFailure: false
llm:
  provider: gemini # or claude, openai

Event occurs → Controller detects → Spawns Job → Analyzes with LLM → Slack notification → Job terminates
- Controller: Watches K8s events, spawns Jobs (10m CPU, 20MB RAM)
- Analysis Job: Fetches logs, calls LLM, sends alerts (500m CPU, 512MB RAM, 30-60s)
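
The controller's part of this flow can be sketched as a small filter: look up the event's reason, check whether that event type is enabled in the config, and only then spawn a Job. This is a minimal sketch, not the project's actual code. The reason-to-key mapping and the `spawn_job` callback are assumptions made for illustration.

```python
# Assumed mapping from Kubernetes reason strings to the `events:` config keys.
REASON_TO_CONFIG_KEY = {
    "CrashLoopBackOff": "crashLoopBackOff",
    "ImagePullBackOff": "imagePullBackOff",
    "Unhealthy": "healthCheckFailure",  # reason used for failed probes
    "OOMKilled": "oomKilled",
}


def should_spawn_job(reason: str, events_config: dict) -> bool:
    """Return True if the reason maps to an event type enabled in the config."""
    key = REASON_TO_CONFIG_KEY.get(reason)
    return key is not None and bool(events_config.get(key, False))


def handle_event(reason: str, pod: str, events_config: dict, spawn_job) -> bool:
    """Call spawn_job(pod, reason) for enabled event types; report whether it ran."""
    if not should_spawn_job(reason, events_config):
        return False
    spawn_job(pod, reason)  # hypothetical callback that would create the analysis Job
    return True


events_config = {
    "crashLoopBackOff": True,
    "imagePullBackOff": True,
    "healthCheckFailure": False,
}
spawned = []
handle_event("CrashLoopBackOff", "web-1", events_config, lambda p, r: spawned.append((p, r)))
handle_event("Unhealthy", "web-2", events_config, lambda p, r: spawned.append((p, r)))
print(spawned)
```

Keeping the filter free of any cluster access makes it cheap to run in the always-on controller and easy to test without a cluster.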

Prerequisites:
- Kubernetes 1.24+
- Helm 3.8+
- LLM API key (Gemini, Claude, or OpenAI)
- Slack webhook URL (optional)
# Install with Gemini
helm install kube-ai-sre-agent oci://ghcr.io/adiii717/kube-ai-sre-agent \
--version 0.1.0 \
--set llm.provider=gemini \
--set llm.apiKey=YOUR_GEMINI_API_KEY \
--set slack.webhook=YOUR_SLACK_WEBHOOK
# Install with Claude
helm install kube-ai-sre-agent oci://ghcr.io/adiii717/kube-ai-sre-agent \
--version 0.1.0 \
--set llm.provider=claude \
--set llm.apiKey=YOUR_CLAUDE_API_KEY \
--set slack.webhook=YOUR_SLACK_WEBHOOK

Note for Apple Silicon (M1/M2) users: Published images are amd64 only. For local testing on arm64, build from source (see below).
git clone https://github.com/adiii717/kube-ai-sre-agent.git
cd kube-ai-sre-agent
helm install kube-ai-sre-agent ./helm/kube-ai-sre-agent \
--set llm.provider=gemini \
--set llm.apiKey=YOUR_API_KEY \
--set slack.webhook=YOUR_WEBHOOK

Create a values.yaml file:
# Enable/disable specific event types
events:
  crashLoopBackOff: true
  imagePullBackOff: true
  healthCheckFailure: true
  oomKilled: true
# LLM provider configuration
llm:
  provider: gemini # gemini, claude, or openai
  apiKey: "your-api-key"
# Slack notifications
slack:
  enabled: true
  webhook: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
# Resource limits
controller:
  resources:
    requests:
      cpu: 10m
      memory: 20Mi
    limits:
      cpu: 100m
      memory: 64Mi
analyzer:
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 1000m
      memory: 1Gi
# Alert deduplication and noise reduction
deduplication:
  # Don't re-analyze same incident within this window
  cooldownMinutes: 5
  # Smart escalation - silence noisy incidents
  escalation:
    enabled: true
    threshold: 10 # Silence after this many incidents
    silenceDurationMinutes: 60 # Silence for this duration

Install with custom values:
helm install kube-ai-sre-agent oci://ghcr.io/adiii717/kube-ai-sre-agent \
--version 0.1.0 \
-f values.yaml

The agent prevents alert noise through smart deduplication and escalation:
Default behavior (balanced):
deduplication:
  cooldownMinutes: 5
  escalation:
    enabled: true
    threshold: 10
    silenceDurationMinutes: 60
→ If a pod crashes 10 times within 5 minutes, it is silenced for 1 hour.
Aggressive (quick to silence):
deduplication:
  cooldownMinutes: 2
  escalation:
    threshold: 3
    silenceDurationMinutes: 30
→ If a pod crashes 3 times within 2 minutes, it is silenced for 30 minutes.
Conservative (tolerant of transient issues):
deduplication:
  cooldownMinutes: 10
  escalation:
    threshold: 20
    silenceDurationMinutes: 120
→ If a pod crashes 20 times within 10 minutes, it is silenced for 2 hours.
Disable escalation entirely:
deduplication:
  escalation:
    enabled: false
→ Only basic deduplication (no silencing)
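
To make the interaction between the cooldown and escalation settings concrete, here is a minimal simulation. It is not the agent's implementation, and it assumes one plausible reading of the settings: repeats of an incident within `cooldownMinutes` of the last analysis are deduplicated, and once `threshold` occurrences fall inside one cooldown window the incident is silenced for `silenceDurationMinutes`. The class and method names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class IncidentGate:
    cooldown_minutes: float = 5
    escalation_enabled: bool = True
    threshold: int = 10
    silence_minutes: float = 60
    _last_analyzed: dict = field(default_factory=dict)
    _recent: dict = field(default_factory=dict)
    _silenced_until: dict = field(default_factory=dict)

    def decide(self, key: str, now: float) -> str:
        """Return 'analyze', 'deduplicated', or 'silenced'; `now` is in minutes."""
        if now < self._silenced_until.get(key, float("-inf")):
            return "silenced"
        # Keep only occurrences that fall inside the current cooldown window.
        window = self._recent.setdefault(key, deque())
        window.append(now)
        while now - window[0] >= self.cooldown_minutes:
            window.popleft()
        if self.escalation_enabled and len(window) >= self.threshold:
            self._silenced_until[key] = now + self.silence_minutes
            window.clear()
            return "silenced"
        last = self._last_analyzed.get(key)
        if last is not None and now - last < self.cooldown_minutes:
            return "deduplicated"
        self._last_analyzed[key] = now
        return "analyze"


# The "aggressive" profile: 3 crashes within 2 minutes silence the pod for 30 minutes.
gate = IncidentGate(cooldown_minutes=2, threshold=3, silence_minutes=30)
decisions = [gate.decide("default/web-1", t) for t in (0, 0.5, 1.0, 10, 31.5)]
print(decisions)
```

Under these assumptions only the first and last events would trigger an analysis Job; the rest are dropped before any LLM call is made.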
Get AI-powered incident analysis delivered to your Slack channel:
1. Create a Slack Incoming Webhook:
   - Go to https://api.slack.com/apps
   - Create a new app or select an existing one
   - Enable "Incoming Webhooks" and create a webhook for your channel
   - Copy the webhook URL (looks like https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXX)
2. Install with Slack enabled:
helm install kube-ai-sre-agent oci://ghcr.io/adiii717/kube-ai-sre-agent \
--set llm.apiKey=YOUR_API_KEY \
--set slack.enabled=true \
--set slack.webhook=YOUR_SLACK_WEBHOOK_URL
3. Slack Notification Example:
You'll receive a formatted message with:
- 🚨 Incident alert header
- Pod name and event type
- AI-powered root cause analysis
- Immediate fix steps
- Prevention recommendations
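
A message with these parts could be assembled as a Slack incoming-webhook payload along the following lines. This is a sketch of one possible layout using Slack's Block Kit `header` and `section` blocks, not the agent's actual message format. The function name and all sample values are made up for illustration.

```python
import json


def build_slack_payload(pod, event_type, root_cause, fix_steps, prevention):
    """Build a Slack incoming-webhook payload with one block per message part."""
    def section(text):
        return {"type": "section", "text": {"type": "mrkdwn", "text": text}}

    fixes = "\n".join(f"{i}. {step}" for i, step in enumerate(fix_steps, 1))
    tips = "\n".join(f"- {tip}" for tip in prevention)
    return {
        # Fallback text, shown in notifications where blocks are not rendered.
        "text": f"Incident: {event_type} on {pod}",
        "blocks": [
            {"type": "header", "text": {"type": "plain_text", "text": "🚨 Incident alert"}},
            section(f"*Pod:* {pod}\n*Event:* {event_type}"),
            section(f"*Root cause:*\n{root_cause}"),
            section(f"*Immediate fix:*\n{fixes}"),
            section(f"*Prevention:*\n{tips}"),
        ],
    }


# Sample data below is invented purely to show the shape of the payload.
payload = build_slack_payload(
    pod="default/web-1",
    event_type="CrashLoopBackOff",
    root_cause="Container exits at startup because a required env var is missing.",
    fix_steps=["Add the missing env var to the Deployment", "Restart the rollout"],
    prevention=["Validate required configuration in CI"],
)
print(json.dumps(payload, indent=2, ensure_ascii=False))
```

The resulting dictionary is what would be sent as the JSON body of an HTTP POST to the webhook URL; the sketch stops short of sending it so it can run without network access.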
Disable Slack (analysis only):
helm install kube-ai-sre-agent ... --set slack.enabled=false

# Check controller is running
kubectl get pods -l app.kubernetes.io/name=kube-ai-sre-agent
# View logs
kubectl logs -l app.kubernetes.io/component=controller -f
# Watch for analysis jobs
kubectl get jobs -w

helm uninstall kube-ai-sre-agent

- Real-time CrashLoopBackOff detection
- ImagePullBackOff monitoring
- Health check failure alerts
- Multi-LLM support (Gemini, Claude, OpenAI)
- Slack notifications
- PagerDuty integration
- Custom event handlers
# Clone repo
git clone https://github.com/adiii717/kube-ai-sre-agent.git
cd kube-ai-sre-agent
# Build binaries
make build
# Build Docker images (uses your local arch - arm64 on M1/M2)
docker build -t ghcr.io/adiii717/kube-ai-sre-agent-controller:local -f Dockerfile.controller .
docker build -t ghcr.io/adiii717/kube-ai-sre-agent-analyzer:local -f Dockerfile.analyzer .
# Install with local images
helm install kube-ai-sre-agent ./helm/kube-ai-sre-agent \
--set controller.image.tag=local \
--set analyzer.image.tag=local \
--set llm.provider=gemini \
--set llm.apiKey=YOUR_KEY \
--set slack.enabled=false

License: MIT
