kube-ai-sre-agent

Lightweight, event-driven Kubernetes incident response with AI-powered root cause analysis.

What Makes This Different

Event-driven by design: Watches Kubernetes events in real time. No manual scans, no external alerting required.

Minimal footprint: Tiny controller (~10-20MB) runs 24/7. Heavy analysis runs as on-demand Kubernetes Jobs.

Predictable costs

Controller: ~$0.50/month (always-on, minimal resources)
Analysis Jobs: $0.001 per incident (spawn → analyze → terminate)
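
The two figures above combine into a simple monthly estimate. The sketch below is only arithmetic on the numbers quoted in this README; the helper name estimateMonthlyCostUSD is hypothetical and not part of the project, and real costs depend on your cluster pricing and LLM provider.

```go
package main

import "fmt"

// Figures quoted in this README; treat them as rough estimates.
const (
	controllerMonthlyUSD = 0.50  // always-on controller
	analysisJobUSD       = 0.001 // one analysis Job per incident
)

// estimateMonthlyCostUSD returns the approximate monthly cost for a given
// number of analyzed incidents. Hypothetical helper for illustration only.
func estimateMonthlyCostUSD(incidents int) float64 {
	if incidents < 0 {
		incidents = 0
	}
	return controllerMonthlyUSD + float64(incidents)*analysisJobUSD
}

func main() {
	for _, n := range []int{0, 100, 1000} {
		fmt.Printf("%5d incidents/month: ~$%.2f\n", n, estimateMonthlyCostUSD(n))
	}
}
```

At these rates the fixed controller cost dominates until a cluster produces several hundred incidents per month.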

Config-driven

events:
  crashLoopBackOff: true
  imagePullBackOff: true
  healthCheckFailure: false
llm:
  provider: gemini  # or claude, openai
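
One way to picture how these toggles could gate analysis is a small filter keyed on the Kubernetes reason string. This is a minimal sketch, not the project's code: the EventsConfig struct, the shouldAnalyze function, and the mapping from reasons to toggles are assumptions made for illustration. The reason strings themselves (CrashLoopBackOff, ImagePullBackOff, ErrImagePull, Unhealthy, OOMKilled) are ones Kubernetes reports.

```go
package main

import "fmt"

// EventsConfig mirrors the events: block of the chart values.
// Hypothetical type; the real controller may model this differently.
type EventsConfig struct {
	CrashLoopBackOff   bool
	ImagePullBackOff   bool
	HealthCheckFailure bool
	OOMKilled          bool
}

// shouldAnalyze reports whether an incident with the given Kubernetes
// reason is enabled in the config. The reason-to-toggle mapping is an
// assumption; unknown reasons are ignored.
func shouldAnalyze(cfg EventsConfig, reason string) bool {
	switch reason {
	case "CrashLoopBackOff":
		return cfg.CrashLoopBackOff
	case "ImagePullBackOff", "ErrImagePull":
		return cfg.ImagePullBackOff
	case "Unhealthy": // reason on failed liveness/readiness probe events
		return cfg.HealthCheckFailure
	case "OOMKilled":
		return cfg.OOMKilled
	default:
		return false
	}
}

func main() {
	// Same toggles as the snippet above: health check failures disabled.
	cfg := EventsConfig{CrashLoopBackOff: true, ImagePullBackOff: true}
	for _, r := range []string{"CrashLoopBackOff", "Unhealthy", "Scheduled"} {
		fmt.Printf("%-18s analyze=%v\n", r, shouldAnalyze(cfg, r))
	}
}
```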

Architecture

Event occurs → Controller detects → Spawns Job → Analyzes with LLM → Slack notification → Job terminates

Controller: Watches K8s events, spawns Jobs (10m CPU, 20MB RAM)
Analysis Job: Fetches logs, calls LLM, sends alerts (500m CPU, 512MB RAM, 30-60s)

Installation

Prerequisites

Kubernetes 1.24+
Helm 3.8+
LLM API key (Gemini, Claude, or OpenAI)
Slack webhook URL (optional)

Install from OCI Registry

# Install with Gemini
helm install kube-ai-sre-agent oci://ghcr.io/adiii717/kube-ai-sre-agent \
  --version 0.1.0 \
  --set llm.provider=gemini \
  --set llm.apiKey=YOUR_GEMINI_API_KEY \
  --set slack.webhook=YOUR_SLACK_WEBHOOK

# Install with Claude
helm install kube-ai-sre-agent oci://ghcr.io/adiii717/kube-ai-sre-agent \
  --version 0.1.0 \
  --set llm.provider=claude \
  --set llm.apiKey=YOUR_CLAUDE_API_KEY \
  --set slack.webhook=YOUR_SLACK_WEBHOOK

Note for Apple Silicon (M1/M2) users: Published images are amd64 only. For local testing on arm64, build from source (see below).

Install from Source

git clone https://github.com/adiii717/kube-ai-sre-agent.git
cd kube-ai-sre-agent

helm install kube-ai-sre-agent ./helm/kube-ai-sre-agent \
  --set llm.provider=gemini \
  --set llm.apiKey=YOUR_API_KEY \
  --set slack.webhook=YOUR_WEBHOOK

Configuration

Create a values.yaml file:

# Enable/disable specific event types
events:
  crashLoopBackOff: true
  imagePullBackOff: true
  healthCheckFailure: true
  oomKilled: true

# LLM provider configuration
llm:
  provider: gemini  # gemini, claude, or openai
  apiKey: "your-api-key"

# Slack notifications
slack:
  enabled: true
  webhook: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

# Resource limits
controller:
  resources:
    requests:
      cpu: 10m
      memory: 20Mi
    limits:
      cpu: 100m
      memory: 64Mi

analyzer:
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 1000m
      memory: 1Gi

# Alert deduplication and noise reduction
deduplication:
  # Don't re-analyze same incident within this window
  cooldownMinutes: 5

  # Smart escalation - silence noisy incidents
  escalation:
    enabled: true
    threshold: 10           # Silence after this many incidents
    silenceDurationMinutes: 60  # Silence for this duration

Install with custom values:

helm install kube-ai-sre-agent oci://ghcr.io/adiii717/kube-ai-sre-agent \
  --version 0.1.0 \
  -f values.yaml

Alert Deduplication & Escalation

The agent prevents alert noise through smart deduplication and escalation:

Default behavior (balanced):

deduplication:
  cooldownMinutes: 5
  escalation:
    enabled: true
    threshold: 10
    silenceDurationMinutes: 60

→ If a pod crashes 10 times within 5 minutes, silence it for 1 hour

Aggressive (quick to silence):

deduplication:
  cooldownMinutes: 2
  escalation:
    threshold: 3
    silenceDurationMinutes: 30

→ If a pod crashes 3 times within 2 minutes, silence it for 30 minutes

Conservative (tolerant of transient issues):

deduplication:
  cooldownMinutes: 10
  escalation:
    threshold: 20
    silenceDurationMinutes: 120

→ If a pod crashes 20 times within 10 minutes, silence it for 2 hours

Disable escalation entirely:

deduplication:
  escalation:
    enabled: false

→ Only basic deduplication (no silencing)

Slack Notifications

Get AI-powered incident analysis delivered to your Slack channel:

1. Create a Slack Incoming Webhook:

Go to https://api.slack.com/apps
Create a new app or select an existing one
Enable "Incoming Webhooks" and create a webhook for your channel
Copy the webhook URL (looks like https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXX)

2. Install with Slack enabled:

helm install kube-ai-sre-agent oci://ghcr.io/adiii717/kube-ai-sre-agent \
  --set llm.apiKey=YOUR_API_KEY \
  --set slack.enabled=true \
  --set slack.webhook=YOUR_SLACK_WEBHOOK_URL

3. Slack Notification Example:

You'll receive a formatted message with:

🚨 Incident alert header
Pod name and event type
AI-powered root cause analysis
Immediate fix steps
Prevention recommendations

Disable Slack (analysis only):

helm install kube-ai-sre-agent ... --set slack.enabled=false

Verify Installation

# Check controller is running
kubectl get pods -l app.kubernetes.io/name=kube-ai-sre-agent

# View logs
kubectl logs -l app.kubernetes.io/component=controller -f

# Watch for analysis jobs
kubectl get jobs -w

Uninstall

helm uninstall kube-ai-sre-agent

Features

Development

Build Locally (Mac/Linux)

# Clone repo
git clone https://github.com/adiii717/kube-ai-sre-agent.git
cd kube-ai-sre-agent

# Build binaries
make build

# Build Docker images (uses your local arch - arm64 on M1/M2)
docker build -t ghcr.io/adiii717/kube-ai-sre-agent-controller:local -f Dockerfile.controller .
docker build -t ghcr.io/adiii717/kube-ai-sre-agent-analyzer:local -f Dockerfile.analyzer .

# Install with local images
helm install kube-ai-sre-agent ./helm/kube-ai-sre-agent \
  --set controller.image.tag=local \
  --set analyzer.image.tag=local \
  --set llm.provider=gemini \
  --set llm.apiKey=YOUR_KEY \
  --set slack.enabled=false

License

MIT
