AI-Powered Kubernetes Monitoring & Self-Healing
While getting started with Kubernetes, my pods kept crashing and I had no idea why. Reading logs, describing pods, googling errors — it was a lot of noise for what should've been a simple answer.
So I built a system that watches your cluster, tells you exactly what's wrong and why, and restarts what it can — without you having to dig through kubectl output at 2am.
Turns out this problem doesn't go away when you scale up. OpsAgent is for anyone running k8s who'd rather get a Slack message that says "pod X crashed because of OOM, restarted successfully" than wake up to a down system.
OpsAgent runs inside your cluster as a background worker. Every 60 seconds it:
1. polls your cluster for pod health (prometheus + k8s API)
2. detects failures and crash loops (status=Failed OR restarts > 5)
3. analyzes cluster state (diagnoses what went wrong and why)
4. alerts you on Slack with full context (what broke, why, what it's doing)
5. auto-heals if enabled (restarts the pod, logs the action)
6. exposes traces and metrics (otel + jaeger + /metrics endpoint)
No dashboards to configure. No alert rules to write. It just works.
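
The detection rule in step 2 (status is Failed, or more than 5 restarts) can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not the code in `worker.py`. `PodSnapshot`, `needs_attention`, and `find_unhealthy` are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Threshold taken from the rule above: flag a pod when restarts > 5.
RESTART_THRESHOLD = 5


@dataclass
class PodSnapshot:
    """Hypothetical, simplified view of one pod at poll time."""
    name: str
    phase: str      # e.g. "Running", "Pending", "Failed"
    restarts: int   # total container restarts reported for the pod


def needs_attention(pod: PodSnapshot) -> bool:
    """Return True if the pod has failed or is restarting too often."""
    return pod.phase == "Failed" or pod.restarts > RESTART_THRESHOLD


def find_unhealthy(pods: Iterable[PodSnapshot]) -> List[PodSnapshot]:
    """Keep only the pods that match the detection rule, in input order."""
    return [pod for pod in pods if needs_attention(pod)]
```

Note that the comparison is strict, so a pod with exactly 5 restarts is not flagged yet.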
| Component | Role |
|---|---|
| `worker.py` | Background orchestrator, runs the monitoring loop |
| Prometheus | Continuously scrapes k8s metrics |
| OpenTelemetry | Traces internal operations (AI latency, heal time) |
| Jaeger | Visualizes OTel traces |
| Groq / Llama 3.3-70B | Cluster diagnosis and healing decisions |
| Slack | Alerts maintainers with context |
| Helm | Packages and deploys OpsAgent into your cluster |
- Kubernetes cluster (minikube, EKS, GKE, or any)
- Helm 3 installed
- Groq API key → console.groq.com
- Slack webhook URL → api.slack.com/messaging/webhooks
helm install opsagent ./charts/opsagent \
--set env.groqApiKey=your_groq_key \
--set env.slackWebhookUrl=your_slack_webhook

That's it. OpsAgent starts monitoring your cluster immediately.

helm install opsagent ./charts/opsagent \
--set env.groqApiKey=your_groq_key \
--set env.slackWebhookUrl=your_slack_webhook \
--set env.autoHealEnabled=true

| Variable | Description | Default |
|---|---|---|
| `env.groqApiKey` | Groq API key for AI analysis | `""` |
| `env.slackWebhookUrl` | Slack webhook for alerts | `""` |
| `env.autoHealEnabled` | Auto-restart failed pods | `false` |
| `replicaCount` | Number of OpsAgent replicas | `1` |
| `image.tag` | OpsAgent image tag | `latest` |
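
For a rough idea of how values like these might reach the Python process, here is a minimal sketch that reads them from environment variables. It assumes the chart passes the `env.*` values through as environment variables. `GROQ_API_KEY` and `SLACK_WEBHOOK_URL` are the names used in the `.env` example in this README, while `AUTO_HEAL_ENABLED`, `Settings`, and `load_settings` are hypothetical names picked for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Strings treated as "enabled"; anything else counts as disabled.
_TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    groq_api_key: str
    slack_webhook_url: str
    auto_heal_enabled: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping, defaulting to the process environment."""
    env = os.environ if env is None else env
    raw_heal = env.get("AUTO_HEAL_ENABLED", "false")  # hypothetical variable name
    return Settings(
        groq_api_key=env.get("GROQ_API_KEY", ""),
        slack_webhook_url=env.get("SLACK_WEBHOOK_URL", ""),
        auto_heal_enabled=raw_heal.strip().lower() in _TRUTHY,
    )
```

The defaults mirror the table: empty strings for the two secrets and auto-heal off unless it is explicitly enabled.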
- Pod status (Failed, Pending, CrashLoopBackOff)
- Restart counts (alerts if restarts > 5)
- K8s cluster connectivity
- AI analysis latency
- Slack alert delivery
- Auto-heal success/failure history
| Endpoint | Description |
|---|---|
| `GET /health` | OpsAgent + k8s connection status |
| `GET /metrics` | Prometheus metrics |
| `GET /telemetry` | OTel spans (in-memory dashboard) |
| `GET /settings` | Current config (auto-heal, poll interval) |
| `POST /settings/toggle-heal` | Toggle auto-healing on/off |
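
If you want to call these endpoints from a script, a small standard-library client could look like the sketch below. It assumes `/health`, `/settings`, and `/settings/toggle-heal` return JSON, which this README does not confirm, and the response fields are not documented here, so the helpers just return whatever was parsed. `endpoint_url`, `get_json`, and `toggle_heal` are hypothetical helper names.

```python
import json
from typing import Any
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def endpoint_url(base_url: str, path: str) -> str:
    """Join a base URL and an endpoint path without doubling slashes."""
    return urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))


def get_json(base_url: str, path: str, timeout: float = 5.0) -> Any:
    """GET an endpoint and parse the body as JSON (assumed response format)."""
    with urlopen(endpoint_url(base_url, path), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def toggle_heal(base_url: str, timeout: float = 5.0) -> Any:
    """POST to /settings/toggle-heal with an empty body and parse the reply."""
    req = Request(
        endpoint_url(base_url, "/settings/toggle-heal"),
        data=b"",
        method="POST",
    )
    with urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the API reachable at some base URL, `get_json(base_url, "/health")` fetches the health status and `toggle_heal(base_url)` flips auto-healing. `/metrics` is left out on purpose because Prometheus metrics are served as plain text, not JSON.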
| | OpsAgent | Datadog | k8sgpt |
|---|---|---|---|
| Always-on monitoring | ✅ | ✅ | ❌ (CLI, manual) |
| Root cause diagnosis | ✅ | ❌ | ✅ |
| Auto-healing | ✅ | ❌ | ❌ |
| Slack alerts | ✅ | ✅ | ❌ |
| Free / open source | ✅ | ❌ | ✅ |
| Self-hosted | ✅ | ❌ | ✅ |
- k8sgpt integration (deeper analysis across 13+ resource types — pods, PVCs, ingress, nodes and more)
- Expand healing actions (scale deployments, delete stuck jobs, drain nodes)
- Node and deployment analysis (beyond pods)
- Multi-cluster support
- OpsAgent Cloud (hosted version, connect your cluster in 2 mins)
PRs welcome. If you run OpsAgent on a real cluster and find something broken or missing, open an issue.
AI-powered Kubernetes monitoring and auto-healing platform. OpsAgent watches your cluster, diagnoses issues using LLMs, and heals crashed pods automatically — with Slack alerts delivered in real time.
Live Demo → opsagent-five.vercel.app
OpsAgent continuously monitors your Kubernetes cluster and uses the Llama 3.3 70B model served by Groq to diagnose pod failures. When a pod crashes, OpsAgent restarts it automatically and sends a Slack notification, with no manual intervention needed.
- 🔍 Real-time pod monitoring via Kubernetes API
- 🧠 AI diagnostics powered by Groq / Llama 3.3 70B
- 🔧 Auto-healing — detects crashed pods and restarts them automatically
- 💬 Slack alerts on every heal action
- 📊 React dashboard with live event stream and cluster health overview
- 🖥️ Textual TUI for terminal-based monitoring
| Layer | Tech |
|---|---|
| Backend | Python, FastAPI |
| AI | Groq API (llama-3.3-70b-versatile) |
| Kubernetes | minikube, kubectl, Python k8s client |
| Frontend | React, Tailwind CSS, Framer Motion |
| Notifications | Slack Webhooks |
| Containerization | Docker |
- Python 3.10+
- Node.js 18+
- Docker Desktop
- minikube
- A Groq API key → console.groq.com
- A Slack Webhook URL → Slack Incoming Webhooks
git clone https://github.com/shayannab/opsagent.git
cd opsagent

Create a `.env` file in the `backend/` directory:

GROQ_API_KEY=your_groq_api_key
SLACK_WEBHOOK_URL=your_slack_webhook_url

minikube start

cd backend
pip install -r requirements.txt
uvicorn main:app --reload --port 8000

cd frontend
npm install
npm run dev

Frontend runs at http://localhost:3000

- OpsAgent polls your minikube cluster every few seconds
- If a pod enters `CrashLoopBackOff` or `Failed` state, it's flagged
- Groq / Llama 3 diagnoses the failure and generates a summary
- OpsAgent restarts the pod via the Kubernetes API
- A Slack notification is sent with pod name, status, and AI diagnosis
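
The last step, the Slack notification, can be sketched as follows. Slack incoming webhooks accept a JSON body with a `text` field, and that is all this example relies on. The message layout and the `build_alert` and `send_alert` helper names are assumptions for illustration, not OpsAgent's actual format.

```python
import json
from urllib.request import Request, urlopen


def build_alert(pod_name: str, status: str, diagnosis: str) -> dict:
    """Build an incoming-webhook payload with pod name, status, and diagnosis."""
    text = (
        f"*Pod:* {pod_name}\n"
        f"*Status:* {status}\n"
        f"*Diagnosis:* {diagnosis}"
    )
    return {"text": text}


def send_alert(webhook_url: str, payload: dict, timeout: float = 5.0) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    req = Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Keeping payload construction separate from the HTTP call makes the message format easy to test without hitting Slack.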
- Go to api.slack.com/apps and create an app
- Enable Incoming Webhooks and add it to your workspace
- Copy the webhook URL
- Paste it as `SLACK_WEBHOOK_URL` in your `.env` file
opsagent/
├── frontend/ # React App
│ └── src/
├── charts/ # Helm charts
├── models/ # Data models
├── routes/ # FastAPI route handlers
├── services/ # Business logic & integrations
├── tests/ # Test suite
├── main.py # FastAPI app entrypoint
├── worker.py # Background worker
├── start.py # Startup script
├── Dockerfile
├── requirements.txt
└── README.md
Built by Shayanna
MIT
