OpsAgent 🤖

AI-Powered Kubernetes Monitoring & Self-Healing

GitHub | Live Link


While getting started with Kubernetes, my pods kept crashing and I had no idea why. Reading logs, describing pods, googling errors — it was a lot of noise for what should've been a simple answer.

So I built a system that watches your cluster, tells you exactly what's wrong and why, and restarts what it can — without you having to dig through kubectl output at 2am.

Turns out this problem doesn't go away when you scale up. OpsAgent is for anyone running k8s who'd rather get a Slack message that says "pod X crashed because of OOM, restarted successfully" than wake up to a down system.


How it works

OpsAgent runs inside your cluster as a background worker. Every 60 seconds it:

1. polls your cluster for pod health        (prometheus + k8s API)
2. detects failures and crash loops         (status=Failed OR restarts > 5)
3. analyzes cluster state                   (diagnoses what went wrong and why)
4. alerts you on Slack with full context    (what broke, why, what it's doing)
5. auto-heals if enabled                    (restarts the pod, logs the action)
6. exposes traces and metrics               (otel + jaeger + /metrics endpoint)

No dashboards to configure. No alert rules to write. It just works.
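
The detection rule in step 2 can be sketched in a few lines of Python. This is only an illustration of the stated rule (status=Failed OR restarts > 5). `PodSnapshot`, `is_unhealthy`, and `find_unhealthy` are hypothetical names, not OpsAgent's actual code, and the real worker reads pod state from the Kubernetes API rather than from hand-built objects.

```python
from dataclasses import dataclass

RESTART_THRESHOLD = 5  # taken from the rule above: restarts > 5


@dataclass
class PodSnapshot:
    """Hypothetical, simplified view of one pod (not OpsAgent's real model)."""
    name: str
    phase: str      # Kubernetes pod phase, e.g. "Running", "Pending", "Failed"
    restarts: int   # total container restart count


def is_unhealthy(pod: PodSnapshot, threshold: int = RESTART_THRESHOLD) -> bool:
    """Apply the rule from step 2: status=Failed OR restarts > threshold."""
    return pod.phase == "Failed" or pod.restarts > threshold


def find_unhealthy(pods: list[PodSnapshot]) -> list[str]:
    """Return the names of pods that should be diagnosed and alerted on."""
    return [p.name for p in pods if is_unhealthy(p)]
```

Note that the comparison is strict, so a pod with exactly 5 restarts is not flagged under this reading of the rule.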


Stack

| Component | Role |
| --- | --- |
| worker.py | Background orchestrator, runs the monitoring loop |
| Prometheus | Continuously scrapes k8s metrics |
| OpenTelemetry | Traces internal operations (AI latency, heal time) |
| Jaeger | Visualizes OTel traces |
| Groq / Llama 3.3-70B | Cluster diagnosis and healing decisions |
| Slack | Alerts maintainers with context |
| Helm | Packages and deploys OpsAgent into your cluster |

Prerequisites


Install

helm install opsagent ./charts/opsagent \
  --set env.groqApiKey=your_groq_key \
  --set env.slackWebhookUrl=your_slack_webhook

That's it. OpsAgent starts monitoring your cluster immediately.

Optional: enable auto-healing

helm install opsagent ./charts/opsagent \
  --set env.groqApiKey=your_groq_key \
  --set env.slackWebhookUrl=your_slack_webhook \
  --set env.autoHealEnabled=true

Configuration

| Variable | Description | Default |
| --- | --- | --- |
| env.groqApiKey | Groq API key for AI analysis | "" |
| env.slackWebhookUrl | Slack webhook for alerts | "" |
| env.autoHealEnabled | Auto-restart failed pods | false |
| replicaCount | Number of OpsAgent replicas | 1 |
| image.tag | OpsAgent image tag | latest |

What it monitors

  • Pod status (Failed, Pending, CrashLoopBackOff)
  • Restart counts (alerts if restarts > 5)
  • K8s cluster connectivity
  • AI analysis latency
  • Slack alert delivery
  • Auto-heal success/failure history

Endpoints

| Endpoint | Description |
| --- | --- |
| GET /health | OpsAgent + k8s connection status |
| GET /metrics | Prometheus metrics |
| GET /telemetry | OTel spans (in-memory dashboard) |
| GET /settings | Current config (auto-heal, poll interval) |
| POST /settings/toggle-heal | Toggle auto-healing on/off |
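
A minimal client sketch for these endpoints, using only the Python standard library. It assumes the API is reachable at `http://localhost:8000` (for example through a port-forward) and that `/health`, `/settings`, and `/settings/toggle-heal` return JSON; neither is confirmed here, so adjust both to your deployment. The helper names are hypothetical. `/metrics` normally serves Prometheus text format, not JSON, so it is left out.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: where the OpsAgent API is reachable


def build_toggle_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request for /settings/toggle-heal (empty body)."""
    return urllib.request.Request(
        base_url + "/settings/toggle-heal", data=b"", method="POST"
    )


def call_json(target, timeout: float = 5.0) -> dict:
    """Send a request (URL string or Request object) and decode a JSON reply.

    Assumes the endpoint answers with a JSON object.
    """
    with urllib.request.urlopen(target, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example usage against a running instance (not executed here):
#   call_json(BASE_URL + "/health")
#   call_json(BASE_URL + "/settings")
#   call_json(build_toggle_request())
```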

vs other tools

OpsAgent Datadog k8sgpt
Always-on monitoring ❌ (CLI, manual)
Root cause diagnosis
Auto-healing
Slack alerts
Free / open source
Self-hosted

Roadmap

  • k8sgpt integration (deeper analysis across 13+ resource types — pods, PVCs, ingress, nodes and more)
  • Expand healing actions (scale deployments, delete stuck jobs, drain nodes)
  • Node and deployment analysis (beyond pods)
  • Multi-cluster support
  • OpsAgent Cloud (hosted version, connect your cluster in 2 mins)

Contributing

PRs welcome. If you run OpsAgent on a real cluster and find something broken or missing, open an issue.


Built with FastAPI, Groq, Prometheus, OpenTelemetry, and Helm.

OpsAgent 🤖

AI-powered Kubernetes monitoring and auto-healing platform. OpsAgent watches your cluster, diagnoses issues using LLMs, and heals crashed pods automatically — with Slack alerts delivered in real time.

Live Demo → opsagent-five.vercel.app


What it does

OpsAgent continuously monitors your Kubernetes cluster and uses the Llama 3.3 70B model, served through Groq, to diagnose pod failures. When a pod crashes, OpsAgent restarts it automatically and sends a Slack notification, so no manual intervention is needed.

  • 🔍 Real-time pod monitoring via Kubernetes API
  • 🧠 AI diagnostics powered by Groq / Llama 3.3 70B
  • 🔧 Auto-healing — detects crashed pods and restarts them automatically
  • 💬 Slack alerts on every heal action
  • 📊 React dashboard with live event stream and cluster health overview
  • 🖥️ Textual TUI for terminal-based monitoring

Tech Stack

| Layer | Tech |
| --- | --- |
| Backend | Python, FastAPI |
| AI | Groq API (llama-3.3-70b-versatile) |
| Kubernetes | minikube, kubectl, Python k8s client |
| Frontend | React, Tailwind CSS, Framer Motion |
| Notifications | Slack Webhooks |
| Containerization | Docker |

Getting Started

Prerequisites

1. Clone the repo

git clone https://github.com/shayannab/opsagent.git
cd opsagent

2. Set up environment variables

Create a .env file in the backend/ directory:

GROQ_API_KEY=your_groq_api_key
SLACK_WEBHOOK_URL=your_slack_webhook_url
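
As a rough sketch of how a backend might read these two variables at startup: the `Settings` class and `load_settings` function below are hypothetical, and `AUTO_HEAL_ENABLED` is an assumed variable name that does not appear in this README. The sketch also assumes the values are already in the process environment. A `.env` file is not loaded automatically by Python, so a loader such as python-dotenv (or your shell) has to export it first.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container (not OpsAgent's real config class)."""
    groq_api_key: str
    slack_webhook_url: str
    auto_heal_enabled: bool


def load_settings(env=os.environ) -> Settings:
    """Read required keys from a mapping and fail fast if any is missing."""
    required = ("GROQ_API_KEY", "SLACK_WEBHOOK_URL")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    # AUTO_HEAL_ENABLED is an assumed name; it defaults to off.
    raw = env.get("AUTO_HEAL_ENABLED", "false")
    auto_heal = raw.strip().lower() in {"1", "true", "yes"}
    return Settings(env["GROQ_API_KEY"], env["SLACK_WEBHOOK_URL"], auto_heal)
```

Failing fast on a missing key tends to be easier to debug than a worker that starts and then fails on its first AI call or alert.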

3. Start minikube

minikube start

4. Run the backend

cd backend
pip install -r requirements.txt
uvicorn main:app --reload --port 8000

5. Run the frontend

cd frontend
npm install
npm run dev

Frontend runs at http://localhost:3000


How auto-healing works

  1. OpsAgent polls your minikube cluster every few seconds
  2. If a pod enters CrashLoopBackOff or Failed state, it's flagged
  3. Llama 3.3 70B (via Groq) diagnoses the failure and generates a summary
  4. OpsAgent restarts the pod via the Kubernetes API
  5. A Slack notification is sent with pod name, status, and AI diagnosis
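
The five steps above can be sketched as one function with the external calls injected. This is an illustration of the ordering only (flag, diagnose, restart, notify). `heal_pod` and its callback parameters are hypothetical, and the real worker talks to the Kubernetes API, Groq, and Slack instead of plain callables.

```python
from typing import Callable

# States that get a pod flagged, per step 2 above.
FLAGGED_STATES = {"CrashLoopBackOff", "Failed"}


def heal_pod(
    name: str,
    status: str,
    diagnose: Callable[[str, str], str],
    restart: Callable[[str], None],
    notify: Callable[[str], None],
) -> bool:
    """Run steps 2-5 for one pod. Returns True if a heal was attempted."""
    if status not in FLAGGED_STATES:
        return False                      # step 2: healthy pods are skipped
    summary = diagnose(name, status)      # step 3: AI-generated summary
    restart(name)                         # step 4: restart via the k8s API
    notify(                               # step 5: Slack message with context
        f"Pod {name} was in state {status}. Diagnosis: {summary}. Restarted."
    )
    return True
```

Passing the three actions in as callables keeps the ordering testable without a cluster, an API key, or a webhook.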

Slack Setup

  1. Go to api.slack.com/apps and create an app
  2. Enable Incoming Webhooks and add it to your workspace
  3. Copy the webhook URL
  4. Paste it as SLACK_WEBHOOK_URL in your .env file
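
To check that the webhook works before starting OpsAgent, you can post a test message yourself. Slack incoming webhooks accept a JSON body with a `text` field. The helper names below are hypothetical and this is a standalone sketch, not the code OpsAgent uses to send alerts.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build a POST request carrying a plain-text Slack message."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_message(webhook_url: str, text: str, timeout: float = 5.0) -> int:
    """Send the message and return the HTTP status code."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status


# Example usage (needs a real webhook URL, so it is not executed here):
#   import os
#   send_slack_message(os.environ["SLACK_WEBHOOK_URL"], "OpsAgent webhook test")
```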

Project Structure

opsagent/
├── frontend/          # React App
│   └── src/
├── charts/            # Helm charts
├── models/            # Data models
├── routes/            # FastAPI route handlers
├── services/          # Business logic & integrations
├── tests/             # Test suite
├── main.py            # FastAPI app entrypoint
├── worker.py          # Background worker
├── start.py           # Startup script
├── Dockerfile
├── requirements.txt
└── README.md

Author

Built by Shayanna


License

MIT
