OpsAgent 🤖

AI-Powered Kubernetes Monitoring & Self-Healing

While getting started with Kubernetes, my pods kept crashing and I had no idea why. Reading logs, describing pods, googling errors — it was a lot of noise for what should've been a simple answer.

So I built a system that watches your cluster, tells you exactly what's wrong and why, and restarts what it can — without you having to dig through kubectl output at 2am.

Turns out this problem doesn't go away when you scale up. OpsAgent is for anyone running k8s who'd rather get a Slack message that says "pod X crashed because of OOM, restarted successfully" than wake up to a down system.

How it works

OpsAgent runs inside your cluster as a background worker. Every 60 seconds it:

1. polls your cluster for pod health        (prometheus + k8s API)
2. detects failures and crash loops         (status=Failed OR restarts > 5)
3. analyzes cluster state                   (diagnoses what went wrong and why)
4. alerts you on Slack with full context    (what broke, why, what it's doing)
5. auto-heals if enabled                    (restarts the pod, logs the action)
6. exposes traces and metrics               (otel + jaeger + /metrics endpoint)

No dashboards to configure. No alert rules to write. It just works.
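The detection rule in step 2 can be illustrated with a minimal sketch. This is not OpsAgent's actual code: `PodSnapshot`, `needs_attention`, and `find_unhealthy` are hypothetical names, and the pod data is hard-coded rather than fetched from Prometheus or the Kubernetes API. Only the rule itself (status=Failed OR restarts > 5) comes from the list above.

```python
from dataclasses import dataclass

# Threshold taken from the rule above (restarts > 5).
RESTART_THRESHOLD = 5


@dataclass
class PodSnapshot:
    """Hypothetical, simplified view of one pod's state."""
    name: str
    phase: str      # e.g. "Running", "Pending", "Failed"
    restarts: int   # total container restarts


def needs_attention(pod: PodSnapshot) -> bool:
    """Apply the rule from step 2: status=Failed OR restarts > 5."""
    return pod.phase == "Failed" or pod.restarts > RESTART_THRESHOLD


def find_unhealthy(pods: list[PodSnapshot]) -> list[str]:
    """Return the names of the pods that the rule flags."""
    return [p.name for p in pods if needs_attention(p)]


if __name__ == "__main__":
    pods = [
        PodSnapshot("api", "Running", 0),
        PodSnapshot("worker", "Running", 6),
        PodSnapshot("batch", "Failed", 1),
    ]
    print(find_unhealthy(pods))  # ['worker', 'batch']
```

Note that a pod with exactly 5 restarts is not flagged, since the rule uses a strict comparison.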

Stack

Component	Role
`worker.py`	Background orchestrator, runs the monitoring loop
Prometheus	Continuously scrapes k8s metrics
OpenTelemetry	Traces internal operations (AI latency, heal time)
Jaeger	Visualizes OTel traces
Groq / Llama 3.3-70B	Cluster diagnosis and healing decisions
Slack	Alerts maintainers with context
Helm	Packages and deploys OpsAgent into your cluster

Prerequisites

Kubernetes cluster (minikube, EKS, GKE, or any)
Helm 3 installed
Groq API key → console.groq.com
Slack webhook URL → api.slack.com/messaging/webhooks

Install

helm install opsagent ./charts/opsagent \
  --set env.groqApiKey=your_groq_key \
  --set env.slackWebhookUrl=your_slack_webhook

That's it. OpsAgent starts monitoring your cluster immediately.

Optional: enable auto-healing

helm upgrade --install opsagent ./charts/opsagent \
  --set env.groqApiKey=your_groq_key \
  --set env.slackWebhookUrl=your_slack_webhook \
  --set env.autoHealEnabled=true

Configuration

Variable	Description	Default
`env.groqApiKey`	Groq API key for AI analysis	`""`
`env.slackWebhookUrl`	Slack webhook for alerts	`""`
`env.autoHealEnabled`	Auto-restart failed pods	`false`
`replicaCount`	Number of OpsAgent replicas	`1`
`image.tag`	OpsAgent image tag	`latest`

What it monitors

Pod status (Failed, Pending, CrashLoopBackOff)
Restart counts (alerts if restarts > 5)
K8s cluster connectivity
AI analysis latency
Slack alert delivery
Auto-heal success/failure history

Endpoints

Endpoint	Description
`GET /health`	OpsAgent + k8s connection status
`GET /metrics`	Prometheus metrics
`GET /telemetry`	OTel spans (in-memory dashboard)
`GET /settings`	Current config (auto-heal, poll interval)
`POST /settings/toggle-heal`	Toggle auto-healing on/off
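A small client for these endpoints might look like the sketch below. The paths and methods come from the table above; the base URL is an assumption (for example, a local port-forward to the OpsAgent service), and the helper names `build_request` and `call` are hypothetical. The response format is not documented here, so the sketch returns the raw body as text.

```python
import urllib.request

# Assumption: the API is reachable at this address, for example through
# `kubectl port-forward`. Adjust it to wherever OpsAgent is exposed.
BASE_URL = "http://localhost:8000"


def build_request(method: str, path: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a request for one of the endpoints above."""
    return urllib.request.Request(base_url.rstrip("/") + path, method=method)


def call(method: str, path: str, base_url: str = BASE_URL, timeout: float = 5.0) -> str:
    """Send the request and return the raw response body as text."""
    request = build_request(method, path, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

With OpsAgent reachable, `call("GET", "/health")` would fetch the health status and `call("POST", "/settings/toggle-heal")` would flip auto-healing.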

vs other tools

	OpsAgent	Datadog	k8sgpt
Always-on monitoring	✅	✅	❌ (CLI, manual)
Root cause diagnosis	✅	❌	✅
Auto-healing	✅	❌	❌
Slack alerts	✅	✅	❌
Free / open source	✅	❌	✅
Self-hosted	✅	❌	✅

Roadmap

k8sgpt integration (deeper analysis across 13+ resource types — pods, PVCs, ingress, nodes and more)
Expand healing actions (scale deployments, delete stuck jobs, drain nodes)
Node and deployment analysis (beyond pods)
Multi-cluster support
OpsAgent Cloud (hosted version, connect your cluster in 2 mins)

Contributing

PRs welcome. If you run OpsAgent on a real cluster and find something broken or missing, open an issue.

Built with FastAPI, Groq, Prometheus, OpenTelemetry, and Helm.

OpsAgent 🤖

AI-powered Kubernetes monitoring and auto-healing platform. OpsAgent watches your cluster, diagnoses issues using LLMs, and heals crashed pods automatically — with Slack alerts delivered in real time.

Live Demo → opsagent-five.vercel.app

What it does

OpsAgent continuously monitors your Kubernetes cluster and uses the Llama 3.3 70B model, served through the Groq API, to diagnose pod failures. When a pod crashes, OpsAgent restarts it automatically and sends a Slack notification, so no manual intervention is needed.

🔍 Real-time pod monitoring via Kubernetes API
🧠 AI diagnostics powered by Groq / Llama 3.3 70B
🔧 Auto-healing — detects crashed pods and restarts them automatically
💬 Slack alerts on every heal action
📊 React dashboard with live event stream and cluster health overview
🖥️ Textual TUI for terminal-based monitoring

Tech Stack

Layer	Tech
Backend	Python, FastAPI
AI	Groq API (llama-3.3-70b-versatile)
Kubernetes	minikube, kubectl, Python k8s client
Frontend	React, Tailwind CSS, Framer Motion
Notifications	Slack Webhooks
Containerization	Docker

Getting Started

Prerequisites

Python 3.10+
Node.js 18+
Docker Desktop
minikube
A Groq API key → console.groq.com
A Slack Webhook URL → Slack Incoming Webhooks

1. Clone the repo

git clone https://github.com/shayannab/opsagent.git
cd opsagent

2. Set up environment variables

Create a .env file in the backend/ directory:

GROQ_API_KEY=your_groq_api_key
SLACK_WEBHOOK_URL=your_slack_webhook_url
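As an illustration of how a backend could consume these two variables, here is a minimal sketch. It is not OpsAgent's actual loader: `Settings` and `load_settings` are hypothetical names, and the sketch assumes the variables are already present in the process environment (it does not parse the .env file itself).

```python
import os
from dataclasses import dataclass
from typing import Mapping

# The two variable names shown in the .env example above.
REQUIRED_VARS = ("GROQ_API_KEY", "SLACK_WEBHOOK_URL")


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the two configured values."""
    groq_api_key: str
    slack_webhook_url: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read both variables and fail early if either is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    return Settings(env["GROQ_API_KEY"], env["SLACK_WEBHOOK_URL"])
```

Failing at startup keeps a missing key from surfacing later as a confusing error inside the monitoring loop.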

3. Start minikube

minikube start

4. Run the backend

cd backend
pip install -r requirements.txt
uvicorn main:app --reload --port 8000

5. Run the frontend

cd frontend
npm install
npm run dev

Frontend runs at http://localhost:3000

How auto-healing works

1. OpsAgent polls your minikube cluster every few seconds
2. If a pod enters the CrashLoopBackOff or Failed state, it is flagged
3. Llama 3.3 70B (via Groq) diagnoses the failure and generates a summary
4. OpsAgent restarts the pod via the Kubernetes API
5. A Slack notification is sent with the pod name, status, and AI diagnosis
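The flow above can be sketched as a single function with the external calls injected. This is an illustration under stated assumptions, not the project's implementation: `heal_cycle` is a hypothetical name, and `diagnose`, `restart`, and `notify` are stand-ins for the LLM call, the Kubernetes API call, and the Slack webhook.

```python
from typing import Callable, Iterable

# Pod states that the steps above treat as failures.
FAILED_STATES = {"CrashLoopBackOff", "Failed"}


def heal_cycle(
    pods: Iterable[tuple[str, str]],
    diagnose: Callable[[str, str], str],
    restart: Callable[[str], None],
    notify: Callable[[str], None],
) -> list[str]:
    """Run one pass over (pod_name, status) pairs and return the healed pod names."""
    healed = []
    for name, status in pods:
        if status not in FAILED_STATES:
            continue                       # healthy pod, nothing to do
        summary = diagnose(name, status)   # stand-in for the LLM diagnosis
        restart(name)                      # stand-in for the Kubernetes API call
        notify(f"Pod {name} was {status}; restarted. Diagnosis: {summary}")
        healed.append(name)
    return healed


if __name__ == "__main__":
    messages = []
    healed = heal_cycle(
        [("api", "Running"), ("worker", "CrashLoopBackOff")],
        diagnose=lambda name, status: "placeholder diagnosis",
        restart=lambda name: None,
        notify=messages.append,
    )
    print(healed)    # ['worker']
    print(messages)
```

Passing the three operations in as callables keeps the loop testable without a cluster, an API key, or a Slack workspace.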

Slack Setup

1. Go to api.slack.com/apps and create an app
2. Enable Incoming Webhooks and add it to your workspace
3. Copy the webhook URL
4. Paste it as SLACK_WEBHOOK_URL in your .env file
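To check that the webhook works, a message can be posted with the standard library alone. The sketch below assumes the usual incoming-webhook format (a JSON body with a `text` field); `build_slack_request` and `send_slack_message` are hypothetical helper names, not functions from the OpsAgent codebase.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build a POST request carrying the JSON body that incoming webhooks accept."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_message(webhook_url: str, text: str, timeout: float = 5.0) -> int:
    """Send the message and return the HTTP status code (urlopen raises on errors)."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

Calling `send_slack_message(url, "OpsAgent test message")` with the URL copied in step 3 should make the message appear in the channel the webhook was added to.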

Project Structure

opsagent/
├── frontend/          # React App
│   └── src/
├── charts/            # Helm charts
├── models/            # Data models
├── routes/            # FastAPI route handlers
├── services/          # Business logic & integrations
├── tests/             # Test suite
├── main.py            # FastAPI app entrypoint
├── worker.py          # Background worker
├── start.py           # Startup script
├── Dockerfile
├── requirements.txt
└── README.md

Author

Built by Shayanna

License

MIT
