🧠 Deep Agent System for Autonomous CI/CD Healing
Most hackathon teams build: "AI that explains logs"
We built: AI that changes reality
┌──────────────────────┐
│       React UI       │  ← Visualization + Trigger
└──────────┬───────────┘
           │ HTTP
┌──────────▼────────────────────────┐
│   AI Backend (LangGraph System)   │  ← Deep Agent Brain
│                                   │
│   • Supervisor Agent              │
│   • Failure Detection Agent       │
│   • Root Cause Agent              │
│   • Fix Generator Agent           │
│   • Validation Agent              │
│   • Learning Agent                │
└──────────┬────────────────────────┘
           │
┌──────────▼─────────┐
│      Postgres      │  ← Organizational Memory
│ Memory + Learning  │
└────────────────────┘
LangGraph = Operating System
Agents = Processes (not containers)
Agents are runtime graphs that live in the backend, not separate services.
- Docker & Docker Compose
- OpenAI API Key
- GitHub Token (optional, for webhook integration)
- Clone and navigate to the project:
cd reforge
- Configure environment variables:
Backend:
cp backend/.env.example backend/.env
# Edit backend/.env and add your OPENAI_API_KEY
UI:
cp ui/.env.example ui/.env
# Default backend URL is already set
- Start the system:
docker compose up --build
This will start:
- Postgres on port 5432
- Backend on port 8000
- UI on port 3000
- Access the dashboard:
http://localhost:3000
Supervisor Agent
│
▼
Failure Detection Agent
│
▼
Root Cause Agent
│
▼
Fix Generator Agent
│
▼
Validation Agent
│
├── success → Learning Agent
└── fail → Retry Reasoning Loop
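The flow above can be sketched in plain Python as a small state machine. This is not the project's LangGraph code: the `HealingState` fields, the function names, the `MAX_ATTEMPTS` limit, and the choice to re-enter the loop at the Root Cause Agent after a failed validation are all assumptions made for illustration, and each agent body is a stub where the real system would presumably call an LLM.

```python
from dataclasses import dataclass, field
from typing import List

MAX_ATTEMPTS = 3  # assumption: the source does not state a retry limit


@dataclass
class HealingState:
    """State shared by all agents; field names are illustrative."""
    error_message: str
    root_cause: str = ""
    fix: str = ""
    valid: bool = False
    attempts: int = 0
    trace: List[str] = field(default_factory=list)


def detect_failure(state: HealingState) -> None:
    state.trace.append("failure_detection")


def find_root_cause(state: HealingState) -> None:
    state.trace.append("root_cause")
    state.root_cause = f"attempt {state.attempts + 1}: {state.error_message}"


def generate_fix(state: HealingState) -> None:
    state.trace.append("fix_generator")
    state.attempts += 1
    state.fix = f"candidate fix #{state.attempts}"


def validate_fix(state: HealingState) -> None:
    state.trace.append("validation")
    # Stand-in rule so the retry path is exercised:
    # candidates pass from the second attempt on.
    state.valid = state.attempts >= 2


def learn(state: HealingState) -> None:
    state.trace.append("learning")


def supervise(state: HealingState) -> HealingState:
    """Supervisor: run the agents in order and retry on failed validation."""
    detect_failure(state)
    while state.attempts < MAX_ATTEMPTS:
        find_root_cause(state)
        generate_fix(state)
        validate_fix(state)
        if state.valid:
            learn(state)  # success branch
            break
        # fail branch: loop back and reason again
    return state
```

With the stub validator, `supervise(HealingState("Cannot find module express"))` goes through one failed validation, retries once, and then reaches the learning step.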
- Supervisor Agent: Orchestrates the entire healing process, decides next steps
- Failure Detection Agent: Categorizes failures and assesses severity
- Root Cause Agent: Performs deep analysis to find the actual root cause
- Fix Generator Agent: Creates specific fixes (YAML, config, code changes)
- Validation Agent: Validates if the fix will work before applying
- Learning Agent: Extracts patterns and updates knowledge base
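To make the Failure Detection Agent's job (category plus severity) more concrete, here is a rule-based stand-in. The real agent most likely asks an LLM; the `FailureReport` type, the `RULES` table, and the category and severity labels below are hypothetical and only show the shape of the input and output.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureReport:
    category: str
    severity: str


# Hypothetical rules: (pattern, category, severity). First match wins.
RULES = [
    (re.compile(r"cannot find module|no module named|npm install", re.I),
     "dependency", "medium"),
    (re.compile(r"yaml|syntax error", re.I), "configuration", "low"),
    (re.compile(r"out of memory|timed? ?out", re.I), "infrastructure", "high"),
]


def classify_failure(error_message: str, log_snippet: str = "") -> FailureReport:
    """Return the first matching category, or 'unknown' if no rule matches."""
    text = f"{error_message}\n{log_snippet}"
    for pattern, category, severity in RULES:
        if pattern.search(text):
            return FailureReport(category, severity)
    return FailureReport("unknown", "medium")
```

A fixed rule table like this is cheap and predictable, which can make it a useful fallback when the LLM call is unavailable.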
The system maintains organizational memory through:
- pipeline_runs - All pipeline executions
- failures - Detected failures with context
- applied_fixes - Fixes that were applied
- agent_reasoning - Complete thought process of agents
- success_metrics - Success rates and healing metrics
- learned_patterns - Knowledge base of patterns
- generated_pipelines - AI-generated pipeline configs
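As a rough illustration of how a table such as applied_fixes could feed the success metrics, the sketch below uses an in-memory SQLite database as a stand-in for Postgres. Only the table name comes from the list above; every column name and the sample rows are assumptions, and the real schema lives in db/init.sql.

```python
import sqlite3

# In-memory SQLite stands in for the Postgres container in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE applied_fixes (
        id INTEGER PRIMARY KEY,
        failure_category TEXT NOT NULL,  -- hypothetical column
        fix_description TEXT NOT NULL,   -- hypothetical column
        succeeded INTEGER NOT NULL       -- hypothetical column: 1 = fix worked
    )
    """
)

# Made-up sample rows, for illustration only.
sample_rows = [
    ("dependency", "add express to package.json", 1),
    ("dependency", "pin npm version", 0),
    ("configuration", "fix YAML indentation", 1),
]
conn.executemany(
    "INSERT INTO applied_fixes (failure_category, fix_description, succeeded) "
    "VALUES (?, ?, ?)",
    sample_rows,
)


def success_rate_by_category(connection: sqlite3.Connection) -> dict:
    """Fraction of applied fixes that succeeded, per failure category."""
    cursor = connection.execute(
        "SELECT failure_category, AVG(succeeded) "
        "FROM applied_fixes GROUP BY failure_category"
    )
    return {category: rate for category, rate in cursor}
```

A per-category rate like this is one simple signal a learning step could use to prefer fixes that have worked before.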
- Open the dashboard at http://localhost:3000
- Click "🔥 Trigger Test Healing"
- Watch the Deep Agent System:
  - Detect the failure
  - Analyze root cause
  - Generate a fix
  - Validate the solution
  - Learn from the experience
- Check the reasoning chain in browser console
- View metrics and agent status on dashboard
- Go to your GitHub repository settings
- Add webhook: http://your-server:8000/api/webhook/github
- Select events: Workflow runs
- Add webhook secret to backend/.env
When a workflow fails:
- GitHub sends webhook
- SAIGE detects failure
- Healing process starts automatically
- Fix is applied
- Workflow is retriggered
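The first two steps of this sequence can be sketched with the standard library. GitHub signs each delivery with the webhook secret and sends the HMAC-SHA256 hex digest in the `X-Hub-Signature-256` header, prefixed with `sha256=`; a `workflow_run` event reports the outcome in `workflow_run.conclusion`. Whether main.py performs these checks in this way is an assumption, and the function names are hypothetical.

```python
import hashlib
import hmac
import json


def verify_signature(secret: str, payload: bytes, signature_header: str) -> bool:
    """Check the X-Hub-Signature-256 header against the raw request body."""
    digest = hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()
    expected = f"sha256={digest}"
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(
        expected.encode("utf-8"), signature_header.encode("utf-8")
    )


def is_failed_workflow_run(payload: bytes) -> bool:
    """True when the event describes a workflow run that completed with a failure."""
    event = json.loads(payload)
    run = event.get("workflow_run") or {}
    return event.get("action") == "completed" and run.get("conclusion") == "failure"
```

The signature must be computed over the raw request bytes, before any JSON parsing, because re-serializing the body can change its byte representation.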
reforge/
├── docker-compose.yml # 3-container orchestration
├── backend/
│ ├── main.py # Deep Agent System (ONLY Python file)
│ ├── Dockerfile
│ ├── requirements.txt
│ └── .env.example
├── ui/
│ ├── src/
│ │ ├── pages/
│ │ │ └── Dashboard.tsx # Main dashboard
│ │ ├── App.tsx
│ │ └── main.tsx
│ ├── Dockerfile
│ ├── package.json
│ └── vite.config.ts
└── db/
    └── init.sql # Database schema
- GET / - System status
- GET /health - Health check
- POST /api/heal - Trigger healing process
- POST /api/webhook/github - GitHub webhook receiver
- GET /api/pipelines - List pipeline runs
- GET /api/failures - List failures
- GET /api/learnings - List learned patterns
- GET /api/metrics - Get success metrics
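The POST /api/heal endpoint can also be called from Python. The sketch below uses only the standard library and sends a body with a `failure` object and an `auto_apply` flag, the request shape documented for this endpoint. The helper names are hypothetical, and the response is assumed (not confirmed by the source) to be JSON.

```python
import json
import urllib.request

BACKEND_URL = "http://localhost:8000"  # backend port from the setup steps

# Sample failure record; the values are placeholders.
EXAMPLE_FAILURE = {
    "run_id": "test-123",
    "repository": "demo/test-repo",
    "branch": "main",
    "commit_sha": "abc123",
    "stage": "build",
    "error_message": "Build failed: npm install error",
    "log_snippet": 'Error: Cannot find module "express"',
    "timestamp": "2024-01-01T00:00:00Z",
}


def build_heal_request(failure: dict, auto_apply: bool = True) -> urllib.request.Request:
    """Build (but do not send) the POST request for /api/heal."""
    body = json.dumps({"failure": failure, "auto_apply": auto_apply}).encode("utf-8")
    return urllib.request.Request(
        f"{BACKEND_URL}/api/heal",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_healing(failure: dict, auto_apply: bool = True, timeout: float = 60.0):
    """Send the request; assumes the backend answers with a JSON document."""
    request = build_heal_request(failure, auto_apply)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `trigger_healing(EXAMPLE_FAILURE)` requires the backend to be running. Building the request separately from sending it keeps the payload easy to inspect and test offline.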
curl -X POST http://localhost:8000/api/heal \
-H "Content-Type: application/json" \
-d '{
"failure": {
"run_id": "test-123",
"repository": "demo/test-repo",
"branch": "main",
"commit_sha": "abc123",
"stage": "build",
"error_message": "Build failed: npm install error",
"log_snippet": "Error: Cannot find module \"express\"",
"timestamp": "2024-01-01T00:00:00Z"
},
"auto_apply": true
}'
- Not a chatbot - It's an autonomous system that takes action
- Deep Agent Architecture - Real LangGraph implementation, not simple chaining
- Organizational Memory - Learns and improves over time
- Event-Driven - Webhook integration, not polling
- Production-Ready - Proper database, containerization, error handling
Backend:
cd backend
pip install -r requirements.txt
python main.py
UI:
cd ui
npm install
npm run dev
Database:
docker exec -it saige_postgres psql -U saige -d cicd_ai
backend/.env:
OPENAI_API_KEY=your_key_here
GITHUB_TOKEN=your_token_here
GITHUB_WEBHOOK_SECRET=your_secret_here
DATABASE_URL=postgresql://saige:saige@postgres:5432/cicd_ai
ui/.env:
VITE_BACKEND_URL=http://localhost:8000
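One way the backend could read these variables at startup is sketched below. The `BackendConfig` class and `load_config` function are hypothetical. Treating `OPENAI_API_KEY` and `DATABASE_URL` as required is an assumption. `GITHUB_TOKEN` is optional per the prerequisites, and `GITHUB_WEBHOOK_SECRET` is assumed optional along with it.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: the backend cannot start without these two values.
REQUIRED = ("OPENAI_API_KEY", "DATABASE_URL")


@dataclass(frozen=True)
class BackendConfig:
    openai_api_key: str
    database_url: str
    github_token: Optional[str] = None           # only needed for webhooks
    github_webhook_secret: Optional[str] = None  # only needed for webhooks


def load_config(env: Mapping[str, str] = os.environ) -> BackendConfig:
    """Fail fast with one clear error if a required variable is missing or empty."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return BackendConfig(
        openai_api_key=env["OPENAI_API_KEY"],
        database_url=env["DATABASE_URL"],
        github_token=env.get("GITHUB_TOKEN") or None,
        github_webhook_secret=env.get("GITHUB_WEBHOOK_SECRET") or None,
    )
```

Passing the environment in as a mapping makes the loader testable with a plain dictionary instead of the real process environment.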
This system demonstrates:
- LangGraph for agent orchestration
- Event-driven architecture with webhooks
- Persistent memory with PostgreSQL
- Learning systems that improve over time
- Real-world DevOps automation
Planned enhancements:
- Multi-repository learning
- Advanced pattern recognition with embeddings
- Slack/Discord notifications
- Pipeline generation from natural language
- A/B testing of fixes
- Cost optimization recommendations
MIT License - See LICENSE file
This is a hackathon project, but contributions are welcome!
Built with 🧠 by the SAIGE Team
Autonomous DevOps Intelligence - Not just explaining logs, but changing reality