AI-Powered Kubernetes Operations Platform
Kubernaut is an open-source Kubernetes AIOps platform that combines AI-driven investigation with automated remediation. It analyzes Kubernetes incidents, orchestrates multi-step remediation workflows, and executes validated actions, targeting a reduction in mean time to resolution from 60 minutes to under 5 minutes while maintaining operational safety.
Kubernaut automates the entire incident response lifecycle for Kubernetes:
- Signal Ingestion: Receives alerts from Prometheus AlertManager and Kubernetes Events (see the webhook sketch below)
- AI Analysis: Uses HolmesGPT for root cause analysis and remediation recommendations
- Workflow Orchestration: Executes multi-step remediation playbooks via Tekton Pipelines
- Continuous Learning: Tracks effectiveness and improves recommendations over time
- Multi-Source Signal Processing: Prometheus alerts, Kubernetes events, with deduplication and storm detection
- AI-Powered Root Cause Analysis: HolmesGPT integration for intelligent investigation
- Remediation Playbooks: Industry-standard, versioned remediation patterns (PagerDuty/Google SRE-aligned)
- Safety-First Execution: Comprehensive validation, dry-run mode, and rollback capabilities
- Continuous Learning: Multi-dimensional effectiveness tracking (incident type, playbook, action)
- Production-Ready: 289 tests passing, 95% confidence across all services
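The Signal Ingestion step above speaks Prometheus AlertManager's webhook protocol. As a rough illustration (not the Gateway Service's actual API: the endpoint path, port, and struct fields here are assumptions), a minimal Go handler for that payload could look like this; in Kubernaut the handler would create a RemediationRequest CRD instead of logging:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Alert mirrors the subset of AlertManager's webhook payload a gateway
// typically needs: firing/resolved status plus labels and annotations.
type Alert struct {
	Status      string            `json:"status"`
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
}

// WebhookMessage is the top-level AlertManager webhook body.
type WebhookMessage struct {
	Status string  `json:"status"`
	Alerts []Alert `json:"alerts"`
}

func handleAlerts(w http.ResponseWriter, r *http.Request) {
	var msg WebhookMessage
	if err := json.NewDecoder(r.Body).Decode(&msg); err != nil {
		http.Error(w, "invalid payload", http.StatusBadRequest)
		return
	}
	for _, a := range msg.Alerts {
		// Real ingestion would deduplicate and create a RemediationRequest CRD here.
		log.Printf("received %s alert: %s", a.Status, a.Labels["alertname"])
	}
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	// The path is illustrative; see the Gateway Service docs for the real one.
	http.HandleFunc("/api/v1/signals/prometheus", handleAlerts)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```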
Kubernaut follows a microservices architecture with 10 services (4 CRD controllers + 6 stateless services):
- Gateway Service receives signals (Prometheus alerts, K8s events) and creates RemediationRequest CRDs
- Remediation Orchestrator coordinates the lifecycle across 4 specialized CRD controllers:
- Signal Processing: Enriches signals with Kubernetes context
- AI Analysis: Performs HolmesGPT investigation and generates recommendations
- Remediation Execution: Orchestrates Tekton Pipelines for multi-step workflows
- Notification: Delivers multi-channel notifications (Slack, Email, etc.)
- Data Storage Service provides centralized PostgreSQL access (ADR-032)
- Effectiveness Monitor tracks outcomes and feeds learning back to AI
Kubernaut uses Kubernetes Custom Resources (CRDs) for all inter-service communication (see the sketch after this list), enabling:
- Event-driven, resilient workflows
- Built-in retry and reconciliation
- Complete audit trail
- Horizontal scaling
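To make the CRD hand-off concrete, here is a minimal sketch of what a Gateway-created custom resource could look like as a Go type with kubebuilder markers. The field names are illustrative assumptions only; CRD_SCHEMAS.md holds the authoritative definitions.

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// RemediationRequestSpec is an illustrative subset of what the Gateway
// might record about an incoming signal.
type RemediationRequestSpec struct {
	// SignalSource identifies the origin (e.g. "prometheus", "kubernetes-event").
	SignalSource string `json:"signalSource"`
	// Fingerprint is the deduplication key computed at ingestion.
	Fingerprint string `json:"fingerprint"`
	// Labels carries the original alert/event labels for later enrichment.
	Labels map[string]string `json:"labels,omitempty"`
}

// RemediationRequestStatus is written by downstream controllers, which is
// what yields the audit trail listed above.
type RemediationRequestStatus struct {
	Phase   string `json:"phase,omitempty"` // e.g. "Processing", "Analyzing", "Executing"
	Message string `json:"message,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
type RemediationRequest struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   RemediationRequestSpec   `json:"spec,omitempty"`
	Status RemediationRequestStatus `json:"status,omitempty"`
}
```

Because every hand-off is a CRD write, each step is persisted in etcd and picked up by the next controller's reconcile loop; that is where the retry, audit, and scaling properties above come from.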
Current Phase: Phases 3 & 4 Running Simultaneously - 4 of 8 services production-ready (50%)
| Service | Status | Purpose | BR Coverage |
|---|---|---|---|
| Gateway Service | ✅ v1.0 PRODUCTION-READY | Signal ingestion & deduplication | 20 BRs (240 tests: 120U+114I+6E2E) |
| Data Storage Service | ✅ Phase 1 PRODUCTION-READY | REST API Gateway for PostgreSQL (ADR-032) | 34 BRs (~535 tests) |
| HolmesGPT API | ✅ v3.2 PRODUCTION-READY | AI investigation wrapper | 47 BRs (172 tests, RFC 7807) |
| Notification Service | ✅ PRODUCTION-READY | Multi-channel delivery | 12 BRs (249 tests: 140U+97I+12E2E) |
| Signal Processing | 🚧 Phase 3 (In Progress) | Signal enrichment | - |
| AI Analysis | 🚧 Phase 4 (In Progress) | AI-powered analysis | - |
| Remediation Execution | 🚧 Phase 3 (In Progress) | Tekton workflow orchestration | - |
| Remediation Orchestrator | ⏸️ Phase 5 | Cross-CRD coordination | - |
| Dynamic Toolset | ❌ Deferred to V2.0 | Service discovery (DD-016) | 8 BRs (redundant with HolmesGPT-API) |
| Effectiveness Monitor | ❌ Deferred to V1.1 | Continuous improvement (DD-017) | 10 BRs (requires 8+ weeks of data) |
Timeline: V1.0 target: End of December 2025 | Parallel development strategy: Phases 3 & 4 running simultaneously
Recent Updates (December 1, 2025):
- 🔄 Parallel Phase Development: Phase 3 (Signal Processing + Remediation Execution) and Phase 4 (AI Analysis) running simultaneously to validate API contracts and prevent integration rework
- ⏸️ Effectiveness Monitor Deferred to V1.1: Per DD-017, deferred due to year-end timeline constraints (requires 8+ weeks of remediation data for meaningful assessments)
- ✅ Notification Service Production-Ready: 249 tests (140U+97I+12E2E), Kind-based E2E, DD-TEST-001 compliant, zero flaky tests
- ⏸️ Dynamic Toolset Deferred to V2.0: Per DD-016 (V1.x uses static config, redundant with HolmesGPT-API's built-in Prometheus discovery)
- ✅ HolmesGPT API v3.2: Recovery prompt implementation complete (DD-RECOVERY-002/003), DetectedLabels integration, RFC 7807 + Graceful Shutdown
- ✅ Gateway Service v1.0: 240 tests (120U+114I+6E2E), 20 BRs, production-ready
- ✅ Data Storage Service Phase 1: Unified audit table (ADR-034), PostgreSQL access layer (ADR-032), ~535 tests
- 📊 V1.0 Service Count: 8 active services (11 original, minus the deprecated Context API and the deferred Dynamic Toolset and Effectiveness Monitor)
Prerequisites:
- Go 1.23.9+ for building services
- Kubernetes cluster (Kind recommended for development)
- Redis (for Gateway service deduplication)
- PostgreSQL (for data persistence)
- kubectl with cluster access
```bash
# Install CRDs
make install

# Build all CRD controllers (single binary for development)
make build
# Creates: bin/manager (includes all CRD controllers)

# Build individual services
go build -o bin/gateway-service ./cmd/gateway
go build -o bin/dynamic-toolset ./cmd/dynamictoolset
go build -o bin/data-storage ./cmd/datastorage
```

```bash
# Setup Kind cluster for testing
make test-gateway-setup

# Run tests by tier
make test              # Unit tests (70%+ coverage)
make test-integration  # Integration tests (>50% coverage)
make test-e2e          # End-to-end tests (<10% coverage)

# Clean up
make test-gateway-teardown
```

Kubernaut services use Kustomize overlays for cross-platform deployment (OpenShift + vanilla Kubernetes).
| Service | Status | Deployment Path |
|---|---|---|
| Gateway + Redis | ✅ Production-Ready | deploy/gateway/ |
| HolmesGPT API | ⏸️ Coming Soon | deploy/holmesgpt-api/ |
| PostgreSQL | ⏸️ Coming Soon | deploy/postgres/ |
```bash
# Deploy Gateway + Redis to OpenShift
oc apply -k deploy/gateway/overlays/openshift/

# Verify
oc get pods -n kubernaut-system -l app.kubernetes.io/component=gateway
```

```bash
# Deploy Gateway + Redis to Kubernetes
kubectl apply -k deploy/gateway/overlays/kubernetes/

# Verify
kubectl get pods -n kubernaut-system -l app.kubernetes.io/component=gateway
```

Each service follows this structure:
```text
deploy/[service]/
├── base/                      # Platform-agnostic manifests
│   ├── kustomization.yaml
│   └── *.yaml                 # K8s resources
├── overlays/
│   ├── openshift/             # OpenShift-specific (SCC fixes)
│   │   ├── kustomization.yaml
│   │   └── patches/
│   └── kubernetes/            # Vanilla K8s (uses base)
│       └── kustomization.yaml
└── README.md                  # Service-specific deployment guide
```
Key Differences:
- OpenShift: Removes hardcoded `runAsUser`/`fsGroup` for SCC compatibility
- Kubernetes: Uses base manifests with explicit security contexts
- Gateway Service: Signal ingestion + deduplication + storm detection (see the deduplication sketch below)
- HolmesGPT API: Coming soon
- PostgreSQL: Coming soon
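For a sense of how the Gateway's Redis-backed deduplication can work, here is a minimal sketch assuming a SETNX-with-TTL scheme via the go-redis client. The key prefix, hash construction, and window are illustrative, not the actual implementation:

```go
package main

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"time"

	"github.com/redis/go-redis/v9"
)

// fingerprint derives a stable deduplication key by hashing labels in
// sorted order, so label ordering cannot change the key.
func fingerprint(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := sha256.New()
	for _, k := range keys {
		fmt.Fprintf(h, "%s=%s;", k, labels[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// isDuplicate uses SETNX with a TTL: the first writer wins, and any
// identical signal seen again within the window counts as a duplicate.
func isDuplicate(ctx context.Context, rdb *redis.Client, fp string, window time.Duration) (bool, error) {
	created, err := rdb.SetNX(ctx, "dedup:"+fp, 1, window).Result()
	if err != nil {
		return false, err
	}
	return !created, nil
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	fp := fingerprint(map[string]string{"alertname": "PodCrashLooping", "namespace": "prod"})
	dup, err := isDuplicate(context.Background(), rdb, fp, 5*time.Minute)
	if err != nil {
		panic(err)
	}
	fmt.Println("duplicate:", dup)
}
```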
New to Kubernaut development? Start here:
📖 Developer Guide → START HERE
Complete onboarding guide for contributors:
- Adding a new service → 12-day implementation plan with APDC-TDD methodology
- Extending existing services → Feature implementation patterns
- Development environment setup → Prerequisites, tools, IDE configuration
- Testing strategy → Defense-in-depth pyramid (Unit 70%+ / Integration >50% / E2E <10%)
- Deployment → Kustomize overlays for OpenShift + Kubernetes
| I want to... | Go to... |
|---|---|
| Implement a new service | SERVICE_IMPLEMENTATION_PLAN_TEMPLATE.md (11-12 days) |
| Extend an existing service | FEATURE_EXTENSION_PLAN_TEMPLATE.md (3-12 days) |
| Document a service | SERVICE_DOCUMENTATION_GUIDE.md |
| Understand architecture | Kubernaut CRD Architecture |
| Learn testing strategy | 03-testing-strategy.mdc |
| Follow Go standards | 02-go-coding-standards.mdc |
- Approved Microservices Architecture: Service boundaries and V1/V2 roadmap
- Multi-CRD Reconciliation Architecture: CRD communication patterns
- CRD Schemas: Authoritative CRD field definitions
- Tekton Execution Architecture: Workflow orchestration with Tekton
- CRD Controllers: RemediationOrchestrator, SignalProcessing, AIAnalysis, WorkflowExecution
- Stateless Services: Gateway, Dynamic Toolset, Data Storage, HolmesGPT API, Notification, Effectiveness Monitor
- Testing Strategy: Defense-in-depth testing pyramid
- CRD Controller Templates: Production-ready scaffolding (saves 40-60% development time)
- Design Decisions: All architectural decisions with alternatives
Kubernaut follows a defense-in-depth testing pyramid (see the example spec after this list):
- Unit Tests: 70%+ coverage - Extensive business logic with external mocks only
- Integration Tests: >50% coverage - Cross-service coordination, CRD-based flows, microservices architecture
- E2E Tests: <10% coverage - Critical end-to-end user journeys
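Since the project mandates Ginkgo/Gomega BDD tests (see Code Standards below), a unit spec looks roughly like this; the dedupe closure is a hypothetical stand-in for a real unit under test, with external dependencies mocked:

```go
package gateway_test

import (
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

func TestGateway(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Gateway Suite")
}

var _ = Describe("Signal deduplication", func() {
	It("treats a repeated fingerprint within the window as a duplicate", func() {
		// Hypothetical unit under test; real specs would mock Redis/K8s.
		seen := map[string]bool{}
		dedupe := func(fp string) bool {
			dup := seen[fp]
			seen[fp] = true
			return dup
		}

		Expect(dedupe("abc")).To(BeFalse())
		Expect(dedupe("abc")).To(BeTrue())
	})
})
```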
Current Test Status: ~1,196 tests passing (100% pass rate across all tiers)
| Service | Unit Specs | Integration Specs | E2E Specs | Total | Confidence |
|---|---|---|---|---|---|
| Gateway v1.0 | 120 | 114 | 6 (+12 deferred to v1.1) | 240 | 100% |
| Data Storage | 475 | ~60 | - | ~535 | 98% |
| Dynamic Toolset | - | - | - | Deferred to V2.0 | DD-016 |
| Notification Service | 140 | 97 | 12 | 249 | 100% |
| HolmesGPT API v3.2 | 151 | 21 | - | 172 | 98% |
Total: ~886 unit specs + ~292 integration specs + 18 E2E specs = ~1,196 test specs
Note: Gateway v1.0 has 2 E2E specs (Storm TTL, K8s API Rate Limiting), 12 additional E2E tests deferred to v1.1. Notification Service has 12 E2E specs (Kind-based file delivery + metrics validation). Dynamic Toolset (245 tests) deferred to V2.0 per DD-016. Integration spec counts are estimates.
Each CRD controller requires specific Kubernetes permissions. See RBAC documentation for details.
- Gateway Service: Network-level security (NetworkPolicies + TLS)
- CRD Controllers: Kubernetes ServiceAccount authentication
- Inter-service: Service mesh (Istio/Linkerd) with mTLS
- Metrics: All services expose Prometheus metrics on `:9090/metrics` (see the sketch after this list)
- Health Checks: `GET /health` and `GET /ready` endpoints on all services
- Logging: Structured JSON logging with configurable levels
- Tracing: OpenTelemetry support (planned for V1.1)
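A minimal sketch of these conventions using prometheus/client_golang; the metric name, signal endpoint, and readiness logic are illustrative assumptions:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// signalsTotal is an illustrative counter; each service defines its own metrics.
var signalsTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "kubernaut_signals_received_total",
	Help: "Total signals received by the service.",
})

func main() {
	prometheus.MustRegister(signalsTotal)

	// Prometheus metrics on :9090/metrics, as described above.
	metricsMux := http.NewServeMux()
	metricsMux.Handle("/metrics", promhttp.Handler())
	go func() { log.Fatal(http.ListenAndServe(":9090", metricsMux)) }()

	// Liveness and readiness probes.
	http.HandleFunc("/health", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	http.HandleFunc("/ready", func(w http.ResponseWriter, _ *http.Request) {
		// A real readiness check would verify DB/Redis connectivity.
		w.WriteHeader(http.StatusOK)
	})

	// Illustrative business endpoint that increments the counter.
	http.HandleFunc("/signal", func(w http.ResponseWriter, _ *http.Request) {
		signalsTotal.Inc()
		w.WriteHeader(http.StatusAccepted)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```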
- Go: Standard conventions with comprehensive error handling
- Testing: Ginkgo/Gomega BDD tests, >70% unit coverage
- Documentation: Comprehensive inline documentation
- CRD Changes: Update CRD_SCHEMAS.md
- Create feature branch from `main`
- Implement with comprehensive tests
- Follow Service Development Order
- Update relevant documentation
- Code review and merge
Apache License 2.0
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: Comprehensive guides in the `docs/` directory
Kubernaut - Building the next evolution of Kubernetes operations through intelligent, CRD-based microservices that learn and adapt.
Current Status: Phases 3 & 4 Running Simultaneously - 4 of 8 services production-ready (50%) | 1 deferred to V2.0 (DD-016), 1 deferred to V1.1 (DD-017) | Target: End of December 2025 for V1.0 completion
Parallel Development Strategy: Final implementation phases (Phase 3: Signal Processing + Remediation Execution, Phase 4: AI Analysis) running simultaneously to validate API contracts and prevent integration rework. This approach ensures solid cross-service contracts before system integration.
Version: 1.1 | Date: 2025-11-15 | Status: Updated

Changes in 1.1:
- Service Naming Correction: Replaced all instances of "Workflow Engine" with "Remediation Execution Engine" per ADR-035
- Terminology Alignment: Updated to match the authoritative naming convention (RemediationExecution CRD, Remediation Execution Engine architectural concept)
- Documentation Consistency: Aligned with the NAMING_CONVENTION_REMEDIATION_EXECUTION.md reference document

Changes in 1.0:
- Initial document creation