Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -1 +1,2 @@
./CLAUDE.md
istio-*/
66 changes: 53 additions & 13 deletions CLAUDE.md
@@ -1,7 +1,11 @@
# CLAUDE.md

> **📋 Navigation:** [🏠 Main README](README.md) • [🎯 Goals & Vision](GOALS.md) • [🚀 Getting Started](docs/getting-started.md) • [📖 Usage Guide](docs/usage.md) • [🏗️ Architecture](docs/architecture.md)

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

> **🎯 Project Context:** This project demonstrates enterprise-grade AI/ML inference patterns. See [GOALS.md](GOALS.md) for complete vision and objectives.

## Project Overview

**Inference-in-a-Box** is a comprehensive Kubernetes-based demonstration of enterprise-grade AI/ML model serving built on cloud-native technologies. It is an infrastructure-as-code project showcasing production-ready deployment patterns with Envoy AI Gateway, Istio service mesh, KServe, and comprehensive observability.
@@ -72,6 +76,12 @@ curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
-d '{"config": {"tenantId": "tenant-a", "publicHostname": "api.router.inference-in-a-box"}}' \
http://localhost:8085/api/models/my-model/publish

# Publish OpenAI-compatible model with token-based rate limiting
curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"config": {"tenantId": "tenant-a", "publicHostname": "api.router.inference-in-a-box", "modelType": "openai", "rateLimiting": {"tokensPerHour": 100000}}}' \
http://localhost:8085/api/models/llama-3-8b/publish

# Update published model configuration
curl -X PUT -H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
@@ -101,6 +111,19 @@ curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \

# Get JWT tokens for testing
./scripts/get-jwt-tokens.sh

# Test OpenAI-compatible model
export AI_GATEWAY_URL="http://localhost:8080"
export JWT_TOKEN="<your-jwt-token>"

# Chat completion request
curl -H "Authorization: Bearer $JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -H "x-ai-eg-model: llama-3-8b" \
  $AI_GATEWAY_URL/v1/chat/completions \
  -d '{
    "model": "llama-3-8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
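The `tokensPerHour` limit in the publish call above can be pictured as a sliding-window budget over consumed LLM tokens. A minimal Python sketch of that idea, assuming per-tenant/model keying and a one-hour window (the platform's gateway enforces the real limits, and the exact semantics may differ):

```python
import time
from collections import defaultdict, deque


class TokenRateLimiter:
    """Sliding-window limiter over LLM tokens consumed per tenant/model key.

    Hypothetical sketch of `tokensPerHour` semantics; not the gateway's
    actual implementation.
    """

    def __init__(self, tokens_per_hour, window_s=3600):
        self.limit = tokens_per_hour
        self.window_s = window_s
        self.events = defaultdict(deque)  # key -> deque of (timestamp, tokens)

    def allow(self, key, tokens, now=None):
        now = time.monotonic() if now is None else now
        q = self.events[key]
        # Drop events that have aged out of the window.
        while q and now - q[0][0] >= self.window_s:
            q.popleft()
        used = sum(t for _, t in q)
        if used + tokens > self.limit:
            return False
        q.append((now, tokens))
        return True


limiter = TokenRateLimiter(tokens_per_hour=100_000)
print(limiter.allow("tenant-a/llama-3-8b", 60_000, now=0.0))    # True
print(limiter.allow("tenant-a/llama-3-8b", 50_000, now=10.0))   # False: would exceed 100k
print(limiter.allow("tenant-a/llama-3-8b", 50_000, now=3700.0)) # True: first event expired
```

The `key` combines tenant and model so each published model gets an independent budget, mirroring the per-model rate limiting described below.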

### Build & Container Management
@@ -120,14 +143,15 @@ curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
# Management Service UI & API
kubectl port-forward svc/management-service 8085:80

# Observability Stack
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
kubectl port-forward -n monitoring svc/kiali 20001:20001
# Observability Stack (see docs/usage.md for complete service access)
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 # Grafana
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 # Prometheus
kubectl port-forward -n monitoring svc/kiali 20001:20001 # Kiali

# AI Gateway & Auth
kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80
kubectl port-forward -n default svc/jwt-server 8081:8080
# Service Access (see docs/usage.md for complete reference)
kubectl port-forward -n envoy-gateway-system svc/envoy-ai-gateway 8080:80 # AI Gateway
kubectl port-forward svc/management-service 8085:80 # Management UI/API
kubectl port-forward -n default svc/jwt-server 8081:8080 # JWT Server
```

## Architecture
@@ -139,14 +163,19 @@ This platform implements a **dual-gateway architecture** where external traffic

### Technology Stack Integration
- **Kind Cluster**: Local Kubernetes cluster (`inference-in-a-box`)
- **Envoy AI Gateway**: AI-specific gateway with JWT validation and model routing
- **Envoy AI Gateway**: AI-specific gateway with JWT validation, model routing, and OpenAI API compatibility
- **EnvoyExtensionPolicy**: External processor configuration for AI-specific routing
- **Model-aware routing**: Using x-ai-eg-model header for efficient model selection
- **Protocol translation**: OpenAI to KServe format conversion
- **Istio Service Mesh**: Zero-trust networking with automatic mTLS between services
- **KServe**: Kubernetes-native serverless model serving with auto-scaling
- **Knative**: Serverless framework enabling scale-to-zero capabilities
- **Management Service**: Go backend with embedded React frontend for platform administration
- **Model Publishing**: Full-featured model publishing and management system
- **Public Hostname Configuration**: Configurable external access via `api.router.inference-in-a-box`
- **Rate Limiting**: Per-model rate limiting with configurable limits
- **Rate Limiting**: Per-model rate limiting with configurable limits (requests and tokens)
- **OpenAI Compatibility**: Automatic detection and configuration for LLM models
- **Model Testing**: Interactive inference testing with support for both traditional and OpenAI formats
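The protocol-translation bullet above can be illustrated with a small sketch: an OpenAI chat-completion body is flattened into a KServe v2 inference payload. The field mapping shown is an assumption for illustration; the gateway's external processor defines the real conversion:

```python
def openai_to_kserve(openai_req: dict) -> dict:
    """Sketch of OpenAI chat-completion -> KServe v2 payload translation.

    Simplified: chat messages are flattened into a single prompt string.
    """
    prompt = "\n".join(
        f"{m['role']}: {m['content']}" for m in openai_req["messages"]
    )
    return {
        "inputs": [
            {
                "name": "prompt",
                "shape": [1],
                "datatype": "BYTES",
                "data": [prompt],
            }
        ],
        "parameters": {
            "max_tokens": openai_req.get("max_tokens", 256),
            "temperature": openai_req.get("temperature", 1.0),
        },
    }


req = {"model": "llama-3-8b", "messages": [{"role": "user", "content": "Hello!"}]}
print(openai_to_kserve(req)["inputs"][0]["data"])  # ['user: Hello!']
```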

### Multi-Tenant Architecture
- **Tenant Namespaces**: `tenant-a`, `tenant-b`, `tenant-c` with complete resource isolation
@@ -155,19 +184,24 @@ This platform implements a **dual-gateway architecture** where external traffic

### Serverless Model Serving
- **KServe InferenceServices**: Auto-scaling model endpoints with scale-to-zero capabilities
- **Supported Frameworks**: Scikit-learn, PyTorch, TensorFlow, Hugging Face transformers
- **Supported Frameworks**: Scikit-learn, PyTorch, TensorFlow, Hugging Face transformers, vLLM, TGI
- **OpenAI-Compatible Models**: Support for chat completions, completions, and embeddings endpoints
- **Traffic Management**: Canary deployments, A/B testing, and blue-green deployment patterns

## Key Directories

### Configuration Structure
- `configs/envoy-gateway/` - AI Gateway configurations (GatewayClass, HTTPRoute, Security Policies, Rate Limiting)
- `configs/envoy-gateway/` - AI Gateway configurations (GatewayClass, HTTPRoute, Security Policies, Rate Limiting, EnvoyExtensionPolicy)
- `configs/istio/` - Service mesh policies, authorization rules, and routing configurations
- `configs/kserve/models/` - Model deployment specifications for various ML frameworks
- `configs/auth/` - JWT server deployment and authentication configuration
- `configs/management/` - Management service deployment configuration
- `configs/observability/` - Grafana dashboards and monitoring configuration

### Root Configuration Files
- `envoydump.json` / `envoydump-latest.json` - Envoy configuration dumps for debugging
- `httproute.correct` - Sample HTTPRoute with URLRewrite and header modification filters

### Scripts Directory
- `scripts/bootstrap.sh` - **Primary deployment script** for complete platform setup
- `scripts/demo.sh` - **Interactive demo runner** with multiple scenarios
@@ -183,8 +217,11 @@ This platform implements a **dual-gateway architecture** where external traffic
- `management/package.json` - NPM scripts for React UI development
- `management/publishing.go` - Model publishing and management service
- `management/types.go` - Type definitions including PublishConfig and PublishedModel
- `management/test_execution.go` - Test execution service for interactive model testing
- `management/ui/src/components/PublishingForm.js` - React component for model publishing
- `management/ui/src/components/PublishingList.js` - React component for managing published models
- `management/ui/src/components/InferenceTest.js` - React component for interactive model testing
- `scripts/retest.sh` - Quick restart and port-forward for development

### Examples & Documentation
- `examples/serverless/` - Serverless configuration examples and templates
@@ -227,4 +264,7 @@ JWT tokens are required for model inference requests. The platform includes a JW
- **Shell-Driven Deployment**: All automation implemented via bash scripts
- **Production Patterns**: Demonstrates enterprise-grade AI/ML deployment practices with security, observability, and multi-tenancy
- **Management Service**: Full-stack application (Go backend + React frontend) for platform administration
- **Dual-Gateway Architecture**: External traffic flows through AI Gateway first, then Istio Gateway
- **OpenAI Compatibility**: Automatic protocol translation for OpenAI → KServe format
- **Model-Aware Routing**: Use `x-ai-eg-model` header for efficient model selection
- **Token-Based Rate Limiting**: LLM models support token-based rate limiting alongside request-based limits
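Model-aware routing via `x-ai-eg-model` amounts to a header-keyed lookup into a backend table. A hedged sketch of that idea (the platform actually routes with HTTPRoute header matches; the hostnames below are illustrative):

```python
def route(headers: dict, model_table: dict) -> str:
    """Pick a backend service from the x-ai-eg-model header (sketch only)."""
    model = headers.get("x-ai-eg-model")
    if model is None:
        raise ValueError("missing x-ai-eg-model header")
    if model not in model_table:
        raise ValueError(f"unknown model: {model}")
    return model_table[model]


# Illustrative mapping of model names to in-cluster predictor hostnames.
table = {
    "llama-3-8b": "llama-3-8b-predictor.tenant-a.svc.cluster.local",
    "sklearn-iris": "sklearn-iris-predictor.tenant-b.svc.cluster.local",
}
print(route({"x-ai-eg-model": "llama-3-8b"}, table))
# llama-3-8b-predictor.tenant-a.svc.cluster.local
```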
181 changes: 181 additions & 0 deletions GOALS.md
@@ -0,0 +1,181 @@
# Goals and Vision

## 🎯 Project Mission

**Inference-in-a-Box** aims to provide a production-ready, enterprise-grade AI/ML inference platform that demonstrates modern cloud-native deployment patterns, best practices, and comprehensive observability for AI workloads.

## 🚀 Primary Goals

### 1. **Production-Ready AI Infrastructure Demonstration**
- Showcase how to deploy AI/ML models at scale using cloud-native technologies
- Demonstrate enterprise-grade patterns for model serving, security, and observability
- Provide a reference architecture for AI infrastructure teams

### 2. **Educational Platform**
- Serve as a learning resource for platform engineers, DevOps teams, and AI practitioners
- Demonstrate the integration of multiple cloud-native technologies in a cohesive AI platform
- Provide hands-on examples of AI/ML deployment challenges and solutions

### 3. **Technology Integration Showcase**
- Demonstrate how modern cloud-native tools work together for AI workloads
- Show real-world integration patterns between service mesh, gateways, and AI serving frameworks
- Provide examples of advanced networking, security, and observability for AI systems

## 🏗️ Target State Architecture

### Core Technology Stack
- **Kubernetes**: Container orchestration and workload management
- **Istio Service Mesh**: Zero-trust networking, mTLS, and traffic management
- **Envoy AI Gateway**: AI-specific routing, protocol translation, and request handling
- **KServe**: Kubernetes-native serverless model serving with auto-scaling
- **Knative**: Serverless framework enabling scale-to-zero capabilities
- **Prometheus + Grafana**: Comprehensive monitoring and observability

### Key Architectural Patterns

#### **Dual-Gateway Design**
```
External Traffic → Envoy AI Gateway → Istio Gateway → KServe Models
                      (Tier-1)           (Tier-2)        (Serving)
```
- **Tier-1 (AI Gateway)**: AI-specific routing, JWT authentication, OpenAI protocol translation
- **Tier-2 (Service Mesh)**: mTLS encryption, traffic policies, service discovery
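Tier-1 JWT authentication can be sketched with a stdlib-only HS256 verifier. This is illustrative only: the platform's jwt-server and gateway security policy perform the real validation, and the claim set shown is an assumption:

```python
import base64
import hashlib
import hmac
import json


def _b64url_decode(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Build a demo HS256 JWT (for the sketch below, not a production signer)."""
    def enc(obj):
        return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()
    head, body = enc({"alg": "HS256", "typ": "JWT"}), enc(claims)
    sig = hmac.new(secret, f"{head}.{body}".encode(), hashlib.sha256).digest()
    return f"{head}.{body}." + base64.urlsafe_b64encode(sig).rstrip(b"=").decode()


def verify_hs256(token: str, secret: bytes) -> dict:
    """Check the signature and return the claims, as tier-1 does conceptually."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise PermissionError("bad signature")
    return json.loads(_b64url_decode(payload_b64))


tok = sign_hs256({"sub": "user-1", "tenant": "tenant-a"}, b"demo-secret")
print(verify_hs256(tok, b"demo-secret")["tenant"])  # tenant-a
```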

#### **Multi-Tenant Architecture**
- Complete namespace isolation (`tenant-a`, `tenant-b`, `tenant-c`)
- Separate resource quotas, policies, and observability scopes
- Tenant-specific security boundaries with Istio authorization policies
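The tenant boundary reduces to a simple invariant: a request's JWT tenant claim must match the namespace of the model it targets. A minimal sketch, assuming a `tenant` claim (Istio AuthorizationPolicy enforces the equivalent check in-mesh):

```python
TENANT_NAMESPACES = {"tenant-a", "tenant-b", "tenant-c"}


def authorize(jwt_claims: dict, target_namespace: str) -> bool:
    """Allow a request only into the namespace matching its tenant claim."""
    tenant = jwt_claims.get("tenant")
    return tenant in TENANT_NAMESPACES and tenant == target_namespace


print(authorize({"tenant": "tenant-a"}, "tenant-a"))  # True
print(authorize({"tenant": "tenant-a"}, "tenant-b"))  # False
```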

#### **Serverless Model Serving**
- Auto-scaling from zero to handle varying workloads
- Support for multiple ML frameworks (Scikit-learn, PyTorch, TensorFlow, Hugging Face)
- OpenAI-compatible API endpoints for LLM models
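Scale-to-zero autoscaling can be approximated by Knative's concurrency-based sizing rule: desired replicas are roughly in-flight requests divided by the per-pod concurrency target, and zero traffic scales the revision to zero. A simplified sketch (real KPA behavior adds panic windows, averaging, and activator handoff):

```python
import math


def desired_replicas(in_flight: float, target_concurrency: float,
                     max_replicas: int = 10) -> int:
    """Simplified Knative-KPA-style replica calculation."""
    if in_flight <= 0:
        return 0  # scale-to-zero when no requests are in flight
    return min(max_replicas, math.ceil(in_flight / target_concurrency))


print(desired_replicas(0, 10))    # 0
print(desired_replicas(25, 10))   # 3
print(desired_replicas(500, 10))  # 10 (capped at max_replicas)
```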

## 🎯 Target Capabilities

### **For Platform Engineers**
- **Infrastructure-as-Code**: Complete platform deployment via scripts and configurations
- **Observability**: Comprehensive monitoring, logging, and tracing for AI workloads
- **Security**: Zero-trust networking, JWT authentication, and authorization policies
- **Scalability**: Auto-scaling capabilities with performance optimization

### **For AI/ML Engineers**
- **Model Publishing**: Web-based interface for publishing and managing models
- **Multiple Protocols**: Support for traditional KServe and OpenAI-compatible APIs
- **Testing Framework**: Built-in testing capabilities with DNS resolution override
- **Documentation**: Auto-generated API documentation and examples

### **For DevOps Teams**
- **CI/CD Integration**: Automated testing and deployment workflows
- **Monitoring**: Real-time metrics, alerts, and performance dashboards
- **Security**: Comprehensive security policies and compliance patterns
- **Multi-tenancy**: Isolated environments for different teams or applications

## 🌟 Unique Value Propositions

### 1. **Complete End-to-End Solution**
Unlike fragmented tutorials or partial implementations, this project provides a complete, working AI inference platform that demonstrates real-world enterprise patterns.

### 2. **Production Patterns**
- Demonstrates actual production concerns: security, scalability, observability, multi-tenancy
- Shows how to handle edge cases and operational challenges
- Provides troubleshooting guides and best practices

### 3. **OpenAI Compatibility**
- Seamless integration with OpenAI client libraries
- Protocol translation from OpenAI format to KServe format
- Support for chat completions, embeddings, and model listing endpoints

### 4. **Advanced Networking**
- Sophisticated traffic management with canary deployments and A/B testing
- Advanced DNS resolution capabilities for testing scenarios
- Custom routing based on model types and tenant requirements

## 🎯 Success Metrics

### **User Experience Metrics**
- **Ease of Deployment**: One-command bootstrap process
- **Documentation Quality**: Complete setup and usage documentation
- **Developer Experience**: Intuitive web interface, comprehensive testing tools
- **Learning Value**: Clear architectural patterns and implementation examples

## 🚧 Current Status vs Target State

### ✅ **Achieved**
- Complete dual-gateway architecture implementation
- Multi-tenant namespace isolation and security policies
- OpenAI-compatible API with protocol translation
- Comprehensive observability stack (Prometheus, Grafana, Kiali, Jaeger)
- Web-based management interface with model publishing
- Advanced testing capabilities with DNS resolution override
- Auto-scaling model serving with KServe and Knative
- Security implementation with JWT authentication and Istio policies

### 🔄 **In Progress**
- Enhanced model lifecycle management
- Advanced rate limiting and quota management
- Expanded model framework support
- Performance optimization and tuning

### 🎯 **Future Roadmap**
- **Advanced AI Features**: Model versioning, A/B testing, canary deployments
- **Enhanced Observability**: AI-specific metrics, model performance tracking
- **Extended Protocols**: Support for additional AI protocols and frameworks
- **Enterprise Features**: RBAC, audit logging, compliance reporting
- **Multi-Cloud**: Deployment patterns for AWS, GCP, Azure
- **Edge Computing**: Edge deployment scenarios and patterns

## 🎓 Learning Outcomes

By exploring and deploying this platform, users will gain practical experience with:

### **Kubernetes Ecosystem**
- Advanced Kubernetes patterns for AI workloads
- Service mesh implementation and configuration
- Gateway and ingress management
- Custom resource definitions and operators

### **AI/ML Operations**
- Model serving and lifecycle management
- Auto-scaling strategies for AI workloads
- Performance monitoring and optimization
- Protocol translation and API gateway patterns

### **Cloud-Native Security**
- Zero-trust networking implementation
- JWT-based authentication and authorization
- mTLS configuration and certificate management
- Multi-tenant security boundaries

### **Observability and Operations**
- Comprehensive monitoring setup for AI systems
- Distributed tracing for request flows
- Performance metrics and alerting
- Troubleshooting and debugging techniques

## 🤝 Community and Contribution

### **Target Audience**
- **Platform Engineers** building AI infrastructure
- **DevOps Engineers** managing AI/ML workloads
- **AI/ML Engineers** deploying models at scale
- **Students and Educators** learning cloud-native AI patterns

### **Contribution Areas**
- Additional model framework integrations
- Enhanced security patterns and policies
- Performance optimization and benchmarking
- Documentation and tutorial improvements
- Testing framework enhancements

## 📈 Strategic Impact

This project serves as a bridge between theoretical cloud-native AI concepts and practical, production-ready implementations. It accelerates AI platform adoption by providing:

1. **Proven Patterns**: Battle-tested architectural patterns and configurations
2. **Reduced Risk**: Validated technology integrations and security models
3. **Faster Time-to-Market**: Complete reference implementation reducing development time
4. **Knowledge Transfer**: Comprehensive documentation and examples for team learning
5. **Operational Excellence**: Built-in observability, monitoring, and troubleshooting capabilities

By providing this comprehensive platform, we enable organizations to focus on their AI/ML applications rather than infrastructure complexity, ultimately accelerating AI adoption and innovation across the industry.