
Commit 1106ade: Merge pull request smarunich#15 from smarunich/wave10 (wave10)

2 parents 297c441 + aca301b

34 files changed: +5689 −1097 lines

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1 +1,2 @@
 ./CLAUDE.md
+istio-*/

CLAUDE.md

Lines changed: 53 additions & 13 deletions
@@ -1,7 +1,11 @@
 # CLAUDE.md
 
+> **📋 Navigation:** [🏠 Main README](README.md) | [🎯 Goals & Vision](GOALS.md) | [🚀 Getting Started](docs/getting-started.md) | [📖 Usage Guide](docs/usage.md) | [🏗️ Architecture](docs/architecture.md)
+
 This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
 
+> **🎯 Project Context:** This project demonstrates enterprise-grade AI/ML inference patterns. See [GOALS.md](GOALS.md) for complete vision and objectives.
+
 ## Project Overview
 
 **Inference-in-a-Box** is a comprehensive Kubernetes-based AI/ML inference platform demonstration showcasing enterprise-grade model serving using cloud-native technologies. It's an infrastructure-as-code project demonstrating production-ready AI/ML deployment patterns with Envoy AI Gateway, Istio service mesh, KServe, and comprehensive observability.
@@ -72,6 +76,12 @@ curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
   -d '{"config": {"tenantId": "tenant-a", "publicHostname": "api.router.inference-in-a-box"}}' \
   http://localhost:8085/api/models/my-model/publish
 
+# Publish OpenAI-compatible model with token-based rate limiting
+curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{"config": {"tenantId": "tenant-a", "publicHostname": "api.router.inference-in-a-box", "modelType": "openai", "rateLimiting": {"tokensPerHour": 100000}}}' \
+  http://localhost:8085/api/models/llama-3-8b/publish
+
 # Update published model configuration
 curl -X PUT -H "Authorization: Bearer $ADMIN_TOKEN" \
   -H "Content-Type: application/json" \
@@ -101,6 +111,19 @@ curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
 
 # Get JWT tokens for testing
 ./scripts/get-jwt-tokens.sh
+
+# Test OpenAI-compatible model
+export AI_GATEWAY_URL="http://localhost:8080"
+export JWT_TOKEN="<your-jwt-token>"
+
+# Chat completion request
+curl -H "Authorization: Bearer $JWT_TOKEN" \
+  -H "x-ai-eg-model: llama-3-8b" \
+  $AI_GATEWAY_URL/v1/chat/completions \
+  -d '{
+    "model": "llama-3-8b",
+    "messages": [{"role": "user", "content": "Hello!"}]
+  }'
 ```
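
The chat-completion test added above can also be driven from Python instead of curl. A minimal sketch; the gateway URL, header name, and model ID are taken from the commands above, and actually sending the request requires the port-forward and a valid JWT:

```python
import json

AI_GATEWAY_URL = "http://localhost:8080"   # from the port-forward commands above
JWT_TOKEN = "<your-jwt-token>"             # obtain via ./scripts/get-jwt-tokens.sh

def build_chat_request(model: str, user_message: str, token: str = JWT_TOKEN):
    """Build URL, headers, and body for an OpenAI-style chat completion.

    The x-ai-eg-model header lets the AI Gateway route the request
    without having to parse the JSON body.
    """
    url = f"{AI_GATEWAY_URL}/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {token}",
        "x-ai-eg-model": model,  # model-aware routing header
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    })
    return url, headers, body

url, headers, body = build_chat_request("llama-3-8b", "Hello!")
# To actually send it: import requests; requests.post(url, headers=headers, data=body)
print(url)  # http://localhost:8080/v1/chat/completions
```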

### Build & Container Management
@@ -120,14 +143,15 @@
 # Management Service UI & API
 kubectl port-forward svc/management-service 8085:80
 
-# Observability Stack
-kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
-kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
-kubectl port-forward -n monitoring svc/kiali 20001:20001
+# Observability Stack (see docs/usage.md for complete service access)
+kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80                       # Grafana
+kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090  # Prometheus
+kubectl port-forward -n monitoring svc/kiali 20001:20001                                # Kiali
 
-# AI Gateway & Auth
-kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80
-kubectl port-forward -n default svc/jwt-server 8081:8080
+# Service Access (see docs/usage.md for complete reference)
+kubectl port-forward -n envoy-gateway-system svc/envoy-ai-gateway 8080:80  # AI Gateway
+kubectl port-forward svc/management-service 8085:80                        # Management UI/API
+kubectl port-forward -n default svc/jwt-server 8081:8080                   # JWT Server
 ```
 
 ## Architecture
@@ -139,14 +163,19 @@ This platform implements a **dual-gateway architecture** where external traffic
 
 ### Technology Stack Integration
 - **Kind Cluster**: Local Kubernetes cluster (`inference-in-a-box`)
-- **Envoy AI Gateway**: AI-specific gateway with JWT validation and model routing
+- **Envoy AI Gateway**: AI-specific gateway with JWT validation, model routing, and OpenAI API compatibility
+- **EnvoyExtensionPolicy**: External processor configuration for AI-specific routing
+- **Model-aware routing**: Using the `x-ai-eg-model` header for efficient model selection
+- **Protocol translation**: OpenAI to KServe format conversion
 - **Istio Service Mesh**: Zero-trust networking with automatic mTLS between services
 - **KServe**: Kubernetes-native serverless model serving with auto-scaling
 - **Knative**: Serverless framework enabling scale-to-zero capabilities
 - **Management Service**: Go backend with embedded React frontend for platform administration
 - **Model Publishing**: Full-featured model publishing and management system
 - **Public Hostname Configuration**: Configurable external access via `api.router.inference-in-a-box`
-- **Rate Limiting**: Per-model rate limiting with configurable limits
+- **Rate Limiting**: Per-model rate limiting with configurable limits (requests and tokens)
+- **OpenAI Compatibility**: Automatic detection and configuration for LLM models
+- **Model Testing**: Interactive inference testing with support for both traditional and OpenAI formats
 
 ### Multi-Tenant Architecture
 - **Tenant Namespaces**: `tenant-a`, `tenant-b`, `tenant-c` with complete resource isolation
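
The "Protocol translation" bullet added in the hunk above refers to converting OpenAI-format requests into KServe predict payloads. A minimal illustrative sketch; the field mapping below is an assumption for illustration only, since the actual conversion is performed by the gateway's external processor:

```python
def openai_to_kserve(openai_body: dict) -> dict:
    """Illustrative OpenAI chat request -> KServe v1-style predict payload.

    Assumption: messages are flattened into a single prompt string and
    wrapped in the v1 "instances" envelope; the real gateway logic differs.
    """
    prompt = "\n".join(
        f"{m['role']}: {m['content']}" for m in openai_body.get("messages", [])
    )
    return {"instances": [{"prompt": prompt}]}

req = {"model": "llama-3-8b", "messages": [{"role": "user", "content": "Hello!"}]}
print(openai_to_kserve(req))  # {'instances': [{'prompt': 'user: Hello!'}]}
```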
@@ -155,19 +184,24 @@
 
 ### Serverless Model Serving
 - **KServe InferenceServices**: Auto-scaling model endpoints with scale-to-zero capabilities
-- **Supported Frameworks**: Scikit-learn, PyTorch, TensorFlow, Hugging Face transformers
+- **Supported Frameworks**: Scikit-learn, PyTorch, TensorFlow, Hugging Face transformers, vLLM, TGI
+- **OpenAI-Compatible Models**: Support for chat completions, completions, and embeddings endpoints
 - **Traffic Management**: Canary deployments, A/B testing, and blue-green deployment patterns
 
 ## Key Directories
 
 ### Configuration Structure
-- `configs/envoy-gateway/` - AI Gateway configurations (GatewayClass, HTTPRoute, Security Policies, Rate Limiting)
+- `configs/envoy-gateway/` - AI Gateway configurations (GatewayClass, HTTPRoute, Security Policies, Rate Limiting, EnvoyExtensionPolicy)
 - `configs/istio/` - Service mesh policies, authorization rules, and routing configurations
 - `configs/kserve/models/` - Model deployment specifications for various ML frameworks
 - `configs/auth/` - JWT server deployment and authentication configuration
 - `configs/management/` - Management service deployment configuration
 - `configs/observability/` - Grafana dashboards and monitoring configuration
 
+### Root Configuration Files
+- `envoydump.json` / `envoydump-latest.json` - Envoy configuration dumps for debugging
+- `httproute.correct` - Sample HTTPRoute with URLRewrite and header modification filters
+
 ### Scripts Directory
 - `scripts/bootstrap.sh` - **Primary deployment script** for complete platform setup
 - `scripts/demo.sh` - **Interactive demo runner** with multiple scenarios
@@ -183,8 +217,11 @@ This platform implements a **dual-gateway architecture** where external traffic
 - `management/package.json` - NPM scripts for React UI development
 - `management/publishing.go` - Model publishing and management service
 - `management/types.go` - Type definitions including PublishConfig and PublishedModel
+- `management/test_execution.go` - Test execution service for interactive model testing
 - `management/ui/src/components/PublishingForm.js` - React component for model publishing
-- `management/ui/src/components/PublishingList.js` - React component for managing published models
+- `management/ui/src/components/PublishingList.js` - React component for managing published models
+- `management/ui/src/components/InferenceTest.js` - React component for interactive model testing
+- `scripts/retest.sh` - Quick restart and port-forward for development
 
 ### Examples & Documentation
 - `examples/serverless/` - Serverless configuration examples and templates
@@ -227,4 +264,7 @@ JWT tokens are required for model inference requests. The platform includes a JW
 - **Shell-Driven Deployment**: All automation implemented via bash scripts
 - **Production Patterns**: Demonstrates enterprise-grade AI/ML deployment practices with security, observability, and multi-tenancy
 - **Management Service**: Full-stack application (Go backend + React frontend) for platform administration
-- **Dual-Gateway Architecture**: External traffic flows through AI Gateway first, then Istio Gateway
+- **Dual-Gateway Architecture**: External traffic flows through AI Gateway first, then Istio Gateway
+- **OpenAI Compatibility**: Automatic protocol translation for OpenAI → KServe format
+- **Model-Aware Routing**: Use `x-ai-eg-model` header for efficient model selection
+- **Token-Based Rate Limiting**: LLM models support token-based rate limiting alongside request-based limits
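
A `tokensPerHour` limit like the one in the publish config above can be pictured as an hourly token budget. The sketch below uses fixed-window accounting, which is an assumption for illustration, not the gateway's actual algorithm:

```python
import time

class TokenBudget:
    """Illustrative fixed-window model of token-based rate limiting
    (e.g. rateLimiting.tokensPerHour from the publish config)."""

    def __init__(self, tokens_per_hour: int, clock=time.time):
        self.limit = tokens_per_hour
        self.clock = clock
        self.window_start = clock()
        self.used = 0

    def allow(self, tokens: int) -> bool:
        now = self.clock()
        if now - self.window_start >= 3600:  # new hour: reset the budget
            self.window_start = now
            self.used = 0
        if self.used + tokens > self.limit:
            return False  # request would exceed this hour's token budget
        self.used += tokens
        return True

budget = TokenBudget(tokens_per_hour=100_000)
print(budget.allow(60_000))  # True
print(budget.allow(60_000))  # False: 120k would exceed the 100k hourly budget
```

Requests remain subject to request-based limits as well; a real limiter also has to estimate token usage before the response completes.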

GOALS.md

Lines changed: 181 additions & 0 deletions
@@ -0,0 +1,181 @@
# Goals and Vision

## 🎯 Project Mission

**Inference-in-a-Box** aims to demonstrate and provide a production-ready, enterprise-grade AI/ML inference platform that showcases modern cloud-native deployment patterns, best practices, and comprehensive observability for AI workloads.

## 🚀 Primary Goals

### 1. **Production-Ready AI Infrastructure Demonstration**
- Showcase how to deploy AI/ML models at scale using cloud-native technologies
- Demonstrate enterprise-grade patterns for model serving, security, and observability
- Provide a reference architecture for AI infrastructure teams

### 2. **Educational Platform**
- Serve as a learning resource for platform engineers, DevOps teams, and AI practitioners
- Demonstrate the integration of multiple cloud-native technologies in a cohesive AI platform
- Provide hands-on examples of AI/ML deployment challenges and solutions

### 3. **Technology Integration Showcase**
- Demonstrate how modern cloud-native tools work together for AI workloads
- Show real-world integration patterns between service mesh, gateways, and AI serving frameworks
- Provide examples of advanced networking, security, and observability for AI systems

## 🏗️ Target State Architecture

### Core Technology Stack
- **Kubernetes**: Container orchestration and workload management
- **Istio Service Mesh**: Zero-trust networking, mTLS, and traffic management
- **Envoy AI Gateway**: AI-specific routing, protocol translation, and request handling
- **KServe**: Kubernetes-native serverless model serving with auto-scaling
- **Knative**: Serverless framework enabling scale-to-zero capabilities
- **Prometheus + Grafana**: Comprehensive monitoring and observability

### Key Architectural Patterns

#### **Dual-Gateway Design**
```
External Traffic → Envoy AI Gateway → Istio Gateway → KServe Models
                      (Tier-1)          (Tier-2)        (Serving)
```
- **Tier-1 (AI Gateway)**: AI-specific routing, JWT authentication, OpenAI protocol translation
- **Tier-2 (Service Mesh)**: mTLS encryption, traffic policies, service discovery
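
The Tier-1 step above can be sketched as a header-based lookup: authenticate, read `x-ai-eg-model`, and pick an in-mesh backend. The routing table and service hostnames below are hypothetical, for illustration only:

```python
# Hypothetical Tier-1 routing table: model name -> in-mesh KServe service.
ROUTES = {
    "llama-3-8b": "llama-3-8b.tenant-a.svc.cluster.local",
    "sklearn-iris": "sklearn-iris.tenant-b.svc.cluster.local",
}

def route(headers: dict) -> str:
    """Pick a backend from the x-ai-eg-model header (illustrative only)."""
    if not any(k.lower() == "authorization" for k in headers):
        raise PermissionError("missing JWT")  # Tier-1 rejects unauthenticated traffic
    model = headers.get("x-ai-eg-model")
    if model not in ROUTES:
        raise LookupError(f"unknown model: {model}")
    return ROUTES[model]  # traffic then enters the mesh (Tier-2) over mTLS

print(route({"Authorization": "Bearer <jwt>", "x-ai-eg-model": "llama-3-8b"}))
# llama-3-8b.tenant-a.svc.cluster.local
```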

#### **Multi-Tenant Architecture**
- Complete namespace isolation (`tenant-a`, `tenant-b`, `tenant-c`)
- Separate resource quotas, policies, and observability scopes
- Tenant-specific security boundaries with Istio authorization policies

#### **Serverless Model Serving**
- Auto-scaling from zero to handle varying workloads
- Support for multiple ML frameworks (Scikit-learn, PyTorch, TensorFlow, Hugging Face)
- OpenAI-compatible API endpoints for LLM models

## 🎯 Target Capabilities

### **For Platform Engineers**
- **Infrastructure-as-Code**: Complete platform deployment via scripts and configurations
- **Observability**: Comprehensive monitoring, logging, and tracing for AI workloads
- **Security**: Zero-trust networking, JWT authentication, and authorization policies
- **Scalability**: Auto-scaling capabilities with performance optimization

### **For AI/ML Engineers**
- **Model Publishing**: Web-based interface for publishing and managing models
- **Multiple Protocols**: Support for traditional KServe and OpenAI-compatible APIs
- **Testing Framework**: Built-in testing capabilities with DNS resolution override
- **Documentation**: Auto-generated API documentation and examples

### **For DevOps Teams**
- **CI/CD Integration**: Automated testing and deployment workflows
- **Monitoring**: Real-time metrics, alerts, and performance dashboards
- **Security**: Comprehensive security policies and compliance patterns
- **Multi-tenancy**: Isolated environments for different teams or applications

## 🌟 Unique Value Propositions

### 1. **Complete End-to-End Solution**
Unlike fragmented tutorials or partial implementations, this project provides a complete, working AI inference platform that demonstrates real-world enterprise patterns.

### 2. **Production Patterns**
- Demonstrates actual production concerns: security, scalability, observability, multi-tenancy
- Shows how to handle edge cases and operational challenges
- Provides troubleshooting guides and best practices

### 3. **OpenAI Compatibility**
- Seamless integration with OpenAI client libraries
- Protocol translation from OpenAI format to KServe format
- Support for chat completions, embeddings, and model listing endpoints

### 4. **Advanced Networking**
- Sophisticated traffic management with canary deployments and A/B testing
- Advanced DNS resolution capabilities for testing scenarios
- Custom routing based on model types and tenant requirements

## 🎯 Success Metrics

### **User Experience Metrics**
- **Ease of Deployment**: One-command bootstrap process
- **Documentation Quality**: Complete setup and usage documentation
- **Developer Experience**: Intuitive web interface, comprehensive testing tools
- **Learning Value**: Clear architectural patterns and implementation examples

## 🚧 Current Status vs Target State

### ✅ **Achieved**
- Complete dual-gateway architecture implementation
- Multi-tenant namespace isolation and security policies
- OpenAI-compatible API with protocol translation
- Comprehensive observability stack (Prometheus, Grafana, Kiali, Jaeger)
- Web-based management interface with model publishing
- Advanced testing capabilities with DNS resolution override
- Auto-scaling model serving with KServe and Knative
- Security implementation with JWT authentication and Istio policies

### 🔄 **In Progress**
- Enhanced model lifecycle management
- Advanced rate limiting and quota management
- Expanded model framework support
- Performance optimization and tuning

### 🎯 **Future Roadmap**
- **Advanced AI Features**: Model versioning, A/B testing, canary deployments
- **Enhanced Observability**: AI-specific metrics, model performance tracking
- **Extended Protocols**: Support for additional AI protocols and frameworks
- **Enterprise Features**: RBAC, audit logging, compliance reporting
- **Multi-Cloud**: Deployment patterns for AWS, GCP, Azure
- **Edge Computing**: Edge deployment scenarios and patterns

## 🎓 Learning Outcomes

By exploring and deploying this platform, users will gain practical experience with:

### **Kubernetes Ecosystem**
- Advanced Kubernetes patterns for AI workloads
- Service mesh implementation and configuration
- Gateway and ingress management
- Custom resource definitions and operators

### **AI/ML Operations**
- Model serving and lifecycle management
- Auto-scaling strategies for AI workloads
- Performance monitoring and optimization
- Protocol translation and API gateway patterns

### **Cloud-Native Security**
- Zero-trust networking implementation
- JWT-based authentication and authorization
- mTLS configuration and certificate management
- Multi-tenant security boundaries

### **Observability and Operations**
- Comprehensive monitoring setup for AI systems
- Distributed tracing for request flows
- Performance metrics and alerting
- Troubleshooting and debugging techniques

## 🤝 Community and Contribution

### **Target Audience**
- **Platform Engineers** building AI infrastructure
- **DevOps Engineers** managing AI/ML workloads
- **AI/ML Engineers** deploying models at scale
- **Students and Educators** learning cloud-native AI patterns

### **Contribution Areas**
- Additional model framework integrations
- Enhanced security patterns and policies
- Performance optimization and benchmarking
- Documentation and tutorial improvements
- Testing framework enhancements

## 📈 Strategic Impact

This project serves as a bridge between theoretical cloud-native AI concepts and practical, production-ready implementations. It accelerates AI platform adoption by providing:

1. **Proven Patterns**: Battle-tested architectural patterns and configurations
2. **Reduced Risk**: Validated technology integrations and security models
3. **Faster Time-to-Market**: Complete reference implementation reducing development time
4. **Knowledge Transfer**: Comprehensive documentation and examples for team learning
5. **Operational Excellence**: Built-in observability, monitoring, and troubleshooting capabilities

By providing this comprehensive platform, we enable organizations to focus on their AI/ML applications rather than infrastructure complexity, ultimately accelerating AI adoption and innovation across the industry.
