Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -1 +1,2 @@
./CLAUDE.md
istio-*/
66 changes: 53 additions & 13 deletions CLAUDE.md
@@ -1,7 +1,11 @@
# CLAUDE.md

> **📋 Navigation:** [🏠 Main README](README.md) • [🎯 Goals & Vision](GOALS.md) • [🚀 Getting Started](docs/getting-started.md) • [📖 Usage Guide](docs/usage.md) • [🏗️ Architecture](docs/architecture.md)

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

> **🎯 Project Context:** This project demonstrates enterprise-grade AI/ML inference patterns. See [GOALS.md](GOALS.md) for complete vision and objectives.

## Project Overview

**Inference-in-a-Box** is a comprehensive Kubernetes-based demonstration of enterprise-grade AI/ML model serving built on cloud-native technologies. It is an infrastructure-as-code project showcasing production-ready deployment patterns with Envoy AI Gateway, Istio service mesh, KServe, and comprehensive observability.
@@ -72,6 +76,12 @@ curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
-d '{"config": {"tenantId": "tenant-a", "publicHostname": "api.router.inference-in-a-box"}}' \
http://localhost:8085/api/models/my-model/publish

# Publish OpenAI-compatible model with token-based rate limiting
curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"config": {"tenantId": "tenant-a", "publicHostname": "api.router.inference-in-a-box", "modelType": "openai", "rateLimiting": {"tokensPerHour": 100000}}}' \
http://localhost:8085/api/models/llama-3-8b/publish

# Update published model configuration
curl -X PUT -H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
@@ -101,6 +111,19 @@ curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \

# Get JWT tokens for testing
./scripts/get-jwt-tokens.sh

# Test OpenAI-compatible model
export AI_GATEWAY_URL="http://localhost:8080"
export JWT_TOKEN="<your-jwt-token>"

# Chat completion request
curl -H "Authorization: Bearer $JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -H "x-ai-eg-model: llama-3-8b" \
  $AI_GATEWAY_URL/v1/chat/completions \
  -d '{
    "model": "llama-3-8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
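The `tokensPerHour` limit in the publish call above can be pictured as a sliding-window budget over consumed LLM tokens. A minimal Python sketch of that idea, assuming per-tenant/model keying and a one-hour window (the platform's gateway enforces the real limits, and the exact semantics may differ):

```python
import time
from collections import defaultdict, deque


class TokenRateLimiter:
    """Sliding-window limiter over LLM tokens consumed per tenant/model key.

    Hypothetical sketch of `tokensPerHour` semantics; not the gateway's
    actual implementation.
    """

    def __init__(self, tokens_per_hour, window_s=3600):
        self.limit = tokens_per_hour
        self.window_s = window_s
        self.events = defaultdict(deque)  # key -> deque of (timestamp, tokens)

    def allow(self, key, tokens, now=None):
        now = time.monotonic() if now is None else now
        q = self.events[key]
        # Drop events that have aged out of the window.
        while q and now - q[0][0] >= self.window_s:
            q.popleft()
        used = sum(t for _, t in q)
        if used + tokens > self.limit:
            return False
        q.append((now, tokens))
        return True


limiter = TokenRateLimiter(tokens_per_hour=100_000)
print(limiter.allow("tenant-a/llama-3-8b", 60_000, now=0.0))    # True
print(limiter.allow("tenant-a/llama-3-8b", 50_000, now=10.0))   # False: would exceed 100k
print(limiter.allow("tenant-a/llama-3-8b", 50_000, now=3700.0)) # True: first event expired
```

The `key` combines tenant and model so each published model gets an independent budget, mirroring the per-model rate limiting described below.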

### Build & Container Management
@@ -120,14 +143,15 @@ curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
# Management Service UI & API
kubectl port-forward svc/management-service 8085:80

# Observability Stack
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
kubectl port-forward -n monitoring svc/kiali 20001:20001
# Observability Stack (see docs/usage.md for complete service access)
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 # Grafana
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 # Prometheus
kubectl port-forward -n monitoring svc/kiali 20001:20001 # Kiali

# AI Gateway & Auth
kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80
kubectl port-forward -n default svc/jwt-server 8081:8080
# Service Access (see docs/usage.md for complete reference)
kubectl port-forward -n envoy-gateway-system svc/envoy-ai-gateway 8080:80 # AI Gateway
kubectl port-forward svc/management-service 8085:80 # Management UI/API
kubectl port-forward -n default svc/jwt-server 8081:8080 # JWT Server
```

## Architecture
@@ -139,14 +163,19 @@ This platform implements a **dual-gateway architecture** where external traffic

### Technology Stack Integration
- **Kind Cluster**: Local Kubernetes cluster (`inference-in-a-box`)
- **Envoy AI Gateway**: AI-specific gateway with JWT validation and model routing
- **Envoy AI Gateway**: AI-specific gateway with JWT validation, model routing, and OpenAI API compatibility
- **EnvoyExtensionPolicy**: External processor configuration for AI-specific routing
- **Model-aware routing**: Using x-ai-eg-model header for efficient model selection
- **Protocol translation**: OpenAI to KServe format conversion
- **Istio Service Mesh**: Zero-trust networking with automatic mTLS between services
- **KServe**: Kubernetes-native serverless model serving with auto-scaling
- **Knative**: Serverless framework enabling scale-to-zero capabilities
- **Management Service**: Go backend with embedded React frontend for platform administration
- **Model Publishing**: Full-featured model publishing and management system
- **Public Hostname Configuration**: Configurable external access via `api.router.inference-in-a-box`
- **Rate Limiting**: Per-model rate limiting with configurable limits
- **Rate Limiting**: Per-model rate limiting with configurable limits (requests and tokens)
- **OpenAI Compatibility**: Automatic detection and configuration for LLM models
- **Model Testing**: Interactive inference testing with support for both traditional and OpenAI formats
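The protocol-translation bullet above can be illustrated with a small sketch: an OpenAI chat-completion body is flattened into a KServe v2 inference payload. The field mapping shown is an assumption for illustration; the gateway's external processor defines the real conversion:

```python
def openai_to_kserve(openai_req: dict) -> dict:
    """Sketch of OpenAI chat-completion -> KServe v2 payload translation.

    Simplified: chat messages are flattened into a single prompt string.
    """
    prompt = "\n".join(
        f"{m['role']}: {m['content']}" for m in openai_req["messages"]
    )
    return {
        "inputs": [
            {
                "name": "prompt",
                "shape": [1],
                "datatype": "BYTES",
                "data": [prompt],
            }
        ],
        "parameters": {
            "max_tokens": openai_req.get("max_tokens", 256),
            "temperature": openai_req.get("temperature", 1.0),
        },
    }


req = {"model": "llama-3-8b", "messages": [{"role": "user", "content": "Hello!"}]}
print(openai_to_kserve(req)["inputs"][0]["data"])  # ['user: Hello!']
```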

### Multi-Tenant Architecture
- **Tenant Namespaces**: `tenant-a`, `tenant-b`, `tenant-c` with complete resource isolation
@@ -155,19 +184,24 @@ This platform implements a **dual-gateway architecture** where external traffic

### Serverless Model Serving
- **KServe InferenceServices**: Auto-scaling model endpoints with scale-to-zero capabilities
- **Supported Frameworks**: Scikit-learn, PyTorch, TensorFlow, Hugging Face transformers
- **Supported Frameworks**: Scikit-learn, PyTorch, TensorFlow, Hugging Face transformers, vLLM, TGI
- **OpenAI-Compatible Models**: Support for chat completions, completions, and embeddings endpoints
- **Traffic Management**: Canary deployments, A/B testing, and blue-green deployment patterns

## Key Directories

### Configuration Structure
- `configs/envoy-gateway/` - AI Gateway configurations (GatewayClass, HTTPRoute, Security Policies, Rate Limiting)
- `configs/envoy-gateway/` - AI Gateway configurations (GatewayClass, HTTPRoute, Security Policies, Rate Limiting, EnvoyExtensionPolicy)
- `configs/istio/` - Service mesh policies, authorization rules, and routing configurations
- `configs/kserve/models/` - Model deployment specifications for various ML frameworks
- `configs/auth/` - JWT server deployment and authentication configuration
- `configs/management/` - Management service deployment configuration
- `configs/observability/` - Grafana dashboards and monitoring configuration

### Root Configuration Files
- `envoydump.json` / `envoydump-latest.json` - Envoy configuration dumps for debugging
- `httproute.correct` - Sample HTTPRoute with URLRewrite and header modification filters

### Scripts Directory
- `scripts/bootstrap.sh` - **Primary deployment script** for complete platform setup
- `scripts/demo.sh` - **Interactive demo runner** with multiple scenarios
@@ -183,8 +217,11 @@ This platform implements a **dual-gateway architecture** where external traffic
- `management/package.json` - NPM scripts for React UI development
- `management/publishing.go` - Model publishing and management service
- `management/types.go` - Type definitions including PublishConfig and PublishedModel
- `management/test_execution.go` - Test execution service for interactive model testing
- `management/ui/src/components/PublishingForm.js` - React component for model publishing
- `management/ui/src/components/PublishingList.js` - React component for managing published models
- `management/ui/src/components/InferenceTest.js` - React component for interactive model testing
- `scripts/retest.sh` - Quick restart and port-forward for development

### Examples & Documentation
- `examples/serverless/` - Serverless configuration examples and templates
@@ -227,4 +264,7 @@ JWT tokens are required for model inference requests. The platform includes a JW
- **Shell-Driven Deployment**: All automation implemented via bash scripts
- **Production Patterns**: Demonstrates enterprise-grade AI/ML deployment practices with security, observability, and multi-tenancy
- **Management Service**: Full-stack application (Go backend + React frontend) for platform administration
- **Dual-Gateway Architecture**: External traffic flows through AI Gateway first, then Istio Gateway
- **OpenAI Compatibility**: Automatic protocol translation for OpenAI → KServe format
- **Model-Aware Routing**: Use `x-ai-eg-model` header for efficient model selection
- **Token-Based Rate Limiting**: LLM models support token-based rate limiting alongside request-based limits
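Model-aware routing via `x-ai-eg-model` amounts to a header-keyed lookup into a backend table. A hedged sketch of that idea (the platform actually routes with HTTPRoute header matches; the hostnames below are illustrative):

```python
def route(headers: dict, model_table: dict) -> str:
    """Pick a backend service from the x-ai-eg-model header (sketch only)."""
    model = headers.get("x-ai-eg-model")
    if model is None:
        raise ValueError("missing x-ai-eg-model header")
    if model not in model_table:
        raise ValueError(f"unknown model: {model}")
    return model_table[model]


# Illustrative mapping of model names to in-cluster predictor hostnames.
table = {
    "llama-3-8b": "llama-3-8b-predictor.tenant-a.svc.cluster.local",
    "sklearn-iris": "sklearn-iris-predictor.tenant-b.svc.cluster.local",
}
print(route({"x-ai-eg-model": "llama-3-8b"}, table))
# llama-3-8b-predictor.tenant-a.svc.cluster.local
```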
181 changes: 181 additions & 0 deletions GOALS.md
@@ -0,0 +1,181 @@
# Goals and Vision

## 🎯 Project Mission

**Inference-in-a-Box** aims to provide a production-ready, enterprise-grade AI/ML inference platform that demonstrates modern cloud-native deployment patterns, best practices, and comprehensive observability for AI workloads.

## 🚀 Primary Goals

### 1. **Production-Ready AI Infrastructure Demonstration**
- Showcase how to deploy AI/ML models at scale using cloud-native technologies
- Demonstrate enterprise-grade patterns for model serving, security, and observability
- Provide a reference architecture for AI infrastructure teams

### 2. **Educational Platform**
- Serve as a learning resource for platform engineers, DevOps teams, and AI practitioners
- Demonstrate the integration of multiple cloud-native technologies in a cohesive AI platform
- Provide hands-on examples of AI/ML deployment challenges and solutions

### 3. **Technology Integration Showcase**
- Demonstrate how modern cloud-native tools work together for AI workloads
- Show real-world integration patterns between service mesh, gateways, and AI serving frameworks
- Provide examples of advanced networking, security, and observability for AI systems

## 🏗️ Target State Architecture

### Core Technology Stack
- **Kubernetes**: Container orchestration and workload management
- **Istio Service Mesh**: Zero-trust networking, mTLS, and traffic management
- **Envoy AI Gateway**: AI-specific routing, protocol translation, and request handling
- **KServe**: Kubernetes-native serverless model serving with auto-scaling
- **Knative**: Serverless framework enabling scale-to-zero capabilities
- **Prometheus + Grafana**: Comprehensive monitoring and observability

### Key Architectural Patterns

#### **Dual-Gateway Design**
```
External Traffic → Envoy AI Gateway → Istio Gateway → KServe Models
                      (Tier-1)           (Tier-2)        (Serving)
```
- **Tier-1 (AI Gateway)**: AI-specific routing, JWT authentication, OpenAI protocol translation
- **Tier-2 (Service Mesh)**: mTLS encryption, traffic policies, service discovery
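Tier-1 JWT authentication can be sketched with a stdlib-only HS256 verifier. This is illustrative only: the platform's jwt-server and gateway security policy perform the real validation, and the claim set shown is an assumption:

```python
import base64
import hashlib
import hmac
import json


def _b64url_decode(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Build a demo HS256 JWT (for the sketch below, not a production signer)."""
    def enc(obj):
        return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()
    head, body = enc({"alg": "HS256", "typ": "JWT"}), enc(claims)
    sig = hmac.new(secret, f"{head}.{body}".encode(), hashlib.sha256).digest()
    return f"{head}.{body}." + base64.urlsafe_b64encode(sig).rstrip(b"=").decode()


def verify_hs256(token: str, secret: bytes) -> dict:
    """Check the signature and return the claims, as tier-1 does conceptually."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise PermissionError("bad signature")
    return json.loads(_b64url_decode(payload_b64))


tok = sign_hs256({"sub": "user-1", "tenant": "tenant-a"}, b"demo-secret")
print(verify_hs256(tok, b"demo-secret")["tenant"])  # tenant-a
```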

#### **Multi-Tenant Architecture**
- Complete namespace isolation (`tenant-a`, `tenant-b`, `tenant-c`)
- Separate resource quotas, policies, and observability scopes
- Tenant-specific security boundaries with Istio authorization policies
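The tenant boundary reduces to a simple invariant: a request's JWT tenant claim must match the namespace of the model it targets. A minimal sketch, assuming a `tenant` claim (Istio AuthorizationPolicy enforces the equivalent check in-mesh):

```python
TENANT_NAMESPACES = {"tenant-a", "tenant-b", "tenant-c"}


def authorize(jwt_claims: dict, target_namespace: str) -> bool:
    """Allow a request only into the namespace matching its tenant claim."""
    tenant = jwt_claims.get("tenant")
    return tenant in TENANT_NAMESPACES and tenant == target_namespace


print(authorize({"tenant": "tenant-a"}, "tenant-a"))  # True
print(authorize({"tenant": "tenant-a"}, "tenant-b"))  # False
```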

#### **Serverless Model Serving**
- Auto-scaling from zero to handle varying workloads
- Support for multiple ML frameworks (Scikit-learn, PyTorch, TensorFlow, Hugging Face)
- OpenAI-compatible API endpoints for LLM models
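Scale-to-zero autoscaling can be approximated by Knative's concurrency-based sizing rule: desired replicas are roughly in-flight requests divided by the per-pod concurrency target, and zero traffic scales the revision to zero. A simplified sketch (real KPA behavior adds panic windows, averaging, and activator handoff):

```python
import math


def desired_replicas(in_flight: float, target_concurrency: float,
                     max_replicas: int = 10) -> int:
    """Simplified Knative-KPA-style replica calculation."""
    if in_flight <= 0:
        return 0  # scale-to-zero when no requests are in flight
    return min(max_replicas, math.ceil(in_flight / target_concurrency))


print(desired_replicas(0, 10))    # 0
print(desired_replicas(25, 10))   # 3
print(desired_replicas(500, 10))  # 10 (capped at max_replicas)
```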

## 🎯 Target Capabilities

### **For Platform Engineers**
- **Infrastructure-as-Code**: Complete platform deployment via scripts and configurations
- **Observability**: Comprehensive monitoring, logging, and tracing for AI workloads
- **Security**: Zero-trust networking, JWT authentication, and authorization policies
- **Scalability**: Auto-scaling capabilities with performance optimization

### **For AI/ML Engineers**
- **Model Publishing**: Web-based interface for publishing and managing models
- **Multiple Protocols**: Support for traditional KServe and OpenAI-compatible APIs
- **Testing Framework**: Built-in testing capabilities with DNS resolution override
- **Documentation**: Auto-generated API documentation and examples

### **For DevOps Teams**
- **CI/CD Integration**: Automated testing and deployment workflows
- **Monitoring**: Real-time metrics, alerts, and performance dashboards
- **Security**: Comprehensive security policies and compliance patterns
- **Multi-tenancy**: Isolated environments for different teams or applications

## 🌟 Unique Value Propositions

### 1. **Complete End-to-End Solution**
Unlike fragmented tutorials or partial implementations, this project provides a complete, working AI inference platform that demonstrates real-world enterprise patterns.

### 2. **Production Patterns**
- Demonstrates actual production concerns: security, scalability, observability, multi-tenancy
- Shows how to handle edge cases and operational challenges
- Provides troubleshooting guides and best practices

### 3. **OpenAI Compatibility**
- Seamless integration with OpenAI client libraries
- Protocol translation from OpenAI format to KServe format
- Support for chat completions, embeddings, and model listing endpoints

### 4. **Advanced Networking**
- Sophisticated traffic management with canary deployments and A/B testing
- Advanced DNS resolution capabilities for testing scenarios
- Custom routing based on model types and tenant requirements

## 🎯 Success Metrics

### **User Experience Metrics**
- **Ease of Deployment**: One-command bootstrap process
- **Documentation Quality**: Complete setup and usage documentation
- **Developer Experience**: Intuitive web interface, comprehensive testing tools
- **Learning Value**: Clear architectural patterns and implementation examples

## 🚧 Current Status vs Target State

### ✅ **Achieved**
- Complete dual-gateway architecture implementation
- Multi-tenant namespace isolation and security policies
- OpenAI-compatible API with protocol translation
- Comprehensive observability stack (Prometheus, Grafana, Kiali, Jaeger)
- Web-based management interface with model publishing
- Advanced testing capabilities with DNS resolution override
- Auto-scaling model serving with KServe and Knative
- Security implementation with JWT authentication and Istio policies

### 🔄 **In Progress**
- Enhanced model lifecycle management
- Advanced rate limiting and quota management
- Expanded model framework support
- Performance optimization and tuning

### 🎯 **Future Roadmap**
- **Advanced AI Features**: Model versioning, A/B testing, canary deployments
- **Enhanced Observability**: AI-specific metrics, model performance tracking
- **Extended Protocols**: Support for additional AI protocols and frameworks
- **Enterprise Features**: RBAC, audit logging, compliance reporting
- **Multi-Cloud**: Deployment patterns for AWS, GCP, Azure
- **Edge Computing**: Edge deployment scenarios and patterns

## 🎓 Learning Outcomes

By exploring and deploying this platform, users will gain practical experience with:

### **Kubernetes Ecosystem**
- Advanced Kubernetes patterns for AI workloads
- Service mesh implementation and configuration
- Gateway and ingress management
- Custom resource definitions and operators

### **AI/ML Operations**
- Model serving and lifecycle management
- Auto-scaling strategies for AI workloads
- Performance monitoring and optimization
- Protocol translation and API gateway patterns

### **Cloud-Native Security**
- Zero-trust networking implementation
- JWT-based authentication and authorization
- mTLS configuration and certificate management
- Multi-tenant security boundaries

### **Observability and Operations**
- Comprehensive monitoring setup for AI systems
- Distributed tracing for request flows
- Performance metrics and alerting
- Troubleshooting and debugging techniques

## 🤝 Community and Contribution

### **Target Audience**
- **Platform Engineers** building AI infrastructure
- **DevOps Engineers** managing AI/ML workloads
- **AI/ML Engineers** deploying models at scale
- **Students and Educators** learning cloud-native AI patterns

### **Contribution Areas**
- Additional model framework integrations
- Enhanced security patterns and policies
- Performance optimization and benchmarking
- Documentation and tutorial improvements
- Testing framework enhancements

## 📈 Strategic Impact

This project serves as a bridge between theoretical cloud-native AI concepts and practical, production-ready implementations. It accelerates AI platform adoption by providing:

1. **Proven Patterns**: Battle-tested architectural patterns and configurations
2. **Reduced Risk**: Validated technology integrations and security models
3. **Faster Time-to-Market**: Complete reference implementation reducing development time
4. **Knowledge Transfer**: Comprehensive documentation and examples for team learning
5. **Operational Excellence**: Built-in observability, monitoring, and troubleshooting capabilities

By providing this comprehensive platform, we enable organizations to focus on their AI/ML applications rather than infrastructure complexity, ultimately accelerating AI adoption and innovation across the industry.