Htunn/OmniInference

OmniInference

OmniInference is an enterprise-grade, cloud-native AI gateway, abstraction layer, and observability proxy. It sits between application code and upstream LLM providers and handles routing, authentication, observability, and failover for inference calls, so no vendor SDK touches your application.

Six Architectural Pillars

| # | Pillar | Mechanism |
|---|--------|-----------|
| 1 | Provider Agnostic | Apps call a single unified API; the core.Provider interface is the only contract |
| 2 | Comprehensive Observability | Every call emits a structured JSON log: tokens, latency, model version, hashed input |
| 3 | Decoupled Auth | Bearer-token enforcement at the gateway edge; consuming apps never handle credentials |
| 4 | Explicit Data Paths | Wire ↔ internal translation happens in one place (providers/openai); API keys are never logged |
| 5 | Built-in Resiliency | Named route chains with automatic failover on rate-limit / timeout / 5xx errors |
| 6 | Cost Transparency | Per-request omni_metadata (team, feature, env) propagated through all telemetry |

Component Diagram

graph TB
    subgraph Clients
        APP[Application / SDK]
    end

    subgraph OmniInference Gateway
        direction TB
        MW_RID[Middleware: RequestID]
        MW_AUTH[Middleware: Auth]
        MW_OBS[Middleware: ObservabilityLog]
        HANDLER[Handler: /v1/chat/completions]
        ROUTER[Router + Fallback Engine]

        MW_RID --> MW_AUTH --> MW_OBS --> HANDLER --> ROUTER
    end

    subgraph Core Domain
        TYPES[core/types.go<br/>InferenceRequest / Response]
        IFACE[core/provider.go<br/>Provider interface]
        ERRS[core/errors.go<br/>ProviderError taxonomy]
    end

    subgraph Provider Registry
        REG[providers/Registry]
        OAI[providers/openai<br/>OpenAI · Azure OAI · vLLM]
        FUTURE[providers/...<br/>Bedrock · Vertex · Anthropic]
    end

    subgraph Observability
        HASH[internal/observability<br/>SHA-256 input hash]
        LOGS[Structured JSON Logs<br/>slog — stdout]
    end

    subgraph Upstream LLMs
        LLM1[OpenAI / Azure OpenAI]
        LLM2[Local vLLM]
        LLM3[AWS Bedrock ·<br/>Vertex AI · Anthropic]
    end

    APP -->|POST /v1/chat/completions<br/>Bearer token| MW_RID
    ROUTER --> REG
    REG --> OAI
    REG -.->|future| FUTURE
    OAI --> LLM1
    OAI --> LLM2
    FUTURE -.-> LLM3
    HANDLER --> HASH
    HASH --> LOGS
    MW_OBS --> LOGS

    TYPES -.-> HANDLER
    IFACE -.-> OAI
    ERRS -.-> ROUTER

Request Sequence Diagram

The sequence below shows a request that hits a rate limit (HTTP 429) on the primary provider and automatically fails over to a secondary provider.

sequenceDiagram
    autonumber
    actor Client
    participant GW as Gateway<br/>(HTTP Server)
    participant MW_RID as Middleware<br/>RequestID
    participant MW_AUTH as Middleware<br/>Auth
    participant MW_OBS as Middleware<br/>ObsLog
    participant Handler as Handler<br/>/v1/chat/completions
    participant Router as Router<br/>Fallback Engine
    participant Hash as Observability<br/>SHA-256 Hash
    participant P1 as Provider<br/>azure-openai-east
    participant P2 as Provider<br/>azure-openai-west

    Client->>GW: POST /v1/chat/completions<br/>Authorization: Bearer <token>

    GW->>MW_RID: forward request
    MW_RID->>MW_RID: generate / reuse X-Request-ID
    MW_RID-->>GW: inject ID into context + response header

    GW->>MW_AUTH: forward request
    MW_AUTH->>MW_AUTH: validate Bearer token
    alt Invalid token
        MW_AUTH-->>Client: 401 Unauthorized
    end

    GW->>MW_OBS: forward request
    MW_OBS->>MW_OBS: record start time

    MW_OBS->>Handler: forward request

    Handler->>Handler: decode JSON body<br/>build InferenceRequest
    Handler->>Hash: HashMessages(req.Messages)
    Hash-->>Handler: SHA-256 hex digest<br/>(no raw PII stored)

    Handler->>Router: Complete(ctx, req, omni_route, omni_provider)

    Router->>Router: resolve provider chain<br/>["azure-openai-east", "azure-openai-west"]

    Router->>P1: Complete(ctx, req) — attempt 1
    P1-->>Router: ProviderError{Kind: rate_limit, HTTP: 429}

    Note over Router: ErrKind.IsRetryable() == true<br/>advance to next provider

    Router->>P2: Complete(ctx, req) — attempt 2 (fallback)
    P2-->>Router: InferenceResponse{choices, usage, telemetry}

    Router->>Router: set FallbackOccurred=true<br/>RoutedProvider="azure-openai-west"
    Router-->>Handler: InferenceResponse

    Handler->>Handler: stamp GatewayLatency + InputHash
    Handler->>MW_OBS: emit InferenceLog<br/>{request_id, routed_provider, input_hash,<br/>prompt_tokens, completion_tokens,<br/>provider_latency, gateway_latency,<br/>fallback_occurred, metadata}

    Handler-->>MW_OBS: 200 OK + JSON response body
    MW_OBS->>MW_OBS: emit request log<br/>{method, path, status, gateway_latency}
    MW_OBS-->>Client: 200 OK + InferenceResponse<br/>X-Request-ID: <id>

Project Layout

OmniInference/
├── cmd/
│   └── omniinference/
│       └── main.go              # Entrypoint — reads env config, starts server
├── core/
│   ├── types.go                 # InferenceRequest, InferenceResponse, ModelRef, Usage, Telemetry
│   ├── provider.go              # Provider interface + InferenceStreamChunk
│   └── errors.go                # ProviderError, ErrKind taxonomy
├── providers/
│   ├── registry.go              # Registry factory (NewRegistry, NewRegistryFromProviders)
│   └── openai/
│       └── adapter.go           # OpenAI-compatible adapter (OpenAI · Azure OAI · vLLM)
├── gateway/
│   ├── config.go                # Config, ConfigFromEnv, MarshalSafe (redacts secrets)
│   ├── server.go                # Server — HTTP lifecycle + middleware chain assembly
│   ├── handler.go               # /v1/chat/completions handler
│   ├── middleware/
│   │   ├── request_id.go        # X-Request-ID injection
│   │   ├── auth.go              # Bearer-token gate (Pillar 3)
│   │   └── observability.go     # Structured slog request + inference telemetry
│   └── router/
│       └── router.go            # Router — named route chains, automatic failover (Pillar 5)
└── internal/
    └── observability/
        └── hash.go              # SHA-256 input hashing (Pillar 4)

Quick Start

Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| OMNI_OPENAI_API_KEY | yes | Bearer token for the OpenAI-compatible provider |
| OMNI_OPENAI_BASE_URL | no | Override endpoint (Azure OAI, vLLM, etc.). Default: https://api.openai.com/v1 |
| OMNI_AUTH_TOKENS | no | Comma-separated list of valid client bearer tokens. Empty = auth disabled |
| OMNI_DEFAULT_PROVIDER | no | Provider name to use when no route is specified. Default: openai |
| OMNI_PORT | no | Listen port. Default: 8080 |
| OMNI_LOG_LEVEL | no | debug, info, warn, or error. Default: info |
Run Locally

export OMNI_OPENAI_API_KEY=sk-...
export OMNI_AUTH_TOKENS=my-local-token
go run ./cmd/omniinference

Send a Request

curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer my-local-token" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}],
    "omni_metadata": {"team": "platform", "feature": "chat"}
  }'

Route with Automatic Failover

Use omni_route to select a named route chain configured on the gateway:

curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer my-local-token" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}],
    "omni_route": "primary-chain",
    "omni_metadata": {"team": "platform", "feature": "chat"}
  }'

Pin to a Specific Provider

curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer my-local-token" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}],
    "omni_provider": "vllm-local"
  }'
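
On the gateway side, the omni_* fields in the three requests above could be decoded alongside the standard chat fields roughly as follows. The JSON field names come from the curl examples; the struct names, decodeRequest, and the rule that omni_provider takes precedence over omni_route are assumptions, since this README does not say how the two interact.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// chatRequest is a hypothetical wire shape: OpenAI-style fields plus the
// omni_* extensions shown in the curl examples.
type chatRequest struct {
	Model        string            `json:"model"`
	Messages     []chatMessage     `json:"messages"`
	OmniRoute    string            `json:"omni_route,omitempty"`
	OmniProvider string            `json:"omni_provider,omitempty"`
	OmniMetadata map[string]string `json:"omni_metadata,omitempty"`
}

// decodeRequest parses the body and reports which routing path would apply.
// Preferring a pinned provider over a named route is an assumption.
func decodeRequest(body []byte) (chatRequest, string, error) {
	var req chatRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return chatRequest{}, "", fmt.Errorf("decode body: %w", err)
	}
	if len(req.Messages) == 0 {
		return chatRequest{}, "", errors.New("messages must not be empty")
	}
	switch {
	case req.OmniProvider != "":
		return req, "provider:" + req.OmniProvider, nil
	case req.OmniRoute != "":
		return req, "route:" + req.OmniRoute, nil
	default:
		return req, "default", nil
	}
}

func main() {
	body := []byte(`{
		"model": "gpt-4o",
		"messages": [{"role": "user", "content": "Hello"}],
		"omni_route": "primary-chain",
		"omni_metadata": {"team": "platform", "feature": "chat"}
	}`)
	req, target, err := decodeRequest(body)
	fmt.Println(req.Model, target, req.OmniMetadata["team"], err)
}
```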

Health Check

curl http://localhost:8080/healthz
# {"status":"ok"}

Observability Output

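Every inference call emits two structured JSON log lines to stdout. The latency fields in the samples below appear to be integer nanoseconds, which is how slog's JSON handler encodes a Go time.Duration (423000000 is 423 ms).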

Request log (emitted by ObservabilityLog middleware):

{
  "time": "2026-06-25T12:00:00Z",
  "level": "INFO",
  "msg": "request",
  "request_id": "a3f1...",
  "method": "POST",
  "path": "/v1/chat/completions",
  "status": 200,
  "response_bytes": 512,
  "gateway_latency": 423000000
}

Inference log (emitted by the handler after dispatch):

{
  "time": "2026-06-25T12:00:00Z",
  "level": "INFO",
  "msg": "inference",
  "request_id": "a3f1...",
  "routed_provider": "azure-openai-west",
  "input_hash": "e3b0c44298fc1c14...",
  "prompt_tokens": 42,
  "completion_tokens": 87,
  "total_tokens": 129,
  "provider_latency": 380000000,
  "gateway_latency": 423000000,
  "fallback_occurred": true,
  "metadata": "{\"team\":\"platform\",\"feature\":\"chat\"}"
}
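
The input_hash field lets logs correlate identical prompts without storing their text. A minimal sketch of such a hash follows. SHA-256 and hex encoding are stated in this README, but the Message type and the way messages are serialized before hashing (role and content, each terminated by a NUL byte) are assumptions; internal/observability/hash.go may canonicalize differently.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Message is a simplified stand-in for the request message type.
type Message struct {
	Role    string
	Content string
}

// hashMessages returns the hex SHA-256 digest of the conversation.
// Each field is followed by a NUL separator so that ("ab", "c") and
// ("a", "bc") do not collide; this framing is an assumption.
func hashMessages(msgs []Message) string {
	h := sha256.New()
	for _, m := range msgs {
		h.Write([]byte(m.Role))
		h.Write([]byte{0})
		h.Write([]byte(m.Content))
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := hashMessages([]Message{{Role: "user", Content: "Hello"}})
	b := hashMessages([]Message{{Role: "user", Content: "Hello"}})
	c := hashMessages([]Message{{Role: "user", Content: "Hello!"}})
	fmt.Println(len(a), a == b, a == c)
}
```

A plain digest of a short or guessable prompt can still be matched by someone who hashes candidate inputs, so treat the hash as a correlation key rather than as anonymization.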

Running Tests

go test ./...
go vet ./...

Roadmap

  • Additional provider adapters: AWS Bedrock, Vertex AI, Anthropic
  • Retry backoff with jitter (exponential, configurable per route)
  • Per-key / per-team rate limiting
  • Streaming SSE passthrough (/v1/chat/completions with stream: true)
  • YAML config file loader (supplement env-var config)
  • Prometheus metrics exporter
  • DB-backed audit log persistence
  • Admin API: live route config reload, provider health status
