Enterprise-grade, cloud-native AI gateway, abstraction layer, and observability proxy. OmniInference sits between application code and upstream LLM providers — routing, authenticating, observing, and failing over inference calls without any vendor SDK touching your application.
| # | Pillar | Mechanism |
|---|---|---|
| 1 | Provider Agnostic | Apps call a single unified API; core.Provider interface is the only contract |
| 2 | Comprehensive Observability | Every call emits a structured JSON log — tokens, latency, model version, hashed input |
| 3 | Decoupled Auth | Bearer-token enforcement at the gateway edge; consuming apps never handle credentials |
| 4 | Explicit Data Paths | Wire ↔ internal translation happens in one place (providers/openai); API keys are never logged |
| 5 | Built-in Resiliency | Named route chains with automatic failover on rate-limit / timeout / 5xx errors |
| 6 | Cost Transparency | Per-request omni_metadata (team, feature, env) propagated through all telemetry |
graph TB
subgraph Clients
APP[Application / SDK]
end
subgraph OmniInference Gateway
direction TB
MW_RID[Middleware: RequestID]
MW_AUTH[Middleware: Auth]
MW_OBS[Middleware: ObservabilityLog]
HANDLER[Handler: /v1/chat/completions]
ROUTER[Router + Fallback Engine]
MW_RID --> MW_AUTH --> MW_OBS --> HANDLER --> ROUTER
end
subgraph Core Domain
TYPES[core/types.go<br/>InferenceRequest / Response]
IFACE[core/provider.go<br/>Provider interface]
ERRS[core/errors.go<br/>ProviderError taxonomy]
end
subgraph Provider Registry
REG[providers/Registry]
OAI[providers/openai<br/>OpenAI · Azure OAI · vLLM]
FUTURE[providers/...<br/>Bedrock · Vertex · Anthropic]
end
subgraph Observability
HASH[internal/observability<br/>SHA-256 input hash]
LOGS[Structured JSON Logs<br/>slog — stdout]
end
subgraph Upstream LLMs
LLM1[OpenAI / Azure OpenAI]
LLM2[Local vLLM]
LLM3[AWS Bedrock ·<br/>Vertex AI · Anthropic]
end
APP -->|POST /v1/chat/completions<br/>Bearer token| MW_RID
ROUTER --> REG
REG --> OAI
REG -.->|future| FUTURE
OAI --> LLM1
OAI --> LLM2
FUTURE -.-> LLM3
HANDLER --> HASH
HASH --> LOGS
MW_OBS --> LOGS
TYPES -.-> HANDLER
IFACE -.-> OAI
ERRS -.-> ROUTER
The sequence below shows a request that hits a rate-limit on the primary provider and automatically fails over to a secondary provider.
sequenceDiagram
autonumber
actor Client
participant GW as Gateway<br/>(HTTP Server)
participant MW_RID as Middleware<br/>RequestID
participant MW_AUTH as Middleware<br/>Auth
participant MW_OBS as Middleware<br/>ObsLog
participant Handler as Handler<br/>/v1/chat/completions
participant Router as Router<br/>Fallback Engine
participant Hash as Observability<br/>SHA-256 Hash
participant P1 as Provider<br/>azure-openai-east
participant P2 as Provider<br/>azure-openai-west
Client->>GW: POST /v1/chat/completions<br/>Authorization: Bearer <token>
GW->>MW_RID: forward request
MW_RID->>MW_RID: generate / reuse X-Request-ID
MW_RID-->>GW: inject ID into context + response header
GW->>MW_AUTH: forward request
MW_AUTH->>MW_AUTH: validate Bearer token
alt Invalid token
MW_AUTH-->>Client: 401 Unauthorized
end
GW->>MW_OBS: forward request
MW_OBS->>MW_OBS: record start time
MW_OBS->>Handler: forward request
Handler->>Handler: decode JSON body<br/>build InferenceRequest
Handler->>Hash: HashMessages(req.Messages)
Hash-->>Handler: SHA-256 hex digest<br/>(no raw PII stored)
Handler->>Router: Complete(ctx, req, omni_route, omni_provider)
Router->>Router: resolve provider chain<br/>["azure-openai-east", "azure-openai-west"]
Router->>P1: Complete(ctx, req) — attempt 1
P1-->>Router: ProviderError{Kind: rate_limit, HTTP: 429}
Note over Router: ErrKind.IsRetryable() == true<br/>advance to next provider
Router->>P2: Complete(ctx, req) — attempt 2 (fallback)
P2-->>Router: InferenceResponse{choices, usage, telemetry}
Router->>Router: set FallbackOccurred=true<br/>RoutedProvider="azure-openai-west"
Router-->>Handler: InferenceResponse
Handler->>Handler: stamp GatewayLatency + InputHash
Handler->>MW_OBS: emit InferenceLog<br/>{request_id, routed_provider, input_hash,<br/>prompt_tokens, completion_tokens,<br/>provider_latency, gateway_latency,<br/>fallback_occurred, metadata}
Handler-->>MW_OBS: 200 OK + JSON response body
MW_OBS->>MW_OBS: emit request log<br/>{method, path, status, gateway_latency}
MW_OBS-->>Client: 200 OK + InferenceResponse<br/>X-Request-ID: <id>
OmniInference/
├── cmd/
│   └── omniinference/
│       └── main.go              # Entrypoint: reads env config, starts server
├── core/
│   ├── types.go                 # InferenceRequest, InferenceResponse, ModelRef, Usage, Telemetry
│   ├── provider.go              # Provider interface + InferenceStreamChunk
│   └── errors.go                # ProviderError, ErrKind taxonomy
├── providers/
│   ├── registry.go              # Registry factory (NewRegistry, NewRegistryFromProviders)
│   └── openai/
│       └── adapter.go           # OpenAI-compatible adapter (OpenAI · Azure OAI · vLLM)
├── gateway/
│   ├── config.go                # Config, ConfigFromEnv, MarshalSafe (redacts secrets)
│   ├── server.go                # Server: HTTP lifecycle + middleware chain assembly
│   ├── handler.go               # /v1/chat/completions handler
│   ├── middleware/
│   │   ├── request_id.go        # X-Request-ID injection
│   │   ├── auth.go              # Bearer-token gate (Pillar 3)
│   │   └── observability.go     # Structured slog request + inference telemetry
│   └── router/
│       └── router.go            # Router: named route chains, automatic failover (Pillar 5)
└── internal/
    └── observability/
        └── hash.go              # SHA-256 input hashing (Pillar 4)
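
The input hashing listed under internal/observability can be sketched as follows. This is a minimal illustration, assuming the digest is taken over the JSON encoding of the messages; the real `HashMessages` may canonicalize its input differently, and the `Message` type here is a simplified stand-in.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Message is a simplified stand-in for the gateway's message type.
type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// HashMessages returns the hex-encoded SHA-256 digest of the JSON-encoded
// messages, so logs can correlate identical inputs without storing raw text.
// Hashing the JSON encoding is an assumption of this sketch.
func HashMessages(msgs []Message) (string, error) {
	b, err := json.Marshal(msgs)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	h, err := HashMessages([]Message{{Role: "user", Content: "Hello"}})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(h)
}
```

The digest is always 64 hex characters, and identical message lists produce identical digests, which is what makes the `input_hash` log field usable for grouping repeated prompts.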
| Variable | Required | Description |
|---|---|---|
| OMNI_OPENAI_API_KEY | yes | Bearer token for the OpenAI-compatible provider |
| OMNI_OPENAI_BASE_URL | no | Override endpoint (Azure OAI, vLLM, etc.). Default: https://api.openai.com/v1 |
| OMNI_AUTH_TOKENS | no | Comma-separated list of valid client bearer tokens. Empty = auth disabled |
| OMNI_DEFAULT_PROVIDER | no | Provider name to use when no route is specified. Default: openai |
| OMNI_PORT | no | Listen port. Default: 8080 |
| OMNI_LOG_LEVEL | no | One of debug, info, warn, error. Default: info |
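
The variables above could be loaded along these lines. This is a sketch under assumptions: the `Config` field names and the `configFrom` helper are hypothetical and do not reproduce gateway/config.go; only the variable names and defaults come from the table.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// Config holds the settings from the table; field names are assumptions.
type Config struct {
	OpenAIAPIKey    string
	OpenAIBaseURL   string
	AuthTokens      []string
	DefaultProvider string
	Port            string
	LogLevel        string
}

// getenv returns the value for key, or def when the value is empty.
func getenv(lookup func(string) string, key, def string) string {
	if v := lookup(key); v != "" {
		return v
	}
	return def
}

// configFrom builds a Config from a lookup function. Passing os.Getenv reads
// the process environment; tests can pass a map-backed function instead.
func configFrom(lookup func(string) string) (Config, error) {
	cfg := Config{
		OpenAIAPIKey:    lookup("OMNI_OPENAI_API_KEY"),
		OpenAIBaseURL:   getenv(lookup, "OMNI_OPENAI_BASE_URL", "https://api.openai.com/v1"),
		DefaultProvider: getenv(lookup, "OMNI_DEFAULT_PROVIDER", "openai"),
		Port:            getenv(lookup, "OMNI_PORT", "8080"),
		LogLevel:        getenv(lookup, "OMNI_LOG_LEVEL", "info"),
	}
	if cfg.OpenAIAPIKey == "" {
		return Config{}, errors.New("OMNI_OPENAI_API_KEY is required")
	}
	// Split the comma-separated token list, dropping blanks and whitespace.
	for _, t := range strings.Split(lookup("OMNI_AUTH_TOKENS"), ",") {
		if t = strings.TrimSpace(t); t != "" {
			cfg.AuthTokens = append(cfg.AuthTokens, t)
		}
	}
	return cfg, nil
}

func main() {
	cfg, err := configFrom(os.Getenv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	// Print only non-secret fields; the API key is deliberately left out.
	fmt.Printf("port=%s provider=%s auth_enabled=%t\n",
		cfg.Port, cfg.DefaultProvider, len(cfg.AuthTokens) > 0)
}
```

An empty `AuthTokens` slice corresponds to the "auth disabled" case in the table.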
export OMNI_OPENAI_API_KEY=sk-...
export OMNI_AUTH_TOKENS=my-local-token
go run ./cmd/omniinference

curl http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer my-local-token" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [{"role": "user", "content": "Hello"}],
"omni_metadata": {"team": "platform", "feature": "chat"}
}'

Use omni_route to select a named route chain configured on the gateway:

curl http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer my-local-token" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [{"role": "user", "content": "Hello"}],
"omni_route": "primary-chain",
"omni_metadata": {"team": "platform", "feature": "chat"}
}'

curl http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer my-local-token" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [{"role": "user", "content": "Hello"}],
"omni_provider": "vllm-local"
}'

curl http://localhost:8080/healthz
# {"status":"ok"}

Every inference call emits two structured JSON log lines to stdout.

Request log (emitted by ObservabilityLog middleware):
{
"time": "2026-06-25T12:00:00Z",
"level": "INFO",
"msg": "request",
"request_id": "a3f1...",
"method": "POST",
"path": "/v1/chat/completions",
"status": 200,
"response_bytes": 512,
"gateway_latency": 423000000
}

Inference log (emitted by the handler after dispatch):
{
"time": "2026-06-25T12:00:00Z",
"level": "INFO",
"msg": "inference",
"request_id": "a3f1...",
"routed_provider": "azure-openai-west",
"input_hash": "e3b0c44298fc1c14...",
"prompt_tokens": 42,
"completion_tokens": 87,
"total_tokens": 129,
"provider_latency": 380000000,
"gateway_latency": 423000000,
"fallback_occurred": true,
"metadata": "{\"team\":\"platform\",\"feature\":\"chat\"}"
}

go test ./...
go vet ./...

- Additional provider adapters: AWS Bedrock, Vertex AI, Anthropic
- Retry backoff with jitter (exponential, configurable per route)
- Per-key / per-team rate limiting
- Streaming SSE passthrough (/v1/chat/completions with stream: true)
- YAML config file loader (supplements env-var config)
- Prometheus metrics exporter
- DB-backed audit log persistence
- Admin API: live route config reload, provider health status