A FastAPI server that gives any non-function-calling LLM an OpenAI-compatible
/v1/chat/completions endpoint — including full SSE streaming and multi-turn
tool call loops.
- Live diagram: swimlanes.io
- `POST /v1/chat/completions`: OpenAI-compatible Chat Completions with tool-call wrapping (streaming + non-streaming).
- `GET /v1/models`, `GET /v1/models/{model}`: Minimal models endpoints for OpenAI SDK compatibility.
- Catch-all passthrough: any other path is reverse-proxied to your backend (e.g. `/v1/embeddings`, `/v1/audio/*`, `/v1/fine_tuning/*`).
- `GET /health`: Wrapper health info.
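Because the wrapper speaks the OpenAI protocol, any standard OpenAI client can point at it unchanged. A minimal sketch using the official Python SDK (the port, model name, and `get_weather` tool are illustrative assumptions):

```python
from openai import OpenAI

# Point the standard OpenAI SDK at the wrapper instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

response = client.chat.completions.create(
    model="llama3.1",  # forwarded to the backend configured via LLM_MODEL
    messages=[{"role": "user", "content": "What's the weather in Hanoi?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool defined by your agent
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
)

# If the model decided to call the tool, it arrives as a regular OpenAI tool_call.
print(response.choices[0].message.tool_calls)
```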
agent-tools-proxy/
├── app/
│ ├── main.py # FastAPI app, lifespan, CORS
│ ├── config.py # Settings via pydantic-settings + .env
│ ├── models/
│ │ ├── openai.py # OpenAI-spec request/response Pydantic models
│ ├── core/
│ │ ├── adapters.py # Backend adapters (ollama vs openai-compatible)
│ │ ├── prompt.py # Tool schema → system prompt injection
│ │ ├── buffer.py # Streaming brace-depth buffer & detector
│ │ └── formatter.py # LLM output → OpenAI SSE chunk formatter
│ └── routers/
│ ├── chat.py # POST /v1/chat/completions handler
│ ├── models.py # GET /v1/models
│ └── proxy.py # Catch-all reverse proxy passthrough
├── tests/
│ ├── test_prompt.py
│ ├── test_buffer.py
│ └── test_formatter.py
├── conftest.py
├── pyproject.toml
├── uv.lock
├── Dockerfile
├── .dockerignore
└── .env.example
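The core trick lives in `app/core/prompt.py`: tool schemas from the request are rendered into the system prompt so a plain LLM can emit tool calls as JSON. The actual template is not reproduced here; a simplified sketch of the idea (the wording and JSON shape are assumptions based on the default `TOOL_CALL_OPEN_TOKEN`):

```python
import json

def render_tool_prompt(tools: list[dict]) -> str:
    """Illustrative sketch of rendering tool schemas into a system prompt;
    the real template in app/core/prompt.py may differ."""
    schemas = "\n".join(json.dumps(t["function"], indent=2) for t in tools)
    return (
        "You have access to the following tools:\n\n"
        + schemas
        + "\n\nTo call a tool, reply with a single JSON object and nothing else:\n"
        + '{"tool_call": {"name": "<tool name>", "arguments": { ... }}}\n'
        + "Otherwise, answer the user directly."
    )
```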
# 1. install (uv)
uv sync --dev
# 2. configure
cp .env.example .env # set LLM_BACKEND, LLM_BASE_URL, LLM_MODEL
# 3. start your backend (e.g. Ollama)
ollama serve   # in a separate terminal
ollama pull llama3.1
# 4. run the wrapper
uv run uvicorn app.main:app --reload --port 8080
# 5. run tests (no backend needed)
uv run pytest tests/ -v

# build the Docker image
docker build -t agent-tools-proxy:local .
# Example: connect to local Ollama from inside container (macOS)
docker run --rm -p 8080:8080 \
-e LLM_BACKEND=ollama \
-e LLM_BASE_URL=http://host.docker.internal:11434 \
-e LLM_MODEL=llama3.1 \
agent-tools-proxy:local
# with .env file
docker run --rm -p 8080:8080 \
-v $(pwd)/.env:/app/.env \
agent-tools-proxy:local
# Example: connect to a remote OpenAI-compatible backend (e.g. vLLM)
docker run --rm -p 8080:8080 \
-e LLM_BACKEND=openai \
-e LLM_BASE_URL=https://vllm.zalopay.vn \
-e LLM_MODEL=gemma-3-27b \
agent-tools-proxy:local

The wrapper uses a pluggable adapter layer in `app/core/adapters.py`:

- `LLM_BACKEND=ollama`: talks to Ollama's `/api/chat` (NDJSON streaming)
- `LLM_BACKEND=openai`: talks to OpenAI-compatible `/v1/chat/completions` backends (SSE streaming)
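Both adapters expose the same streaming surface to the rest of the wrapper. As an illustration only, the Ollama side might look roughly like this (class and method names are assumptions, not the actual `adapters.py` API):

```python
import json
from typing import AsyncIterator

import httpx

class OllamaAdapter:
    """Illustrative sketch; the real implementation lives in app/core/adapters.py."""

    def __init__(self, base_url: str, model: str) -> None:
        self.base_url = base_url
        self.model = model

    async def stream_chat(self, messages: list[dict]) -> AsyncIterator[str]:
        """Yield text deltas from Ollama's NDJSON /api/chat streaming endpoint."""
        async with httpx.AsyncClient(base_url=self.base_url) as client:
            async with client.stream(
                "POST",
                "/api/chat",
                json={"model": self.model, "messages": messages, "stream": True},
            ) as resp:
                async for line in resp.aiter_lines():
                    if not line:
                        continue
                    chunk = json.loads(line)  # one JSON object per line (NDJSON)
                    yield chunk.get("message", {}).get("content", "")
```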
| Variable | Default | Description |
|---|---|---|
| `LLM_BACKEND` | `ollama` | Backend adapter: `ollama` or `openai` |
| `LLM_BASE_URL` | `http://localhost:11434` | LLM backend base URL |
| `LLM_MODEL` | `llama3.1` | Model name passed to backend |
| `LLM_API_KEY` | *(empty)* | Bearer token if backend requires auth |
| `TOOL_CALL_OPEN_TOKEN` | `{"tool_call"` | Prefix that signals a tool call in the stream |
| `LOG_LEVEL` | `info` | Python logging level |
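These variables map onto `app/config.py` via pydantic-settings. A plausible sketch of that settings class, inferred from the table above (the actual field names and defaults are defined in `config.py`):

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    """Plausible shape of app/config.py; the real class may differ."""

    model_config = SettingsConfigDict(env_file=".env")

    llm_backend: str = "ollama"                   # "ollama" or "openai"
    llm_base_url: str = "http://localhost:11434"
    llm_model: str = "llama3.1"
    llm_api_key: str = ""                         # bearer token, if required
    tool_call_open_token: str = '{"tool_call"'
    log_level: str = "info"
```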
- One tool call per turn — parallel calls not supported
- Context grows with tool rounds — consider summarizing old results after N turns
- Prompt-based tool calling — some models may still produce malformed tool JSON (we fall back to treating it as plain content)
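The malformed-JSON fallback depends on the streaming detector in `app/core/buffer.py`. A simplified sketch of the brace-depth idea (not the actual implementation; it ignores braces inside strings for brevity):

```python
import json

def extract_tool_call(buffer: str, open_token: str = '{"tool_call"') -> dict | None:
    """Return the parsed tool call once the buffered JSON object is complete.

    Simplified: buffer text after the open token, track brace depth, and try
    to parse once the braces balance; on malformed JSON the caller falls back
    to emitting the buffer as plain content.
    """
    if not buffer.startswith(open_token):
        return None  # not a tool call; stream as normal content
    depth = 0
    for i, ch in enumerate(buffer):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(buffer[: i + 1])
                except json.JSONDecodeError:
                    return None  # malformed JSON: fall back to plain content
    return None  # object not complete yet; keep buffering
```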
