
fix(openai-compat): move tool schemas to system prompt to eliminate per-turn latency spikes#43

Merged
Enderfga merged 2 commits into Enderfga:main from fcoppey:fix/move-tools-to-system-prompt on Apr 16, 2026

Conversation

fcoppey (Contributor) commented on Apr 14, 2026

Summary

When a /v1/chat/completions request includes tools, the proxy currently prepends an <available_tools> block to every user message. For callers with many tools (e.g. OpenClaw gateway routing 90+ MCP tools from Home Assistant), this block can be 50+ KB and is sent on every turn.

This causes a reproducible pattern of 30–50 second latency spikes every ~4 calls against otherwise-warm sessions. The fix moves the tool block into the session system prompt at session-create time, so user messages stay small and Anthropic's prompt cache can reliably hit the tool definitions.

A fully backward-compatible opt-out env var (OPENAI_COMPAT_TOOLS_PER_MESSAGE=1) preserves the pre-fix behavior for callers who need to mutate their tool list within a single session.

Repro / Bisection

Using OpenClaw gateway (openclaw/openclaw) routed through claude-code-skill serve on port 18796, an agent with 93 MCP tools (home-assistant + anomaly-rules) on claude-sonnet-4-6:

call 1  wall= 9358ms gateway=  5947ms
call 2  wall=10352ms gateway=  7044ms
call 3  wall= 6700ms gateway=  3377ms
call 4  wall=42429ms gateway= 39192ms   ← SPIKE
call 5  wall=14722ms gateway= 11417ms
call 6  wall= 9954ms gateway=  6640ms
call 7  wall=11541ms gateway=  8210ms
call 8  wall=32902ms gateway= 29511ms   ← SPIKE

Layer bisection (independent reproduction)

| Layer | 12 calls | Spikes |
|-------|----------|--------|
| Raw claude-code-skill session-send (tiny message) | 1.8–5.2 s | 0/12 |
| Raw session-send with a 54 KB user message | 5–45 s | 3/10 |
| Direct /v1/chat/completions POST, 3-tool payload | 2.4–6.4 s | 0/12 |
| Direct /v1/chat/completions POST, 93-tool payload | reproduces | ~1/4 |
| OpenClaw gateway → this proxy, 93-tool payload | reproduces | ~1/4 |

The trigger is the 54 KB tool block, not the number of sessions or proxy versus direct invocation. A raw session-send with a tiny message never spikes; the same CLI with a 54 KB user message does. The proxy adds one such block to every user turn.

MITM trace of a spike

Captured with a small HTTP MITM between OpenClaw gateway and the proxy:

#4 IN   size=165829 msgs=56 tools=93 stream=true
#4 HDR  +9ms status=200 type=text/event-stream
#4 END  total=44531ms ttfb=9ms chunks=5 firstTimings=[9,30010,44531]

firstTimings=[9, 30010, 44531] shows the second SSE chunk arrives at precisely 30 010 ms — that's the setInterval(..., 30_000) keepalive comment at openai-compat.ts:562, firing because the CLI produced zero output for 30 s. The real content chunk only arrives at 44.5 s when the CLI finally responds.
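The gap pattern in firstTimings can be checked mechanically. A minimal sketch (a hypothetical helper for analyzing traces, not part of the proxy) that flags inter-chunk gaps matching the 30 s keepalive interval:

```ts
// Flag which entries in a chunk-arrival trace (ms since request start)
// are most plausibly the 30 s keepalive comment rather than real output:
// a chunk whose gap from the previous chunk is ~keepaliveMs apart.
function isKeepaliveGap(
  timingsMs: number[],
  keepaliveMs = 30_000,
  toleranceMs = 100,
): boolean[] {
  return timingsMs.map((t, i) => {
    if (i === 0) return false; // first chunk is headers/TTFB, never a keepalive
    const gap = t - timingsMs[i - 1];
    return Math.abs(gap - keepaliveMs) <= toleranceMs;
  });
}

console.log(isKeepaliveGap([9, 30010, 44531])); // [ false, true, false ]
```

Applied to the trace above, only the middle chunk is flagged: the 30 001 ms gap is the keepalive timer, and the 44.5 s chunk is the first real content.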

After the fix (same config, 15-call bench, default mode)

call  1: wall= 9358ms gateway= 5947ms   ← cold (session create)
call  2: wall=10352ms gateway= 7044ms
call  3: wall= 6700ms gateway= 3377ms
…
call 14: wall= 6606ms gateway= 3212ms
call 15: wall= 7872ms gateway= 4472ms

Warm gateway time: 3.0–4.6 s, median ~3.3 s. Zero spikes across 30+ calls in two independent 15-call runs.

With opt-out (OPENAI_COMPAT_TOOLS_PER_MESSAGE=1)

Legacy behavior is faithfully restored:

call 1  wall= 24651ms gateway= 11478ms   ← cold
call 2  wall= 10865ms gateway=  7476ms
call 3  wall=  7726ms gateway=  4391ms
call 4  wall= 29005ms gateway= 25632ms   ← SPIKE
call 5  wall= 19337ms gateway= 15877ms
call 6  wall=  9498ms gateway=  6215ms
call 7  wall= 36590ms gateway= 33330ms   ← SPIKE
call 8  wall=  7547ms gateway=  4288ms

The fix

1. Move <available_tools> from user message → session system prompt (default)

Before (openai-compat.ts around line 571–576):

```ts
const hasTools = !!request.tools?.length;
if (hasTools) {
  const toolBlock = buildToolPromptBlock(request.tools);
  userMessage = `${toolBlock}\n\n${userMessage}`;
}
```

After:

```ts
// Default: tools already in system prompt at session-create time.
// Opt-out (OPENAI_COMPAT_TOOLS_PER_MESSAGE=1): inject per-turn.
// Non-claude: always inject per-turn.
const hasTools = !!request.tools?.length;
const injectToolsPerTurn = hasTools && (engine !== 'claude' || isToolsPerMessageModeEnabled());
if (injectToolsPerTurn) {
  const toolBlock = buildToolPromptBlock(request.tools);
  userMessage = `${toolBlock}\n\n${userMessage}`;
}
```

And in the session-create block (around line 519):

```ts
if (request.tools?.length) {
  const toolBlock = buildToolPromptBlock(request.tools);
  const systemWithTools = `${noToolsPromptInSystem}\n\n${toolBlock}`;
  sessionConfig.systemPrompt = extracted.systemPrompt
    ? `${systemWithTools}\n\n${extracted.systemPrompt}`
    : systemWithTools;
}
```

(Gated by isToolsPerMessageModeEnabled() so legacy callers can opt out.)

2. Include a tool fingerprint in resolveSessionKey (default)

Without this, two callers with the same system prompt but different tool lists would share a session whose system prompt was baked with the first caller's tools. The fingerprint is a short stable hash of toolName + descriptionPrefix joined across the tool array, so tool changes spawn a new session:

```ts
const toolsFingerprint = isToolsPerMessageModeEnabled()
  ? ''
  : (body.tools || [])
      .map((t) => `${t.function.name}:${(t.function.description || '').slice(0, 64)}`)
      .filter(Boolean)
      .join('|');
// hash: model + '\n' + system + '\n' + toolsFingerprint
```
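Putting the fingerprint and the hash comment together, the key derivation can be sketched like this (a hypothetical shape — the real resolveSessionKey signature, hash algorithm, and key length may differ):

```ts
import { createHash } from "node:crypto";

// Minimal shape of an OpenAI-style tool entry; only the fields the
// fingerprint reads are modeled here.
interface OpenAITool {
  function: { name: string; description?: string };
}

// Sketch of the session-key derivation: hash model + system prompt +
// tool fingerprint. In per-message (legacy) mode the fingerprint is
// empty, so all tool variants collapse to one session.
function sessionKey(
  model: string,
  system: string,
  tools: OpenAITool[],
  perMessageMode: boolean,
): string {
  const toolsFingerprint = perMessageMode
    ? ""
    : tools
        .map((t) => `${t.function.name}:${(t.function.description || "").slice(0, 64)}`)
        .filter(Boolean)
        .join("|");
  return createHash("sha256")
    .update(`${model}\n${system}\n${toolsFingerprint}`)
    .digest("hex")
    .slice(0, 16);
}

const a = sessionKey("claude-sonnet-4-6", "sys", [{ function: { name: "light.turn_on" } }], false);
const b = sessionKey("claude-sonnet-4-6", "sys", [{ function: { name: "light.turn_off" } }], false);
console.log(a !== b); // different tool lists → different sessions
```

This is what makes the tools=[X] → tools=[X,Y] behavior change below fall out naturally: the fingerprint differs, so a new session is created.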

Behavior change (explicit)

Default mode (env var unset) — new behavior:

  • Tools are embedded in the session system prompt at create time (stable, cacheable).
  • Session key fingerprints the tool list, so changing tools spawns a new session.
  • Consequence: a caller that calls with tools=[X] then tools=[X,Y] in the same conversation now gets two separate sessions, losing conversation history across the tool change. This is a change from prior behavior where the same session was reused and the new tool list was silently re-injected per turn.
  • This is semantically more correct (the new tools don't apply retroactively to history recorded before they existed) and is how most real-world callers use the endpoint (tool lists are stable per agent).
  • Eliminates the latency spike class documented above.

Legacy mode (OPENAI_COMPAT_TOOLS_PER_MESSAGE=1) — pre-fix behavior:

  • Tools injected into every user message (not system prompt).
  • Session key ignores tools (single session across tool changes).
  • Full backwards compatibility with callers relying on dynamic tool lists mid-session.
  • Subject to the original spike bug.

Behavior preserved

  • Non-Claude engines (Codex, Gemini, Cursor) unchanged — still receive the tool block per turn because their CLIs spawn fresh per message.
  • <tool_calls> response parsing unchanged — the model still sees <available_tools> and emits <tool_calls> tags the same way.
  • Sessions without tools — unchanged (appendSystemPrompt path).
  • Streaming + bufferedText parsing — unchanged.
  • Auto-compact (80% context threshold) — unchanged.
  • X-Session-Reset / isNewConversation ephemeral session cleanup — unchanged.
  • All 410 existing tests still pass.

Tests added

Four new tests in src/__tests__/openai-compat.test.ts:

resolveSessionKey (default mode):

  • distinct tool lists → distinct session keys (prevents stale tool-block reuse)
  • identical tool lists → deterministic session key (guarantees reuse when expected)

resolveSessionKey (opt-out mode):

  • OPENAI_COMPAT_TOOLS_PER_MESSAGE=1 collapses all tool variants to one session key

isToolsPerMessageModeEnabled:

  • parses 1, true, yes (case-insensitive, trimmed) as enabled; everything else as disabled
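The accepted values can be sketched as a standalone parser (a hypothetical re-implementation of isToolsPerMessageModeEnabled, taking the raw env value as a parameter for testability rather than reading process.env directly):

```ts
// Parse OPENAI_COMPAT_TOOLS_PER_MESSAGE: "1", "true", "yes"
// (case-insensitive, trimmed) enable legacy mode; anything else,
// including unset, leaves the default system-prompt mode active.
function parseToolsPerMessageFlag(raw: string | undefined): boolean {
  if (raw === undefined) return false;
  const v = raw.trim().toLowerCase();
  return v === "1" || v === "true" || v === "yes";
}

console.log(parseToolsPerMessageFlag(" TRUE ")); // true
console.log(parseToolsPerMessageFlag("0"));      // false
console.log(parseToolsPerMessageFlag(undefined)); // false
```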

Test plan

  • npm test — all 414 tests pass (4 new, 410 existing)
  • npm run build — clean TypeScript build
  • Installed into /opt/homebrew/lib/node_modules/@enderfga/openclaw-claude-code/dist/src/openai-compat.js
  • Restarted claude-code-skill serve process
  • Verified via patch log the new code path is exercised on each request (userMessage.length=36 instead of ~54 000)
  • Default mode: 8-call bench, 15-call bench × 2 — no spikes (30 calls total, median warm 3.3 s)
  • Opt-out mode: 8-call bench — spikes reproduced on calls 4 and 7, matching pre-fix behavior
  • Upstream CI

🤖 Generated with Claude Code

fix(openai-compat): move tool schemas to system prompt to eliminate per-turn latency spikes

Currently when a /v1/chat/completions request includes `tools`, the proxy
prepends a `<available_tools>` block to EVERY user message. For callers
with many tools (e.g. OpenClaw gateway routing 90+ MCP tools), this block
can be 50+ KB and is sent on every turn.

This causes a reproducible pattern of 30-50s latency spikes every ~4 calls
against otherwise-warm sessions. Isolated via 4-layer bisection:

  layer                                  | 12 calls  | spikes
  ---------------------------------------|-----------|-------
  Raw `session-send` (tiny message)      | 1.8-5.2s  | 0/12
  Raw `session-send` (54KB message)      | 5-45s     | 3/10
  Direct /v1/chat/completions, 3 tools   | 2.4-6.4s  | 0/12
  Direct /v1/chat/completions, 93 tools  | repro'd   | ~1/4

The 54 KB tool block is the trigger; the CLI subprocess hits periodic
slow paths (likely Anthropic prompt-cache miss + full re-tokenization)
when every user message carries it.

Fix: when `engine === 'claude'` and tools are provided, embed the
<available_tools> block in the session system prompt at session-create
time (via `--system-prompt`). User messages then stay small and stable,
letting Anthropic's prompt cache reliably hit the tool block.

This required one supporting change:
  - `resolveSessionKey` now fingerprints the tool list (tool names +
    description prefixes) alongside the system prompt. A caller swapping
    tool lists mid-conversation now lands in a new session instead of
    reusing a stale one whose system prompt was baked with the old tools.

Behavior for non-claude engines (Codex, Gemini, Cursor) is unchanged —
they still receive the tool block per turn because their CLIs are
spawned fresh per message with no persistent system prompt.

Behavior change — opt-out for callers who mutate tool lists mid-session:
Set `OPENAI_COMPAT_TOOLS_PER_MESSAGE=1` to restore the pre-fix behavior
of injecting the tool block into every user message and keying sessions
only by system prompt + model. Use this if you have callers that
dynamically change their tool list within a single conversation and rely
on continuing history across tool changes. The default (env unset) uses
the new system-prompt injection and eliminates latency spikes.

Measured impact (OpenClaw gateway + nora-oc agent, 93 HA/anomaly-rules
tools, 15-call bench, Anthropic streaming):

  before:                          7-10s warm, 40-50s spike every ~4 calls
  after (default):                 3.0-3.9s warm, 0 spikes in 30+ calls
  after (OPENAI_COMPAT_TOOLS_PER_MESSAGE=1):  matches pre-fix behavior (spikes restored)

Tests: all 410 existing tests pass. Four new tests verify:
  - resolveSessionKey produces distinct keys when same system prompt has
    different tool lists (default mode)
  - resolveSessionKey is deterministic for identical tool lists
  - OPENAI_COMPAT_TOOLS_PER_MESSAGE=1 collapses all tool variants to one
    session key (legacy opt-out mode)
  - isToolsPerMessageModeEnabled() correctly parses env values

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fcoppey force-pushed the fix/move-tools-to-system-prompt branch from 4cd812e to ac7350a on April 14, 2026 14:14
…add tests

- Extract `noToolsSystemPrompt(location)` factory to eliminate near-duplicate
  system prompt strings that differed by one phrase and would drift over time
- Extract `buildSessionSystemPrompt()` as an exported, testable helper that
  encapsulates the default vs legacy system prompt construction logic
- Simplify the session create block in handleChatCompletion to a single call
- Add tests for noToolsSystemPrompt, buildSessionSystemPrompt (default + legacy
  modes, with and without caller system prompt)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Enderfga Enderfga merged commit 23ce87b into Enderfga:main Apr 16, 2026