
fix(openai-compat): move tool schemas to system prompt to eliminate per-turn latency spikes#43

Merged
Enderfga merged 2 commits into Enderfga:main from fcoppey:fix/move-tools-to-system-prompt on Apr 16, 2026

Conversation

fcoppey (Contributor) commented on Apr 14, 2026

Summary

When a /v1/chat/completions request includes tools, the proxy currently prepends an <available_tools> block to every user message. For callers with many tools (e.g. OpenClaw gateway routing 90+ MCP tools from Home Assistant), this block can be 50+ KB and is sent on every turn.

This causes a reproducible pattern of 30–50 second latency spikes every ~4 calls against otherwise-warm sessions. The fix moves the tool block into the session system prompt at session-create time, so user messages stay small and Anthropic's prompt cache can reliably hit the tool definitions.

A fully backward-compatible opt-out env var (OPENAI_COMPAT_TOOLS_PER_MESSAGE=1) preserves the pre-fix behavior for callers who need to mutate their tool list within a single session.

Repro / Bisection

Using OpenClaw gateway (openclaw/openclaw) routed through claude-code-skill serve on port 18796, an agent with 93 MCP tools (home-assistant + anomaly-rules) on claude-sonnet-4-6:

call 1  wall= 9358ms gateway=  5947ms
call 2  wall=10352ms gateway=  7044ms
call 3  wall= 6700ms gateway=  3377ms
call 4  wall=42429ms gateway= 39192ms   ← SPIKE
call 5  wall=14722ms gateway= 11417ms
call 6  wall= 9954ms gateway=  6640ms
call 7  wall=11541ms gateway=  8210ms
call 8  wall=32902ms gateway= 29511ms   ← SPIKE

Layer bisection (independent reproduction)

| Layer | 12 calls | Spikes |
|-------|----------|--------|
| Raw claude-code-skill session-send (tiny message) | 1.8–5.2 s | 0/12 |
| Raw session-send with a 54 KB user message | 5–45 s | 3/10 |
| Direct /v1/chat/completions POST, 3-tool payload | 2.4–6.4 s | 0/12 |
| Direct /v1/chat/completions POST, 93-tool payload | reproduces | ~1/4 |
| OpenClaw gateway → this proxy, 93-tool payload | reproduces | ~1/4 |

The trigger is the 54 KB tool block, not the number of sessions or proxy versus direct invocation. A raw session-send with a tiny message never spikes; the same CLI with a 54 KB user message does. The proxy adds one such block to every user turn.

MITM trace of a spike

Captured with a small HTTP MITM between OpenClaw gateway and the proxy:

#4 IN   size=165829 msgs=56 tools=93 stream=true
#4 HDR  +9ms status=200 type=text/event-stream
#4 END  total=44531ms ttfb=9ms chunks=5 firstTimings=[9,30010,44531]

firstTimings=[9, 30010, 44531] shows the second SSE chunk arrives at precisely 30 010 ms — that's the setInterval(..., 30_000) keepalive comment at openai-compat.ts:562, firing because the CLI produced zero output for 30 s. The real content chunk only arrives at 44.5 s when the CLI finally responds.
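The gap pattern in firstTimings can be checked mechanically. A minimal sketch (a hypothetical helper for analyzing traces, not part of the proxy) that flags inter-chunk gaps matching the 30 s keepalive interval:

```ts
// Flag which entries in a chunk-arrival trace (ms since request start)
// are most plausibly the 30 s keepalive comment rather than real output:
// a chunk whose gap from the previous chunk is ~keepaliveMs apart.
function isKeepaliveGap(
  timingsMs: number[],
  keepaliveMs = 30_000,
  toleranceMs = 100,
): boolean[] {
  return timingsMs.map((t, i) => {
    if (i === 0) return false; // first chunk is headers/TTFB, never a keepalive
    const gap = t - timingsMs[i - 1];
    return Math.abs(gap - keepaliveMs) <= toleranceMs;
  });
}

console.log(isKeepaliveGap([9, 30010, 44531])); // [ false, true, false ]
```

Applied to the trace above, only the middle chunk is flagged: the 30 001 ms gap is the keepalive timer, and the 44.5 s chunk is the first real content.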

After the fix (same config, 15-call bench, default mode)

call  1: wall= 9358ms gateway= 5947ms   ← cold (session create)
call  2: wall=10352ms gateway= 7044ms
call  3: wall= 6700ms gateway= 3377ms
…
call 14: wall= 6606ms gateway= 3212ms
call 15: wall= 7872ms gateway= 4472ms

Warm gateway time: 3.0–4.6 s, median ~3.3 s. Zero spikes across 30+ calls in two independent 15-call runs.

With opt-out (OPENAI_COMPAT_TOOLS_PER_MESSAGE=1)

Legacy behavior is faithfully restored:

call 1  wall= 24651ms gateway= 11478ms   ← cold
call 2  wall= 10865ms gateway=  7476ms
call 3  wall=  7726ms gateway=  4391ms
call 4  wall= 29005ms gateway= 25632ms   ← SPIKE
call 5  wall= 19337ms gateway= 15877ms
call 6  wall=  9498ms gateway=  6215ms
call 7  wall= 36590ms gateway= 33330ms   ← SPIKE
call 8  wall=  7547ms gateway=  4288ms

The fix

1. Move <available_tools> from user message → session system prompt (default)

Before (openai-compat.ts around line 571–576):

```ts
const hasTools = !!request.tools?.length;
if (hasTools) {
  const toolBlock = buildToolPromptBlock(request.tools);
  userMessage = `${toolBlock}\n\n${userMessage}`;
}
```

After:

```ts
// Default: tools already in system prompt at session-create time.
// Opt-out (OPENAI_COMPAT_TOOLS_PER_MESSAGE=1): inject per-turn.
// Non-claude: always inject per-turn.
const hasTools = !!request.tools?.length;
const injectToolsPerTurn = hasTools && (engine !== 'claude' || isToolsPerMessageModeEnabled());
if (injectToolsPerTurn) {
  const toolBlock = buildToolPromptBlock(request.tools);
  userMessage = `${toolBlock}\n\n${userMessage}`;
}
```

And in the session-create block (around line 519):

```ts
if (request.tools?.length) {
  const toolBlock = buildToolPromptBlock(request.tools);
  const systemWithTools = `${noToolsPromptInSystem}\n\n${toolBlock}`;
  sessionConfig.systemPrompt = extracted.systemPrompt
    ? `${systemWithTools}\n\n${extracted.systemPrompt}`
    : systemWithTools;
}
```

(Gated by isToolsPerMessageModeEnabled() so legacy callers can opt out.)

2. Include a tool fingerprint in resolveSessionKey (default)

Without this, two callers with the same system prompt but different tool lists would share a session whose system prompt was baked with the first caller's tools. The fingerprint is a short stable hash of toolName + descriptionPrefix joined across the tool array, so tool changes spawn a new session:

```ts
const toolsFingerprint = isToolsPerMessageModeEnabled()
  ? ''
  : (body.tools || [])
      .map((t) => `${t.function.name}:${(t.function.description || '').slice(0, 64)}`)
      .filter(Boolean)
      .join('|');
// hash: model + '\n' + system + '\n' + toolsFingerprint
```
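Putting the fingerprint and the hash comment together, the key derivation can be sketched like this (a hypothetical shape — the real resolveSessionKey signature, hash algorithm, and key length may differ):

```ts
import { createHash } from "node:crypto";

// Minimal shape of an OpenAI-style tool entry; only the fields the
// fingerprint reads are modeled here.
interface OpenAITool {
  function: { name: string; description?: string };
}

// Sketch of the session-key derivation: hash model + system prompt +
// tool fingerprint. In per-message (legacy) mode the fingerprint is
// empty, so all tool variants collapse to one session.
function sessionKey(
  model: string,
  system: string,
  tools: OpenAITool[],
  perMessageMode: boolean,
): string {
  const toolsFingerprint = perMessageMode
    ? ""
    : tools
        .map((t) => `${t.function.name}:${(t.function.description || "").slice(0, 64)}`)
        .filter(Boolean)
        .join("|");
  return createHash("sha256")
    .update(`${model}\n${system}\n${toolsFingerprint}`)
    .digest("hex")
    .slice(0, 16);
}

const a = sessionKey("claude-sonnet-4-6", "sys", [{ function: { name: "light.turn_on" } }], false);
const b = sessionKey("claude-sonnet-4-6", "sys", [{ function: { name: "light.turn_off" } }], false);
console.log(a !== b); // different tool lists → different sessions
```

This is what makes the tools=[X] → tools=[X,Y] behavior change below fall out naturally: the fingerprint differs, so a new session is created.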

Behavior change (explicit)

Default mode (env var unset) — new behavior:

  • Tools are embedded in the session system prompt at create time (stable, cacheable).
  • Session key fingerprints the tool list, so changing tools spawns a new session.
  • Consequence: a caller that calls with tools=[X] then tools=[X,Y] in the same conversation now gets two separate sessions, losing conversation history across the tool change. This is a change from prior behavior where the same session was reused and the new tool list was silently re-injected per turn.
  • This is semantically more correct (the new tools don't apply retroactively to history recorded before they existed) and is how most real-world callers use the endpoint (tool lists are stable per agent).
  • Eliminates the latency spike class documented above.

Legacy mode (OPENAI_COMPAT_TOOLS_PER_MESSAGE=1) — pre-fix behavior:

  • Tools injected into every user message (not system prompt).
  • Session key ignores tools (single session across tool changes).
  • Full backwards compatibility with callers relying on dynamic tool lists mid-session.
  • Subject to the original spike bug.

Behavior preserved

  • Non-Claude engines (Codex, Gemini, Cursor) unchanged — still receive the tool block per turn because their CLIs spawn fresh per message.
  • <tool_calls> response parsing unchanged — the model still sees <available_tools> and emits <tool_calls> tags the same way.
  • Sessions without tools — unchanged (appendSystemPrompt path).
  • Streaming + bufferedText parsing — unchanged.
  • Auto-compact (80% context threshold) — unchanged.
  • X-Session-Reset / isNewConversation ephemeral session cleanup — unchanged.
  • All 410 existing tests still pass.

Tests added

Four new tests in src/__tests__/openai-compat.test.ts:

resolveSessionKey (default mode):

  • distinct tool lists → distinct session keys (prevents stale tool-block reuse)
  • identical tool lists → deterministic session key (guarantees reuse when expected)

resolveSessionKey (opt-out mode):

  • OPENAI_COMPAT_TOOLS_PER_MESSAGE=1 collapses all tool variants to one session key

isToolsPerMessageModeEnabled:

  • parses 1, true, yes (case-insensitive, trimmed) as enabled; everything else as disabled
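The accepted values can be sketched as a standalone parser (a hypothetical re-implementation of isToolsPerMessageModeEnabled, taking the raw env value as a parameter for testability rather than reading process.env directly):

```ts
// Parse OPENAI_COMPAT_TOOLS_PER_MESSAGE: "1", "true", "yes"
// (case-insensitive, trimmed) enable legacy mode; anything else,
// including unset, leaves the default system-prompt mode active.
function parseToolsPerMessageFlag(raw: string | undefined): boolean {
  if (raw === undefined) return false;
  const v = raw.trim().toLowerCase();
  return v === "1" || v === "true" || v === "yes";
}

console.log(parseToolsPerMessageFlag(" TRUE ")); // true
console.log(parseToolsPerMessageFlag("0"));      // false
console.log(parseToolsPerMessageFlag(undefined)); // false
```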

Test plan

  • npm test — all 414 tests pass (4 new, 410 existing)
  • npm run build — clean TypeScript build
  • Installed into /opt/homebrew/lib/node_modules/@enderfga/openclaw-claude-code/dist/src/openai-compat.js
  • Restarted claude-code-skill serve process
  • Verified via patch log the new code path is exercised on each request (userMessage.length=36 instead of ~54 000)
  • Default mode: 8-call bench, 15-call bench × 2 — no spikes (30 calls total, median warm 3.3 s)
  • Opt-out mode: 8-call bench — spikes reproduced on calls 4 and 7, matching pre-fix behavior
  • Upstream CI

🤖 Generated with Claude Code

fix(openai-compat): move tool schemas to system prompt to eliminate per-turn latency spikes

Currently when a /v1/chat/completions request includes `tools`, the proxy
prepends a `<available_tools>` block to EVERY user message. For callers
with many tools (e.g. OpenClaw gateway routing 90+ MCP tools), this block
can be 50+ KB and is sent on every turn.

This causes a reproducible pattern of 30-50s latency spikes every ~4 calls
against otherwise-warm sessions. Isolated via 4-layer bisection:

  layer                                  | 12 calls  | spikes
  ---------------------------------------|-----------|-------
  Raw `session-send` (tiny message)      | 1.8-5.2s  | 0/12
  Raw `session-send` (54KB message)      | 5-45s     | 3/10
  Direct /v1/chat/completions, 3 tools   | 2.4-6.4s  | 0/12
  Direct /v1/chat/completions, 93 tools  | repro'd   | ~1/4

The 54 KB tool block is the trigger; the CLI subprocess hits periodic
slow paths (likely Anthropic prompt-cache miss + full re-tokenization)
when every user message carries it.

Fix: when `engine === 'claude'` and tools are provided, embed the
<available_tools> block in the session system prompt at session-create
time (via `--system-prompt`). User messages then stay small and stable,
letting Anthropic's prompt cache reliably hit the tool block.

This required one supporting change:
  - `resolveSessionKey` now fingerprints the tool list (tool names +
    description prefixes) alongside the system prompt. A caller swapping
    tool lists mid-conversation now lands in a new session instead of
    reusing a stale one whose system prompt was baked with the old tools.

Behavior for non-claude engines (Codex, Gemini, Cursor) is unchanged —
they still receive the tool block per turn because their CLIs are
spawned fresh per message with no persistent system prompt.

Behavior change — opt-out for callers who mutate tool lists mid-session:
Set `OPENAI_COMPAT_TOOLS_PER_MESSAGE=1` to restore the pre-fix behavior
of injecting the tool block into every user message and keying sessions
only by system prompt + model. Use this if you have callers that
dynamically change their tool list within a single conversation and rely
on continuing history across tool changes. The default (env unset) uses
the new system-prompt injection and eliminates latency spikes.

Measured impact (OpenClaw gateway + nora-oc agent, 93 HA/anomaly-rules
tools, 15-call bench, Anthropic streaming):

  before:                          7-10s warm, 40-50s spike every ~4 calls
  after (default):                 3.0-3.9s warm, 0 spikes in 30+ calls
  after (OPENAI_COMPAT_TOOLS_PER_MESSAGE=1):  matches pre-fix behavior (spikes restored)

Tests: all 410 existing tests pass. Four new tests verify:
  - resolveSessionKey produces distinct keys when same system prompt has
    different tool lists (default mode)
  - resolveSessionKey is deterministic for identical tool lists
  - OPENAI_COMPAT_TOOLS_PER_MESSAGE=1 collapses all tool variants to one
    session key (legacy opt-out mode)
  - isToolsPerMessageModeEnabled() correctly parses env values

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fcoppey force-pushed the fix/move-tools-to-system-prompt branch from 4cd812e to ac7350a on April 14, 2026 14:14
…add tests

- Extract `noToolsSystemPrompt(location)` factory to eliminate near-duplicate
  system prompt strings that differed by one phrase and would drift over time
- Extract `buildSessionSystemPrompt()` as an exported, testable helper that
  encapsulates the default vs legacy system prompt construction logic
- Simplify the session create block in handleChatCompletion to a single call
- Add tests for noToolsSystemPrompt, buildSessionSystemPrompt (default + legacy
  modes, with and without caller system prompt)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Enderfga Enderfga merged commit 23ce87b into Enderfga:main Apr 16, 2026