
feat(pflash): compress role=tool messages in place for agent transcripts#75

Draft
easel wants to merge 1 commit into Luce-Org:main from easel:feat/pflash-compress-tool-messages

Conversation

easel commented May 1, 2026

Extends _maybe_compress_tool_chat so pflash compresses every role: tool message in place, instead of only the last user message. Targets the multi-turn agent-loop workload that #70's hook bypasses today.

Why

In Claude Code / OpenAI tool-call agent loops the prefix grows turn-over-turn through role: tool messages — Read / Bash / Grep output. The user-typed messages are short ("continue", "what next"). On a real 70K-char Claude Code transcript:

| role | count | chars | % of bytes |
|---|---|---|---|
| system | 1 | 78 | 0.1% |
| user | 1 | 249 | 0.4% |
| assistant | 9 | 255 | 0.4% |
| tool | 10 | 69,453 | 99% |

The current hook compresses only last_user.content (249 chars in this example). With --prefill-compression always --prefill-threshold 1 it still does the full daemon dance but leaves the 69K chars of tool output untouched. Result: pflash adds latency and gives no benefit on agent transcripts.

What

Two-file change:

dflash/scripts/_prefill_hook.py — new helper compress_texts_batch_via_daemon(prompt_texts: list). One daemon dance for N texts: park target+draft → load drafter (first compress) → run compress for each text reusing the loaded drafter → free drafter + unpark target+draft once. The existing compress_text_via_daemon is unchanged.
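
A minimal sketch of the shape this helper could take. The sequence is the one described above, but the `_daemon` transport, command strings, and return types are placeholders, not the real dflash daemon protocol:

```python
from typing import List

def _daemon(cmd: str, **kwargs) -> str:
    """Placeholder for the existing daemon transport; not the real dflash API."""
    raise NotImplementedError

def compress_texts_batch_via_daemon(prompt_texts: List[str]) -> List[str]:
    """One park/unpark window for N texts, reusing a single drafter load."""
    compressed: List[str] = []
    _daemon("park", models=["target", "draft"])        # park target+draft once
    try:
        _daemon("load_drafter")                        # drafter loaded once, for the first compress
        for text in prompt_texts:
            # every subsequent compress reuses the already-loaded drafter weights
            compressed.append(_daemon("compress", text=text))
    finally:
        _daemon("free_drafter")                        # free drafter once
        _daemon("unpark", models=["target", "draft"])  # unpark target+draft once
    return compressed
```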

dflash/scripts/server_tools.py — _maybe_compress_tool_chat now (see the sketch after this list):

  1. If any role: tool message has string content, batch-compress them all in place. Each tool message keeps its role, tool_call_id, and the assistant tool_calls that pair with it. assistant.tool_calls[i].function.arguments is JSON-parsed (matching the canonical _tokenize_prompt path) before going to the chat template — Qwen's template calls .items() on it.
  2. Else fall back to the existing last-user RAG path unchanged.
  3. The should_compress(prompt_len) gate, the req.tools bypass, and all other guards are preserved.
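
A rough sketch of that flow, using the PR's names where they are known (should_compress, compress_texts_batch_via_daemon) and made-up ones elsewhere (_compress_last_user, the message handling details); the real server_tools.py code will differ in detail:

```python
import json

def _maybe_compress_tool_chat(messages, prompt_len, req_tools):
    # Illustrative shape only; should_compress, compress_texts_batch_via_daemon,
    # and _compress_last_user are assumed to be in scope.
    if req_tools or not should_compress(prompt_len):        # existing guards preserved
        return messages

    tool_msgs = [m for m in messages
                 if m.get("role") == "tool" and isinstance(m.get("content"), str)]
    if not tool_msgs:
        return _compress_last_user(messages)                # existing RAG fallback, unchanged

    # Batch-compress every tool output in one daemon park/unpark window.
    compressed = compress_texts_batch_via_daemon([m["content"] for m in tool_msgs])
    for msg, short in zip(tool_msgs, compressed):
        msg["content"] = short                              # role / tool_call_id untouched

    # Parse assistant tool_call arguments so the chat template can call .items().
    for m in messages:
        for call in m.get("tool_calls") or []:
            args = call["function"]["arguments"]
            if isinstance(args, str):
                call["function"]["arguments"] = json.loads(args)
    return messages
```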

Validated

10-turn replay of a real Claude Code session (helix-large), agent-loop pattern

RTX 3090 Ti, CUDA 13.2, Qwen3.6-27B Q4_K_M target, Qwen3-0.6B BF16 drafter, DFLASH_FP_USE_BSA=1 DFLASH_FP_ALPHA=0.85 keep_ratio=0.05, --max-ctx 24576, prefix grows 327 → 71K chars over the 10 calls.

| metric | cold (no pflash) | warm (pflash + this patch) | speedup |
|---|---|---|---|
| total TTFT (10 calls) | 243.70 s | 99.89 s | 2.44× |
| total wall (10 calls) | 255.84 s | 108.88 s | 2.35× |
| empty responses | 0 / 10 | 0 / 10 | |

3-turn pilot, per-call breakdown

Same hardware/config, helix turns 1–3:

| call | content shape | cold TTFT | warm TTFT | speedup | tok |
|---|---|---|---|---|---|
| 1 | 327 ch, no tool msgs | 0.96 s | 4.71 s | 0.20× ⚠ | 64 |
| 2 | 38K ch, 2 tool msgs | 16.47 s | 7.59 s | 2.17× | 64 |
| 3 | 38K ch, 3 tool msgs | 17.71 s | 8.14 s | 2.18× | 47 |
| total | | 35.14 s | 20.43 s | 1.72× | |

Per-tool-message compression on call 3: 6482 → 306 (21×), 5018 → 218 (23×), 78 → 14 (5.5×). Generation quality preserved — every warm call returned real completion tokens, no empty bodies, OpenAI tool structure intact through the chat template.

Call-1 caveat

The pilot's call 1 (no tool messages, 327-char prompt) regressed because pflash still does the daemon dance to compress a 249-char last_user fallback. This is masked under the default --prefill-threshold 32000 — short prompts don't cross the gate, so pflash bypasses entirely. The bench used --threshold 1 to force activation on every call. Worth noting; not worth a special case in this PR.
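
For what it's worth, the gate being described is effectively a length check; a guess at its shape, where the real should_compress and the units of prompt_len may differ:

```python
def should_compress(prompt_len: int, threshold: int = 32000) -> bool:
    # With the default --prefill-threshold 32000 a 327-char call never crosses
    # the gate, so pflash bypasses entirely; the bench forced --threshold 1.
    return prompt_len >= threshold
```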

Compatibility

  • Default --prefill-compression off is unchanged. No behaviour change for anyone not opting in.
  • req.tools non-empty bypass is unchanged.
  • The RAG / single-shot path (long last user message, no tool messages) is unchanged.
  • Tool semantics preserved end-to-end: role: tool messages keep their tool_call_id; assistant tool_calls are passed through with the same arg-parsing the canonical tokenize path uses.
  • Daemon protocol is unchanged. The new helper just sequences existing compress commands within one park/unpark window.

Test plan

  • 3-turn pilot on real Claude Code transcript with tool messages (per-call numbers above)
  • 10-turn helix-large run (totals above)
  • Single-shot RAG fallback path still works (no tool messages → compresses last user, unchanged)
  • Sweep across smaller sessions (lucebox ≤11K, nexiq ≤25K, axon ≤39K) — first attempt was lost to a logging collision in the bench harness; will rerun and post numbers as a comment. Predicted: smaller sessions break even or slightly regress (overhead dominates) while medium+ shows the win.
  • NIAH single-shot bench still works post-patch (should — fallback path is preserved verbatim)
  • Validate on Blackwell / sm_120 (only tested on RTX 3090 Ti / sm_86)

Caveats

  • Per-tool-message compression means N daemon round-trips inside one drafter session. Each message at ~6K tokens compresses in ~0.4 s on a 3090 Ti, so 10 tool messages = ~4 s of scoring on top of the ~0.5 s drafter-load amortised once. For sessions with many small tool messages this is a tax; could be reduced by a min-size threshold per-message ("don't compress tool messages under 1 K chars") in a follow-up.
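
That follow-up could be as small as a per-message floor applied before batching; the name and value below are illustrative, not part of this PR:

```python
MIN_TOOL_MSG_CHARS = 1024  # hypothetical per-message floor for a follow-up

def worth_compressing(msg: dict) -> bool:
    # Skip tiny tool outputs where the daemon round-trip costs more than it saves.
    content = msg.get("content")
    return isinstance(content, str) and len(content) >= MIN_TOOL_MSG_CHARS
```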

🤖 Generated with Claude Code

Multi-turn agent transcripts (Claude Code, OpenAI tool-call agent
loops, etc.) put 99% of their bytes in role=tool messages — Read /
Bash / Grep output and the like, accumulating turn-over-turn. The
existing pflash hook only compresses the last user message, which in
agent loops is a short typed prompt like "continue" or "what next".
Compressing that string is a no-op while the bulk sits in tool
messages pflash leaves alone.

This change extends `_maybe_compress_tool_chat` to compress every
role=tool message in place, preserving role / tool_call_id /
assistant.tool_calls / function.arguments parsing — the OpenAI tool
structure stays intact end-to-end. When there are no tool messages
the existing RAG / single-shot path (compress last user) is kept as
a fallback.

A new helper `compress_texts_batch_via_daemon` does the daemon dance
once for N texts: park target+draft → load drafter → compress each
text reusing the loaded drafter → free drafter once → unpark. The
existing `compress_text_via_daemon` is unchanged.

Validated on a 3-turn replay of a real Claude Code session
(helix-large, ~38K char prefix per call):

  call 1 (no tools):           cold 0.96s → warm 4.71s  (regression
    is pflash dance overhead on a 249-char last_user fallback;
    masked under default `--prefill-threshold 32000` because the
    full prompt doesn't cross the gate. Bench used `--threshold 1`)
  call 2 (2 tool msgs, 38K):   cold 16.47s → warm 7.59s   2.17x TTFT
  call 3 (3 tool msgs, 38K):   cold 17.71s → warm 8.14s   2.18x TTFT

Per-message compression on those tool messages: 6482 → 306 tokens
(21x), 5018 → 218 tokens (23x), 78 → 14 tokens (5.5x). Generation
quality preserved — every warm call returned real completion
tokens, no empty bodies, OpenAI tool structure intact through the
chat template (Qwen3 tool-call + tool-result rendering).

Cost: one drafter load per request (~0.5s on hot cache), then
N compresses inside the same daemon dance. Concurrent compresses
share the loaded drafter weights — the daemon's `compress` command
is idempotent on a loaded drafter.

Tested on RTX 3090 Ti, CUDA 13.2, Qwen3.6-27B Q4_K_M target +
Qwen3-0.6B BF16 drafter, `DFLASH_FP_USE_BSA=1 DFLASH_FP_ALPHA=0.85
keep_ratio=0.05`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

easel commented May 1, 2026

Follow-up: combining this patch with PR #59 (prefix cache) anti-stacks on agent loops.

Stacked the two on feat/cache-plus-pflash (this branch's commit + #59 + the cuda VMM pool-extension fix from easel/llama.cpp-dflash-ggml@fix/cuda-vmm-pool-extension-race) and re-ran the same 10-turn helix transcript. The combined stack works correctly — 10/10 warm calls return real content, no empties, daemon command surface is fine — but the speedup is lower than either layer alone:

| stack | TTFT cold | TTFT warm | TTFT × | wall × | warm time |
|---|---|---|---|---|---|
| cache only (#59) | 241.7 s | 78.6 s | 3.08× | 1.79× | 78.6 s |
| pflash only (this PR) | 243.7 s | 99.9 s | 2.44× | 2.35× | 99.9 s |
| combined | 237.4 s | 109.3 s | 2.17× | 2.03× | 109.3 s |

Mechanism (verified via warm log):

  • Both layers fire correctly. 56 [compress] N → M events across the 10 calls, plus 9/10 [pc] lookup hit events. Tool messages compress 21–23×, cache populates and rotates LRU between slots normally.
  • They compete for the same prefill latency. Pflash already shrinks the per-call prefill from O(uncompressed_prompt) to O(compressed_prompt + scoring). The cache then RESTORE-skips a prefix of that compressed prompt — but the absolute bytes saved per hit are now much smaller than in the cache-only run (where each hit skipped prefill of a full uncompressed prompt).
  • Pflash's per-call dance overhead is fixed (park target/draft + drafter load + N compresses + free + unpark = ~3–5 s on a 24 GB card). Across 10 calls that's a constant ~40 s of work that the cache can't amortize away because the cache only kicks in after the dance.
  • Net result: pflash steals the prefill work the cache would have saved, then adds its own per-call cost. The two layers each do exactly what they claim, but the marginal value of the second one is negative.

This isn't a bug in either PR — both do what they advertise on workloads they were designed for. It's a composition problem: they target the same prefill latency, and pflash gets there first.

For agent loops specifically, cache-only (#59) at 3.08× is the winner for this hardware/workload. Pflash remains the right tool for single-shot long-prompt RAG (validated separately at 128K NIAH on the same daemon: 26.1 s vs 24.8 s headline = within 5 %).

If composition matters for production, two architectural sketches that would actually stack:

  1. Cache-first: on lookup hit, skip the compress dance entirely (if hit: just RESTORE, no pflash). Compress only on cache miss (sketched below).
  2. Cache key on uncompressed prefix: hash the original token IDs (pre-pflash), have the daemon RESTORE on the cached snapshot, then run pflash on just the new turn's tool messages. Requires daemon-side cooperation between the two paths.
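
A control-flow sketch of option 1, where every name (cache.lookup, daemon.restore, prefill_from, ...) is a placeholder rather than the #59 or pflash API:

```python
def handle_prefill(req, cache, daemon):
    # Cache-first composition: each call pays for either the cache or pflash, never both.
    hit = cache.lookup(req.prompt_token_ids)
    if hit is not None:
        daemon.restore(hit.snapshot)                  # cache hit: RESTORE only, no compress dance
        return prefill_from(req.messages, start=hit.prefix_len)
    msgs = maybe_compress_tool_chat(req.messages)     # cache miss: run the pflash pass once
    result = prefill_from(msgs, start=0)
    cache.store(req.prompt_token_ids, result.snapshot)
    return result
```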

Posting this here mainly as data — happy to send a sketch PR for #1 (cache-first) on top of this one if it's interesting; #2 is bigger and probably belongs on the prefix-cache side.
