feat(pflash): compress role=tool messages in place for agent transcripts #75
easel wants to merge 1 commit into Luce-Org:main from
Conversation
Multi-turn agent transcripts (Claude Code, OpenAI tool-call agent
loops, etc.) put 99% of their bytes in role=tool messages — Read /
Bash / Grep output and the like, accumulating turn over turn. The
existing pflash hook only compresses the last user message, which in
agent loops is a short typed prompt like "continue" or "what next".
Compressing that string is a no-op while the bulk sits in tool
messages that pflash leaves alone.
This change extends `_maybe_compress_tool_chat` to compress every
role=tool message in place, preserving role, tool_call_id,
assistant.tool_calls, and the function.arguments parsing — the OpenAI
tool structure stays intact end-to-end. When there are no tool
messages, the existing RAG / single-shot path (compress the last user
message) is kept as a fallback.
A new helper `compress_texts_batch_via_daemon` does the daemon dance
once for N texts: park target+draft → load drafter → compress each
text reusing the loaded drafter → free drafter once → unpark. The
existing `compress_text_via_daemon` is unchanged.
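A minimal sketch of that dance, with a hypothetical `_daemon_cmd` standing in for whatever RPC wrapper `_prefill_hook.py` already uses (the command names mirror the steps above, not the daemon's literal wire protocol):

```python
def _daemon_cmd(cmd: str, **kwargs):
    """Hypothetical stand-in for the daemon transport used by _prefill_hook.py."""
    raise NotImplementedError("wire this to the real daemon RPC")

def compress_texts_batch_via_daemon(prompt_texts: list) -> list:
    """One daemon dance for N texts instead of N full dances."""
    compressed = []
    _daemon_cmd("park")              # park target+draft once, up front
    try:
        _daemon_cmd("load_drafter")  # drafter stays resident for all N texts
        for text in prompt_texts:
            # each compress reuses the already-loaded drafter
            compressed.append(_daemon_cmd("compress", text=text))
    finally:
        _daemon_cmd("free_drafter")  # free once, after the last compress
        _daemon_cmd("unpark")        # restore target+draft
    return compressed
```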
Validated on a 3-turn replay of a real Claude Code session
(helix-large, ~38K char prefix per call):

- call 1 (no tools): cold 0.96s → warm 4.71s (regression is pflash dance overhead on a 249-char last_user fallback; masked under the default `--prefill-threshold 32000` because the full prompt doesn't cross the gate — the bench used `--threshold 1`)
- call 2 (2 tool msgs, 38K): cold 16.47s → warm 7.59s, 2.17× TTFT
- call 3 (3 tool msgs, 38K): cold 17.71s → warm 8.14s, 2.18× TTFT
Per-message compression on those tool messages: 6482 → 306 tokens
(21×), 5018 → 218 tokens (23×), 78 → 14 tokens (5.5×). Generation
quality preserved — every warm call returned 64 real completion
tokens, no empty bodies, OpenAI tool structure intact through the
chat template (Qwen3 tool-call + tool-result rendering).
Cost: one drafter load per request (~0.5s on hot cache), then
N compresses inside the same daemon dance. Concurrent compresses
share the loaded drafter weights — the daemon's `compress` command
is idempotent on a loaded drafter.
Tested on RTX 3090 Ti, CUDA 13.2, Qwen3.6-27B Q4_K_M target +
Qwen3-0.6B BF16 drafter, `DFLASH_FP_USE_BSA=1 DFLASH_FP_ALPHA=0.85
keep_ratio=0.05`.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Follow-up: combining this patch with PR #59 (prefix cache) anti-stacks on agent loops. Stacked the two on
Mechanism (verified via warm log):
This isn't a bug in either PR — both do what they advertise on the workloads they were designed for. It's a composition problem: they target the same prefill latency, and pflash gets there first. For agent loops specifically, cache-only (#59) at 3.08× is the winner for this hardware/workload. Pflash remains the right tool for single-shot long-prompt RAG (validated separately at 128K NIAH on the same daemon: 26.1s vs 24.8s headline, within 5%). If composition matters for production, two architectural sketches that would actually stack:
Posting this here mainly as data — happy to send a sketch PR for #1 (cache-first) on top of this one if it's interesting; #2 is bigger and probably belongs on the prefix-cache side.
Extends `_maybe_compress_tool_chat` so pflash compresses every `role: tool` message in place, instead of only the last user message. Targets the multi-turn agent-loop workload that #70's hook bypasses today.

## Why
In Claude Code / OpenAI tool-call agent loops the prefix grows turn-over-turn through `role: tool` messages — Read / Bash / Grep output. The user-typed messages are short ("continue", "what next"). On a real 70K-char Claude Code transcript:

The current hook compresses only `last_user.content` (249 chars in this example). With `--prefill-compression always --prefill-threshold 1` it still does the full daemon dance and leaves the 69K of tool output alone. Result: pflash adds latency and gives no benefit on agent transcripts.

## What
Two-file change:

- `dflash/scripts/_prefill_hook.py` — new helper `compress_texts_batch_via_daemon(prompt_texts: list)`. One daemon dance for N texts: park target+draft → load drafter (first compress) → run compress for each text reusing the loaded drafter → free drafter + unpark target+draft once. The existing `compress_text_via_daemon` is unchanged.
- `dflash/scripts/server_tools.py` — `_maybe_compress_tool_chat` now (a sketch of this pass follows the list):
  - Where a `role: tool` message has string content, batch-compresses them all in place. Each tool message keeps its `role`, `tool_call_id`, and the assistant `tool_calls` that pair with it.
  - `assistant.tool_calls[i].function.arguments` is JSON-parsed (matching the canonical `_tokenize_prompt` path) before going to the chat template — Qwen's template calls `.items()` on it.
  - The `should_compress(prompt_len)` gate, the `req.tools` bypass, and all other guards are preserved.
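A sketch of that in-place pass, assuming OpenAI-style message dicts and the batch helper sketched earlier; the function name and scaffolding here are illustrative, not the PR's literal code:

```python
import json

def _compress_tool_messages_in_place(messages):
    # role=tool messages with plain string content are the compression targets
    idxs = [i for i, m in enumerate(messages)
            if m.get("role") == "tool" and isinstance(m.get("content"), str)]
    if not idxs:
        return False  # caller falls back to the existing last-user path

    texts = [messages[i]["content"] for i in idxs]
    for i, short in zip(idxs, compress_texts_batch_via_daemon(texts)):
        messages[i]["content"] = short  # role and tool_call_id stay untouched

    for m in messages:
        # Qwen's chat template calls .items() on function.arguments, so
        # JSON-decode string arguments the way the canonical tokenize path does.
        for call in m.get("tool_calls") or []:
            args = call.get("function", {}).get("arguments")
            if isinstance(args, str):
                call["function"]["arguments"] = json.loads(args)
    return True
```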
## Validated

10-turn replay of a real Claude Code session (helix-large), agent-loop pattern.
RTX 3090 Ti, CUDA 13.2, Qwen3.6-27B Q4_K_M target, Qwen3-0.6B BF16 drafter, `DFLASH_FP_USE_BSA=1 DFLASH_FP_ALPHA=0.85 keep_ratio=0.05`, `--max-ctx 24576`; prefix grows 327 → 71K chars over the 10 calls.

### 3-turn pilot, per-call breakdown
Same hardware/config, helix turns 1–3 (numbers as reported in the summary above):

| call | tool msgs | prefix | cold TTFT | warm TTFT | speedup |
|------|-----------|--------|-----------|-----------|---------|
| 1 | 0 | 327 chars | 0.96s | 4.71s | regression (see caveat below) |
| 2 | 2 | ~38K chars | 16.47s | 7.59s | 2.17× |
| 3 | 3 | ~38K chars | 17.71s | 8.14s | 2.18× |
Per-tool-message compression on call 3: `6482 → 306` (21×), `5018 → 218` (23×), `78 → 14` (5.5×). Generation quality preserved — every warm call returned real completion tokens, no empty bodies, OpenAI tool structure intact through the chat template.

### Call-1 caveat
The pilot's call 1 (no tool messages, 327-char prompt) regressed because pflash still does the daemon dance to compress a 249-char last_user fallback. This is masked under the default `--prefill-threshold 32000` — short prompts don't cross the gate, so pflash bypasses entirely. The bench used `--threshold 1` to force activation on every call. Worth noting; not worth a special case in this PR.
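A sketch of the masking, assuming `should_compress` is a plain length gate (whether the threshold counts characters or tokens is an assumption here; the 327-char vs 32000 numbers above read as characters):

```python
def should_compress(prompt_len: int, threshold: int = 32000) -> bool:
    # Default --prefill-threshold 32000: a 327-char agent prompt never
    # crosses the gate, so pflash bypasses entirely and pays no dance
    # overhead. The pilot's --threshold 1 forces every call through,
    # which is what exposes the call-1 regression.
    return prompt_len >= threshold
```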
## Compatibility

- `--prefill-compression off` is unchanged. No behaviour change for anyone not opting in.
- The `req.tools` non-empty bypass is unchanged.
- `role: tool` messages keep their `tool_call_id`; assistant `tool_calls` are passed through with the same arg-parsing the canonical tokenize path uses (concrete example below).
- The daemon's `compress` command is idempotent on a loaded drafter, so the batch helper can safely issue N `compress` commands within one park/unpark window.
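For concreteness, a hypothetical one-exchange transcript showing what survives the pass (names and contents illustrative):

```python
# Before: the tool message carries the raw Read output (thousands of tokens).
messages = [
    {"role": "assistant", "tool_calls": [{
        "id": "call_1", "type": "function",
        "function": {"name": "Read", "arguments": '{"path": "src/main.py"}'},
    }]},
    {"role": "tool", "tool_call_id": "call_1", "content": "<raw file text>"},
]
# After the in-place pass: only the tool message's content is replaced by the
# compressed text; role, tool_call_id, and the paired assistant tool_calls
# survive, and function.arguments is a parsed dict by the time the chat
# template calls .items() on it.
```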
## Test plan

## Caveats
🤖 Generated with Claude Code