
feat(pflash): compress role=tool messages in place for agent transcripts#75

Draft
easel wants to merge 1 commit into Luce-Org:main from easel:feat/pflash-compress-tool-messages

Conversation

easel commented May 1, 2026

Extends _maybe_compress_tool_chat so pflash compresses every role: tool message in place, instead of only the last user message. Targets the multi-turn agent-loop workload that #70's hook bypasses today.

Why

In Claude Code / OpenAI tool-call agent loops the prefix grows turn-over-turn through role: tool messages — Read / Bash / Grep output. The user-typed messages are short ("continue", "what next"). On a real 70K-char Claude Code transcript:

| role | count | chars | % of bytes |
|---|---|---|---|
| system | 1 | 78 | 0.1% |
| user | 1 | 249 | 0.4% |
| assistant | 9 | 255 | 0.4% |
| tool | 10 | 69,453 | 99% |

The current hook compresses only last_user.content (249 chars in this example). With --prefill-compression always --prefill-threshold 1 it still does the full daemon dance but leaves the 69K chars of tool output untouched. Result: pflash adds latency and gives no benefit on agent transcripts.

What

Two-file change:

dflash/scripts/_prefill_hook.py — new helper compress_texts_batch_via_daemon(prompt_texts: list). One daemon dance for N texts: park target+draft → load drafter (first compress) → run compress for each text reusing the loaded drafter → free drafter + unpark target+draft once. The existing compress_text_via_daemon is unchanged.
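
A minimal sketch of the shape this helper could take. The sequence is the one described above, but the `_daemon` transport, command strings, and return types are placeholders, not the real dflash daemon protocol:

```python
from typing import List

def _daemon(cmd: str, **kwargs) -> str:
    """Placeholder for the existing daemon transport; not the real dflash API."""
    raise NotImplementedError

def compress_texts_batch_via_daemon(prompt_texts: List[str]) -> List[str]:
    """One park/unpark window for N texts, reusing a single drafter load."""
    compressed: List[str] = []
    _daemon("park", models=["target", "draft"])        # park target+draft once
    try:
        _daemon("load_drafter")                        # drafter loaded once, for the first compress
        for text in prompt_texts:
            # every subsequent compress reuses the already-loaded drafter weights
            compressed.append(_daemon("compress", text=text))
    finally:
        _daemon("free_drafter")                        # free drafter once
        _daemon("unpark", models=["target", "draft"])  # unpark target+draft once
    return compressed
```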

dflash/scripts/server_tools.py — _maybe_compress_tool_chat now (see the sketch after this list):

  1. If any role: tool message has string content, batch-compress them all in place. Each tool message keeps its role, tool_call_id, and the assistant tool_calls that pair with it. assistant.tool_calls[i].function.arguments is JSON-parsed (matching the canonical _tokenize_prompt path) before going to the chat template — Qwen's template calls .items() on it.
  2. Else fall back to the existing last-user RAG path unchanged.
  3. The should_compress(prompt_len) gate, the req.tools bypass, and all other guards are preserved.
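
A rough sketch of that flow, using the PR's names where they are known (should_compress, compress_texts_batch_via_daemon) and made-up ones elsewhere (_compress_last_user, the message handling details); the real server_tools.py code will differ in detail:

```python
import json

def _maybe_compress_tool_chat(messages, prompt_len, req_tools):
    # Illustrative shape only; should_compress, compress_texts_batch_via_daemon,
    # and _compress_last_user are assumed to be in scope.
    if req_tools or not should_compress(prompt_len):        # existing guards preserved
        return messages

    tool_msgs = [m for m in messages
                 if m.get("role") == "tool" and isinstance(m.get("content"), str)]
    if not tool_msgs:
        return _compress_last_user(messages)                # existing RAG fallback, unchanged

    # Batch-compress every tool output in one daemon park/unpark window.
    compressed = compress_texts_batch_via_daemon([m["content"] for m in tool_msgs])
    for msg, short in zip(tool_msgs, compressed):
        msg["content"] = short                              # role / tool_call_id untouched

    # Parse assistant tool_call arguments so the chat template can call .items().
    for m in messages:
        for call in m.get("tool_calls") or []:
            args = call["function"]["arguments"]
            if isinstance(args, str):
                call["function"]["arguments"] = json.loads(args)
    return messages
```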

Validated

10-turn replay of a real Claude Code session (helix-large), agent-loop pattern

RTX 3090 Ti, CUDA 13.2, Qwen3.6-27B Q4_K_M target, Qwen3-0.6B BF16 drafter, DFLASH_FP_USE_BSA=1 DFLASH_FP_ALPHA=0.85 keep_ratio=0.05, --max-ctx 24576, prefix grows 327 → 71K chars over the 10 calls.

| metric | cold (no pflash) | warm (pflash + this patch) | speedup |
|---|---|---|---|
| total TTFT (10 calls) | 243.70 s | 99.89 s | 2.44× |
| total wall (10 calls) | 255.84 s | 108.88 s | 2.35× |
| empty responses | 0 / 10 | 0 / 10 | |

3-turn pilot, per-call breakdown

Same hardware/config, helix turns 1–3:

| call | content shape | cold TTFT | warm TTFT | speedup | tok |
|---|---|---|---|---|---|
| 1 | 327 ch, no tool msgs | 0.96 s | 4.71 s | 0.20× ⚠ | 64 |
| 2 | 38K ch, 2 tool msgs | 16.47 s | 7.59 s | 2.17× | 64 |
| 3 | 38K ch, 3 tool msgs | 17.71 s | 8.14 s | 2.18× | 47 |
| total | | 35.14 s | 20.43 s | 1.72× | |

Per-tool-message compression on call 3: 6482 → 306 (21×), 5018 → 218 (23×), 78 → 14 (5.5×). Generation quality preserved — every warm call returned real completion tokens, no empty bodies, OpenAI tool structure intact through the chat template.

Call-1 caveat

The pilot's call 1 (no tool messages, 327-char prompt) regressed because pflash still does the daemon dance to compress a 249-char last_user fallback. This is masked under the default --prefill-threshold 32000 — short prompts don't cross the gate, so pflash bypasses entirely. The bench used --threshold 1 to force activation on every call. Worth noting; not worth a special case in this PR.
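
For what it's worth, the gate being described is effectively a length check; a guess at its shape, where the real should_compress and the units of prompt_len may differ:

```python
def should_compress(prompt_len: int, threshold: int = 32000) -> bool:
    # With the default --prefill-threshold 32000 a 327-char call never crosses
    # the gate, so pflash bypasses entirely; the bench forced --threshold 1.
    return prompt_len >= threshold
```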

Compatibility

  • Default --prefill-compression off is unchanged. No behaviour change for anyone not opting in.
  • req.tools non-empty bypass is unchanged.
  • The RAG / single-shot path (long last user message, no tool messages) is unchanged.
  • Tool semantics preserved end-to-end: role: tool messages keep their tool_call_id; assistant tool_calls are passed through with the same arg-parsing the canonical tokenize path uses.
  • Daemon protocol is unchanged. The new helper just sequences existing compress commands within one park/unpark window.

Test plan

  • 3-turn pilot on real Claude Code transcript with tool messages (per-call numbers above)
  • 10-turn helix-large run (totals above)
  • Single-shot RAG fallback path still works (no tool messages → compresses last user, unchanged)
  • Sweep across smaller sessions (lucebox ≤11K, nexiq ≤25K, axon ≤39K) — first attempt was lost to a logging collision in the bench harness; will rerun and post numbers as a comment. Predicted: smaller sessions break even or slightly regress (overhead dominates) while medium+ shows the win.
  • NIAH single-shot bench still works post-patch (should — fallback path is preserved verbatim)
  • Validate on Blackwell / sm_120 (only tested on RTX 3090 Ti / sm_86)

Caveats

  • Per-tool-message compression means N daemon round-trips inside one drafter session. Each message at ~6K tokens compresses in ~0.4 s on a 3090 Ti, so 10 tool messages = ~4 s of scoring on top of the ~0.5 s drafter-load amortised once. For sessions with many small tool messages this is a tax; could be reduced by a min-size threshold per-message ("don't compress tool messages under 1 K chars") in a follow-up.
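
That follow-up could be as small as a per-message floor applied before batching; the name and value below are illustrative, not part of this PR:

```python
MIN_TOOL_MSG_CHARS = 1024  # hypothetical per-message floor for a follow-up

def worth_compressing(msg: dict) -> bool:
    # Skip tiny tool outputs where the daemon round-trip costs more than it saves.
    content = msg.get("content")
    return isinstance(content, str) and len(content) >= MIN_TOOL_MSG_CHARS
```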

🤖 Generated with Claude Code

Multi-turn agent transcripts (Claude Code, OpenAI tool-call agent
loops, etc.) put 99% of their bytes in role=tool messages — Read /
Bash / Grep output and the like, accumulating turn-over-turn. The
existing pflash hook only compresses the last user message, which in
agent loops is a short typed prompt like "continue" or "what next".
Compressing that string is a no-op while the bulk sits in tool
messages pflash leaves alone.

This change extends `_maybe_compress_tool_chat` to compress every
role=tool message in place, preserving role / tool_call_id /
assistant.tool_calls / function.arguments parsing — the OpenAI tool
structure stays intact end-to-end. When there are no tool messages
the existing RAG / single-shot path (compress last user) is kept as
a fallback.

A new helper `compress_texts_batch_via_daemon` does the daemon dance
once for N texts: park target+draft → load drafter → compress each
text reusing the loaded drafter → free drafter once → unpark. The
existing `compress_text_via_daemon` is unchanged.

Validated on a 3-turn replay of a real Claude Code session
(helix-large, ~38K char prefix per call):

  call 1 (no tools):           cold 0.96s → warm 4.71s  (regression
    is pflash dance overhead on a 249-char last_user fallback;
    masked under default `--prefill-threshold 32000` because the
    full prompt doesn't cross the gate. Bench used `--threshold 1`)
  call 2 (2 tool msgs, 38K):   cold 16.47s → warm 7.59s   2.17x TTFT
  call 3 (3 tool msgs, 38K):   cold 17.71s → warm 8.14s   2.18x TTFT

Per-message compression on those tool messages: 6482 → 306 tokens
(21x), 5018 → 218 tokens (23x), 78 → 14 tokens (5.5x). Generation
quality preserved — every warm call returned real completion
tokens, no empty bodies, OpenAI tool structure intact through the
chat template (Qwen3 tool-call + tool-result rendering).

Cost: one drafter load per request (~0.5s on hot cache), then
N compresses inside the same daemon dance. Concurrent compresses
share the loaded drafter weights — the daemon's `compress` command
is idempotent on a loaded drafter.

Tested on RTX 3090 Ti, CUDA 13.2, Qwen3.6-27B Q4_K_M target +
Qwen3-0.6B BF16 drafter, `DFLASH_FP_USE_BSA=1 DFLASH_FP_ALPHA=0.85
keep_ratio=0.05`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

easel commented May 1, 2026

Follow-up: combining this patch with PR #59 (prefix cache) anti-stacks on agent loops.

Stacked the two on feat/cache-plus-pflash (this branch's commit + #59 + the cuda VMM pool-extension fix from easel/llama.cpp-dflash-ggml@fix/cuda-vmm-pool-extension-race) and re-ran the same 10-turn helix transcript. The combined stack works correctly — 10/10 warm calls return real content, no empties, daemon command surface is fine — but the speedup is lower than either layer alone:

| stack | TTFT cold | TTFT warm | TTFT × | wall × | warm time |
|---|---|---|---|---|---|
| cache only (#59) | 241.7 s | 78.6 s | 3.08× | 1.79× | 78.6 s |
| pflash only (this PR) | 243.7 s | 99.9 s | 2.44× | 2.35× | 99.9 s |
| combined | 237.4 s | 109.3 s | 2.17× | 2.03× | 109.3 s |

Mechanism (verified via warm log):

  • Both layers fire correctly. 56 [compress] N → M events across the 10 calls, plus 9/10 [pc] lookup hit events. Tool messages compress 21–23×, cache populates and rotates LRU between slots normally.
  • They compete for the same prefill latency. Pflash already shrinks the per-call prefill from O(uncompressed_prompt) to O(compressed_prompt + scoring). The cache then RESTORE-skips a prefix of that compressed prompt — but the absolute bytes saved per hit are now much smaller than in the cache-only run (where each hit skipped prefill of a full uncompressed prompt).
  • Pflash's per-call dance overhead is fixed (park target/draft + drafter load + N compresses + free + unpark = ~3–5 s on a 24 GB card). Across 10 calls that's a constant ~40 s of work that the cache can't amortize away because the cache only kicks in after the dance.
  • Net result: pflash steals the prefill work the cache would have saved, then adds its own per-call cost. The two layers each do exactly what they claim, but the marginal value of the second one is negative.

This isn't a bug in either PR — both do what they advertise on workloads they were designed for. It's a composition problem: they target the same prefill latency, and pflash gets there first.

For agent loops specifically, cache-only (#59) at 3.08× is the winner for this hardware/workload. Pflash remains the right tool for single-shot long-prompt RAG (validated separately at 128K NIAH on the same daemon: 26.1 s vs 24.8 s headline = within 5 %).

If composition matters for production, two architectural sketches that would actually stack:

  1. Cache-first: on lookup hit, skip the compress dance entirely (if hit: just RESTORE, no pflash). Compress only on cache miss (sketched below).
  2. Cache key on uncompressed prefix: hash the original token IDs (pre-pflash), have the daemon RESTORE on the cached snapshot, then run pflash on just the new turn's tool messages. Requires daemon-side cooperation between the two paths.
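
A control-flow sketch of option 1, where every name (cache.lookup, daemon.restore, prefill_from, ...) is a placeholder rather than the #59 or pflash API:

```python
def handle_prefill(req, cache, daemon):
    # Cache-first composition: each call pays for either the cache or pflash, never both.
    hit = cache.lookup(req.prompt_token_ids)
    if hit is not None:
        daemon.restore(hit.snapshot)                  # cache hit: RESTORE only, no compress dance
        return prefill_from(req.messages, start=hit.prefix_len)
    msgs = maybe_compress_tool_chat(req.messages)     # cache miss: run the pflash pass once
    result = prefill_from(msgs, start=0)
    cache.store(req.prompt_token_ids, result.snapshot)
    return result
```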

Posting this here mainly as data — happy to send a sketch PR for #1 (cache-first) on top of this one if it's interesting; #2 is bigger and probably belongs on the prefix-cache side.
