perf(reply): hot-window carryover, gate, scratch cache #281
Merged
Conversation
f06546c to efc3477
When a follow-up turn lands within the hot window and the prior reply
already pulled fresh data via a tool, re-running webSearch / diary /
graph wastes calls. Two cheap, deterministic mechanisms address this:
- DialogueMemory.record_tool_turn / get_recent_turns_with_tools persist
in-loop tool-call + tool-result messages so the next turn sees them
as prior context. Capped by tool_carryover_max_turns and
tool_carryover_per_entry_chars; secrets scrubbed; UNTRUSTED WEB
EXTRACT fence markers preserved on truncation; cleared on stop.
- recall_gate.should_recall: language-agnostic content-word coverage
heuristic (\w{3,} with re.UNICODE). Skips diary/graph enrichment
only when the hot window already contains a fresh tool-result row
covering >=50% of the query. Fail-open on any error.
Wires into engine.run_reply_engine via a hasattr-guarded fetch, an
idempotent _maybe_record_tool_carryover helper called on success and
error paths, and a clear_tool_carryover on the stop-signal path.
Adds cfg fields, settings UI entries, spec + llm_contexts updates,
and 19 new tests covering carryover, scrubbing, fence preservation,
stop-clear, text-tool fallback, and Cyrillic recall coverage.
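The content-word coverage heuristic behind recall_gate.should_recall can be sketched as follows. This is a minimal, illustrative version under stated assumptions: the signature, the hot-window argument, and the threshold default are guesses consistent with the description above (\w{3,} with re.UNICODE, skip only at >=50% coverage, fail-open on any error); the real function's interface may differ.

```python
import re

# Content words: runs of >= 3 word characters. re.UNICODE makes \w match
# non-Latin scripts (e.g. Cyrillic), keeping the heuristic language-agnostic.
_WORD_RE = re.compile(r"\w{3,}", re.UNICODE)


def should_recall(query: str, hot_window_text: str, threshold: float = 0.5) -> bool:
    """Return True when diary/graph recall should run for this turn.

    Skips recall only when the hot window already covers >= `threshold`
    of the query's content words. Fail-open: any error means "do recall".
    """
    try:
        query_words = {w.lower() for w in _WORD_RE.findall(query)}
        if not query_words:
            return True  # nothing to measure; run the recall
        window_words = {w.lower() for w in _WORD_RE.findall(hot_window_text)}
        coverage = len(query_words & window_words) / len(query_words)
        return coverage < threshold
    except Exception:
        return True  # fail-open
```

With this shape, a follow-up whose content words are already half-covered by a fresh tool result in the hot window skips enrichment, while a query introducing new entities triggers a normal recall.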
Layer three optimisations on top of the carryover + recall gate so the
warm-profile, extractor, and router paths each pay at most once per hot
window, and tighten privacy + bounds along the way.
- DialogueMemory.hot_cache_get/put/clear: per-conversation scratch cache
bounded by RECENT_WINDOW_SEC, used by the engine to memoise:
- warm profile block (query-agnostic SQLite traversal)
- memory enrichment extractor LLM call
- tool router LLM call (catalogue signature in key invalidates on
mid-window MCP refresh)
Cleared on the stop signal alongside tool carryover; auto-expires with
the hot window.
- Recall gate is now bypassed when the planner explicitly emits a
searchMemory step. Planner intent always wins over coverage heuristics
(C1 fix, was silently dropping memory work the planner asked for).
- record_tool_turn scrubs outside the lock and bounds storage to
_tool_turns_max_storage = 16 as a backstop against unbounded growth in
pathological sessions.
- Extended redact() vendor and keyword patterns: AWS AKIA/ASIA, Stripe
sk/pk/rk_(live|test)_, GitHub PATs, OpenAI sk-, Google AIza,
Authorization Bearer/Basic, plus refresh/access/id/oauth tokens and
session ids. Closes the gap where carryover payloads or recall-gate
debug logs could leak structurally-detectable credentials.
- Adds recall_gate.spec.md (scope, heuristic, language-agnostic design,
privacy, fail-open) and registers it in CLAUDE.md.
- Tests: tests/test_dialogue_memory_hot_cache.py (cache primitives,
is_tool_message, _tool_turns cap), tests/test_redact_extended.py
(vendor + Authorization + keyword patterns),
tests/test_engine_hot_window_caches.py (warm profile / extractor /
router cache hits across two turns; planner overrides recall gate).
- Updates docs/llm_contexts.md and reply.spec.md to document the cache
primitive, planner precedence over the gate, and new redact coverage.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
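The scratch-cache primitive described above (hot_cache_get/put/clear, age-bounded by RECENT_WINDOW_SEC) can be sketched like this. A hedged illustration, not the project's implementation: the class wrapper and the 300-second value are assumptions, and a later push in this PR extends entry lifetime beyond the hot window.

```python
import time
import threading

RECENT_WINDOW_SEC = 300  # assumed hot-window length; illustrative value


class HotCache:
    """Per-conversation scratch cache; entries expire with the hot window."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: dict[str, tuple[float, object]] = {}

    def put(self, key: str, value: object) -> None:
        with self._lock:
            self._entries[key] = (time.time(), value)

    def get(self, key: str, default=None):
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                return default
            ts, value = entry
            if time.time() - ts > RECENT_WINDOW_SEC:
                del self._entries[key]  # auto-expire with the hot window
                return default
            return value

    def clear(self) -> None:
        """Called on the stop signal, alongside tool-carryover clearing."""
        with self._lock:
            self._entries.clear()
```

The router's cache key would additionally fold in a catalogue signature, so a mid-window MCP refresh changes the key and naturally invalidates the stale routing decision.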
bf0d26e to 0bea96b
…validation
Extends the reply-engine scratch cache lifetime from 5-minute hot-window
to the full active conversation, with proper invalidation hooks so the
cached warm profile and tool router output stay correct as the
conversation grows.
DialogueMemory:
- hot_cache_get/put no longer expire entries by RECENT_WINDOW_SEC age.
Entries persist until clear_hot_cache(), invalidate_warm_profile(),
or new-conversation reset in the engine.
- _tool_turns no longer pruned by age either; bounded by hard storage
cap (16 turns) and cleared on new-conversation entry / stop.
- Adds WARM_PROFILE_CACHE_KEY constant so the engine and the
graph-mutation invalidator share a single key.
- Adds invalidate_warm_profile() that drops only that one entry.
- Scrubs tool_calls.arguments (dict and JSON-string forms) so re-
injection of the assistant tool_calls row on the next turn cannot
leak PII captured in query args.
- _next_ts() helper guarantees strictly-monotonic timestamps for
_messages and _tool_turns, fixing interleave ordering on Windows
where time.time() has ~16ms granularity.
graph.py:
- Adds register_graph_mutation_listener / unregister registry plus
_resolve_branch helper (parent-walk to FIXED_BRANCH ancestor with
depth cap). create_node / update_node / delete_node now notify
listeners with action + node_id + branch attribution. delete
resolves branch BEFORE the row is gone.
reply/engine.py:
- New-conversation entry (has_recent_messages was False) now wipes
hot_cache and tool carryover before the turn runs, so stale state
from a lapsed session can't leak into a fresh one.
- Warm-profile cache lookup uses the shared cache key constant.
daemon.py:
- After DialogueMemory init, registers a graph mutation listener
filtering on {BRANCH_USER, BRANCH_DIRECTIVES} that calls
invalidate_warm_profile() on the global DialogueMemory. World-
branch writes are noise for the warm profile and are ignored.
Specs (reply, graph, llm_contexts) updated to reflect conversation-
scoped cache lifetime and the new invalidation hooks.
Tests: new tests/test_graph_mutation_listener.py covers the registry
+ branch attribution + warm-profile invalidation hook end-to-end.
test_dialogue_memory_hot_cache, test_dialogue_memory_tool_carryover,
and test_engine_hot_window_caches updated for new lifetime contract,
tool_call arg scrubbing, and new-conversation cache wipe.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Addresses issues raised in review of a48be9e:
- _scrub_tool_call now also scrubs list/tuple-form arguments via a recursive _scrub_args helper. Closes the privacy gap where a non-standard provider emitting positional args could leak secrets on tool re-injection.
- daemon.py reads _global_dialogue_memory through the module global at fire time instead of capturing it by closure, so a future singleton swap (tests, hot reload) routes invalidation to the current instance. Tracks the registered listener so it can be unregistered on shutdown and on re-entry, preventing stale closures from accumulating in the module-level registry.
- _next_ts() docstring now explains the Windows ~16ms granularity motivation and the lock requirement.
- Tests:
  * direct unit tests for _next_ts() (consecutive monotonic, advance past artificially high _last_ts).
  * _resolve_branch returns None past MAX_TRAVERSAL_DEPTH and on unknown node ids.
  * mutation listener is NOT called when create_node raises (no spurious events on failed writes).
  * tool_call arguments scrubbed when passed as list of mixed scalars and dicts.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Python requires all global statements to precede any use of the variable in the same function scope. The two inner global _warm_profile_graph_listener declarations inside try/finally blocks were triggering a SyntaxError on startup. Consolidate the declaration at the top of main() alongside the other module globals. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
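A minimal reproduction of the bug class and the fix, with illustrative names (the real function is main() in daemon.py; _listener here is a stand-in):

```python
# Broken shape: the name is used before the (late) global declaration in
# the same function scope, which is a compile-time SyntaxError in Python:
#
#   def main():
#       if _listener is not None:    # name used here...
#           pass
#       try:
#           global _listener         # ...declared global here -> SyntaxError
#       finally:
#           pass

_listener = None


def main():
    # Fix: one global declaration at the top of the function, before any
    # use of the name, shared by every code path that assigns it.
    global _listener
    if _listener is None:
        _listener = object()
    return _listener
```

Because the error is raised at compile time, the module fails to import at all, which is why the broken form surfaced as a crash on startup rather than at the point the try/finally block ran.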
isair added a commit that referenced this pull request on May 2, 2026
Audit triggered by the hot-cache surface introduced in #281: does the deflection scrub need a listener-style invalidation hook similar to the graph's invalidate_warm_profile? Verified the answer is no, and pinned the contract so a future change that starts caching diary content has to opt in deliberately:
- DialogueMemory's hot cache holds three values: warm_profile_block (graph-derived, not diary), the per-query router decision, and the per-query memory-extractor parameters. None contain diary text.
- search_conversation_memory_by_keywords reads SQLite live on every enrichment-bearing turn; the engine never stashes diary entries across turns.
- Concurrency between the sweep and an in-flight reply is handled by SQLite WAL: separate Database instances on different connections; WAL serialises writes against reads at the file level.
Spec gains a "Cache invariant" paragraph documenting the rule and the counterpart relationship to the graph's listener pattern. Regression test seeds a row, populates the engine's hot cache with extractor params, runs the scrub without invalidating the cache, and asserts the follow-up search returns cleaned content. If a future change adds a diary-content cache without an invalidation path, this test still passes while the user observes stale results, so the spec note is the load-bearing piece; the test catches the obvious regression.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
isair added a commit that referenced this pull request on May 2, 2026
fix(memory): deterministic deflection scrub for diary writes + bulk clean button (#323)

* fix(memory): deterministic deflection scrub for diary writes + bulk clean button

The summariser prompt rule against narrating assistant failures (added in #232) reduces but does not eliminate the leak on small models. Field measurement on a real diary showed roughly 40% of post-rule writes on gemma4:e2b still contained banned phrasing, and pre-rule rows remained poisoned with no way to clean them.

Adds a deterministic safety net mirroring the knowledge graph's two-layer defence (extractor BANNED FACT FORMS at write-time + merge-time rewrite for historical data):
- `scrub_deflection_sentences()`: drops whole sentences matching narrow regex patterns ("the assistant could not...", "offered to search...", "lacks access...", etc.). Keeps the original if scrubbing would empty the row; an empty diary entry is worse than a slightly-leaky one. Idempotent.
- Wired into `update_daily_conversation_summary` so every new write is cleaned before it lands in `conversation_summaries`.
- `scrub_all_diary_summaries()`: bulk sweep over every row; same semantics, fail-open per row, streams privacy-safe events (counts only, never raw summary text).
- `POST /api/diary/scrub-deflections`: NDJSON-streaming endpoint backed by the bulk sweep.
- Memory viewer: "Clean up deflection narration" button under a new Maintenance section in the diary sidebar. Modal copy is explicit about what is removed and what stays, so users do not worry about silent data loss.

Spec updates: `summariser.spec.md` documents the post-process layer + UI; `docs/llm_contexts.md` notes the scrub in the summariser data flow; CLAUDE.md spec registry refreshed.

Tests: 30 unit tests covering scrub patterns, edge cases (empty input, total-deflection rows, idempotence), DB integration of the sweep, streaming endpoint contract, and a privacy regression that asserts diary content cannot leak through the streaming UI. Eval extended with a defence-in-depth check that asserts post-scrub output is clean even when the live LLM leaks.

Closes #259

* fix(memory): wire diary clean button on page load + harden sweep semantics

Field bug: clicking "Clean up deflection narration" did nothing. The handler was wired inside `initGraph()`, which only runs when the user opens the Knowledge tab. Diary is the default tab, so a user who never visited Knowledge first never had the click handler attached. Wires the handler in the always-run page setup section. Adds a regression test that asserts the wiring is structurally outside `initGraph()` so a future refactor cannot reintroduce the bug.

While here, addresses the verified findings from the multi-agent review:
- Preserve `ts_utc` on bulk-sweep writes. `db.upsert_conversation_summary` gains an optional `ts_utc` parameter; the sweep passes through the row's original write time so a maintenance pass cannot stomp the audit trail (every cleaned row would otherwise look as though it had been written today).
- Refresh the vector embedding inline when the bulk sweep rewrites a row. Without this the embedding stays anchored to the pre-scrub text and vector search drifts from FTS results. Best-effort: an embedding service failure is logged but does not roll back the summary write. When the caller has no embed model configured (offline-only setup), the sweep skips re-embedding and reports `embedding_refreshed=False`.
- Surface only the exception class name through the streaming UI's `error` field (and the modal's status copy on terminal errors). A stringified Python exception can echo offending input back to the caller; class names cannot.
- Construct modal log entries with `textContent` on a real DOM node rather than `innerHTML +=`. Defence-in-depth: even with the privacy whitelist, no path should be able to inject markup into the modal log.
- Close the database connection in `finally` so a mid-iteration exception cannot orphan it until GC.
- Lock the streaming progress event shape behind a whitelist test (`test_progress_event_keys_are_a_known_whitelist`). Any future field added to the sweep's event dict that could carry summary text now trips this test, forcing deliberate review.

Adds tests for the `stated…` and `indicated…` regex branches that had zero coverage, multi-line summaries with `\n` between sentences, and summaries without terminal punctuation. README troubleshooting entry for users hitting poisoned diary rows; spec updated with the audit-trail and embedding-refresh contracts.

* docs(memory): pin diary-scrub cache invariant + regression test

Audit triggered by the hot-cache surface introduced in #281: does the deflection scrub need a listener-style invalidation hook similar to the graph's invalidate_warm_profile? Verified the answer is no, and pinned the contract so a future change that starts caching diary content has to opt in deliberately: the hot cache holds only graph-derived and per-query values (no diary text), diary search reads SQLite live on every enrichment-bearing turn, and sweep/reply concurrency is handled by SQLite WAL. Spec gains a "Cache invariant" paragraph; a regression test seeds a row, populates the engine's hot cache with extractor params, runs the scrub without invalidating the cache, and asserts the follow-up search returns cleaned content.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
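The sentence-level scrub described above can be sketched as follows. The patterns and the sentence splitter here are illustrative stand-ins; the real list of deflection regexes is broader, and the splitting rules handle cases (newlines between sentences, missing terminal punctuation) not shown in this minimal version.

```python
import re

# Narrow, illustrative deflection patterns; the real set is broader.
_DEFLECTION_RES = [
    re.compile(r"\bthe assistant (?:could not|was unable to)\b", re.IGNORECASE),
    re.compile(r"\boffered to search\b", re.IGNORECASE),
    re.compile(r"\blacks access\b", re.IGNORECASE),
]

# Split after sentence-ending punctuation followed by whitespace.
_SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def scrub_deflection_sentences(summary: str) -> str:
    """Drop whole sentences that match a deflection pattern.

    Keeps the original text if scrubbing would empty the row (an empty
    diary entry is worse than a slightly-leaky one). Idempotent: running
    it on already-clean text returns the text unchanged.
    """
    sentences = _SENTENCE_SPLIT.split(summary.strip())
    kept = [s for s in sentences
            if not any(p.search(s) for p in _DEFLECTION_RES)]
    cleaned = " ".join(kept).strip()
    return cleaned if cleaned else summary
```

Dropping whole sentences rather than rewriting them keeps the operation deterministic and trivially idempotent, which is what lets the same function back both the write-time hook and the bulk sweep.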
isair added a commit that referenced this pull request on May 3, 2026
* fix(planner): allow hyphens in tool-name head regex for MCP tools

`_TOOL_NAME_HEAD_RE` only matched `[A-Za-z_][A-Za-z0-9_]*`, so MCP tool names like `chrome-devtools__navigate_page` were truncated at the hyphen to `chrome`, which is not in the catalogue. `tool_names_in_plan` returned no known names, `plan_has_unresolved_tool_steps` flipped to True, and the engine skipped the direct-exec path entirely. The small chat model then took the turn with a 3-tool catalogue, flaked on the dunder MCP name, and emitted nothing, producing the "Sorry, I had trouble processing that" fallback for any "navigate to <site>" command. Widening the character class to include `-` lets the regex extract the full hyphenated name so direct-exec fires as designed.

* fix(router): don't cache fall-open fallback; use keyword scoring

PR #281 added a conversation-scoped cache for the tool router output so identical follow-ups skip the LLM call. The cache stores whatever `select_tools` returns, including the "fall open to all tools" fallback that fires when the router LLM times out, returns empty, or emits text with no recognised tool name. One small-model fluke now pins `allowed_tools` to the full ~41-tool catalogue for the rest of the session, the planner gets overwhelmed and emits prose instead of naming a tool, and the small chat model produces an empty reply ("Sorry, I had trouble processing that") on every subsequent turn.

Two changes:
1. Engine refuses to cache the fall-open. When `routed_tools` equals the full catalogue (set equality, not just length), skip the cache write so the next turn re-rolls the router. Re-rolling is cheap and almost always recovers.
2. `_select_llm` now falls back to the keyword strategy on timeout, empty response, or parse failure, instead of returning every tool. Keyword scoring matches "navigate" to `chrome-devtools__navigate_page` deterministically, and the engine's `toolSearchTool` escape hatch still lets the chat model widen mid-loop if keyword missed.

Specs (`selection.spec.md`, `reply.spec.md`) updated to match.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Cuts wasted LLM and SQLite work when a conversation stays in the same hot window. Targets the current architecture (warm profile + planner-first routing + optional memory recall), not the pre-rebase one.
- `DialogueMemory` now persists in-loop assistant-with-tool_calls + tool-role messages; the engine re-injects them on the next turn so follow-ups synthesise from prior results instead of re-fetching. Capped via `cfg.tool_carryover_max_turns` / `tool_carryover_per_entry_chars`, scrubbed of secrets, fence markers preserved on truncation, cleared on stop, excluded from diary input. Hard storage backstop (`_tool_turns_max_storage = 16`) so a pathological session cannot grow the list unbounded.
- Recall gate (`\w{3,}` + `re.UNICODE`, no LLM, language-agnostic) that skips diary + graph + memory-digest enrichment when the hot window already grounds a follow-up. Fail-open. Bypassed when the planner explicitly emits `searchMemory` so planner intent always wins (C1 fix). Spec at `src/jarvis/memory/recall_gate.spec.md`.
- `DialogueMemory` gains one primitive (`hot_cache_get/put/clear`, auto-expires with `RECENT_WINDOW_SEC`) that memoises three idempotent per-turn computations within one hot window: the warm profile block, the memory-extractor LLM call, and the tool-router LLM call. Cleared on stop alongside tool carryover.
- Extended `redact()` patterns so debug logs and persisted tool payloads cannot leak credentials: AWS access keys (AKIA / ASIA), Stripe `sk/pk/rk_(live|test)_…`, GitHub PATs, OpenAI `sk-…`, Google `AIza…`, `Authorization: Bearer/Basic …`, plus refresh / access / id / oauth tokens and session ids.

Why this shape
The earlier architecture threw away tool messages at every turn-end and ran diary/graph recall unconditionally. The current architecture (warm profile + planner gating recall + warm chains for small models) closed the second hole; the first was still there, and three caches were paying for the same value on every follow-up. This PR closes the carryover hole and adds the cache primitive so the planner / router / extractor / warm-profile paths each pay at most once per hot window.
Test plan
- `tests/test_dialogue_memory_tool_carryover.py` (10 tests): record/retrieve, interleaving, truncation, max-turn cap, clear, expiry, secret scrub, fence-end preserved on truncation, diary isolation
- `tests/test_dialogue_memory_hot_cache.py` (new): get/put/expiry/clear; `_tool_turns` storage cap; `is_tool_message` helper across native, text-tool, and assistant-tool_calls shapes
- `tests/test_recall_gate.py` (existing) + `tests/test_engine_hot_window_caches.py` (new): planner precedence over the gate (C1 fix) and one-call-per-conversation behaviour for the warm profile, extractor, and router caches
- `tests/test_redact_extended.py` (new): vendor access keys, Authorization headers, keyword-anchored credential variants
- `tests/test_engine_tool_carryover.py` (existing): end-to-end two-turn integration, native + text-tool paths, stop-signal clears carryover
- `pytest tests/test_engine_tool_carryover.py tests/test_recall_gate.py tests/test_dialogue_memory_tool_carryover.py tests/test_dialogue_memory_hot_cache.py tests/test_redact_extended.py tests/test_engine_hot_window_caches.py tests/test_engine_planner_integration.py` → 55 passed
- Manual: `webSearch` fires across a 3-turn entity follow-up sequence
- Docs: `src/jarvis/reply/reply.spec.md`, `docs/llm_contexts.md`, `src/jarvis/memory/recall_gate.spec.md` (new), `CLAUDE.md` registry

🤖 Generated with Claude Code