perf(reply): hot-window carryover, gate, scratch cache #281
Merged
Conversation
f06546c to efc3477
When a follow-up turn lands within the hot window and the prior reply
already pulled fresh data via a tool, re-running webSearch / diary /
graph wastes calls. Two cheap, deterministic mechanisms address this:
- DialogueMemory.record_tool_turn / get_recent_turns_with_tools persist
in-loop tool-call + tool-result messages so the next turn sees them
as prior context. Capped by tool_carryover_max_turns and
tool_carryover_per_entry_chars; secrets scrubbed; UNTRUSTED WEB
EXTRACT fence markers preserved on truncation; cleared on stop.
- recall_gate.should_recall: language-agnostic content-word coverage
heuristic (\w{3,} with re.UNICODE). Skips diary/graph enrichment
only when the hot window already contains a fresh tool-result row
covering >=50% of the query. Fail-open on any error.
Wires into engine.run_reply_engine via a hasattr-guarded fetch, an
idempotent _maybe_record_tool_carryover helper called on success and
error paths, and a clear_tool_carryover on the stop-signal path.
Adds cfg fields, settings UI entries, spec + llm_contexts updates,
and 19 new tests covering carryover, scrubbing, fence preservation,
stop-clear, text-tool fallback, and Cyrillic recall coverage.
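The content-word coverage heuristic behind recall_gate.should_recall can be sketched as follows. This is a minimal, illustrative version under stated assumptions: the signature, the hot-window argument, and the threshold default are guesses consistent with the description above (\w{3,} with re.UNICODE, skip only at >=50% coverage, fail-open on any error); the real function's interface may differ.

```python
import re

# Content words: runs of >= 3 word characters. re.UNICODE makes \w match
# non-Latin scripts (e.g. Cyrillic), keeping the heuristic language-agnostic.
_WORD_RE = re.compile(r"\w{3,}", re.UNICODE)


def should_recall(query: str, hot_window_text: str, threshold: float = 0.5) -> bool:
    """Return True when diary/graph recall should run for this turn.

    Skips recall only when the hot window already covers >= `threshold`
    of the query's content words. Fail-open: any error means "do recall".
    """
    try:
        query_words = {w.lower() for w in _WORD_RE.findall(query)}
        if not query_words:
            return True  # nothing to measure; run the recall
        window_words = {w.lower() for w in _WORD_RE.findall(hot_window_text)}
        coverage = len(query_words & window_words) / len(query_words)
        return coverage < threshold
    except Exception:
        return True  # fail-open
```

With this shape, a follow-up whose content words are already half-covered by a fresh tool result in the hot window skips enrichment, while a query introducing new entities triggers a normal recall.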
Layer three optimisations on top of the carryover + recall gate so the
warm-profile, extractor, and router paths each pay at most once per hot
window, and tighten privacy + bounds along the way.
- DialogueMemory.hot_cache_get/put/clear: per-conversation scratch cache
bounded by RECENT_WINDOW_SEC, used by the engine to memoise:
- warm profile block (query-agnostic SQLite traversal)
- memory enrichment extractor LLM call
- tool router LLM call (catalogue signature in key invalidates on
mid-window MCP refresh)
Cleared on the stop signal alongside tool carryover; auto-expires with
the hot window.
- Recall gate is now bypassed when the planner explicitly emits a
searchMemory step. Planner intent always wins over coverage heuristics
(C1 fix, was silently dropping memory work the planner asked for).
- record_tool_turn scrubs outside the lock and bounds storage to
_tool_turns_max_storage = 16 as a backstop against unbounded growth in
pathological sessions.
- Extended redact() vendor and keyword patterns: AWS AKIA/ASIA, Stripe
sk/pk/rk_(live|test)_, GitHub PATs, OpenAI sk-, Google AIza,
Authorization Bearer/Basic, plus refresh/access/id/oauth tokens and
session ids. Closes the gap where carryover payloads or recall-gate
debug logs could leak structurally-detectable credentials.
- Adds recall_gate.spec.md (scope, heuristic, language-agnostic design,
privacy, fail-open) and registers it in CLAUDE.md.
- Tests: tests/test_dialogue_memory_hot_cache.py (cache primitives,
is_tool_message, _tool_turns cap), tests/test_redact_extended.py
(vendor + Authorization + keyword patterns),
tests/test_engine_hot_window_caches.py (warm profile / extractor /
router cache hits across two turns; planner overrides recall gate).
- Updates docs/llm_contexts.md and reply.spec.md to document the cache
primitive, planner precedence over the gate, and new redact coverage.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
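The scratch-cache primitive described above (hot_cache_get/put/clear, age-bounded by RECENT_WINDOW_SEC) can be sketched like this. A hedged illustration, not the project's implementation: the class wrapper and the 300-second value are assumptions, and a later push in this PR extends entry lifetime beyond the hot window.

```python
import time
import threading

RECENT_WINDOW_SEC = 300  # assumed hot-window length; illustrative value


class HotCache:
    """Per-conversation scratch cache; entries expire with the hot window."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: dict[str, tuple[float, object]] = {}

    def put(self, key: str, value: object) -> None:
        with self._lock:
            self._entries[key] = (time.time(), value)

    def get(self, key: str, default=None):
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                return default
            ts, value = entry
            if time.time() - ts > RECENT_WINDOW_SEC:
                del self._entries[key]  # auto-expire with the hot window
                return default
            return value

    def clear(self) -> None:
        """Called on the stop signal, alongside tool-carryover clearing."""
        with self._lock:
            self._entries.clear()
```

The router's cache key would additionally fold in a catalogue signature, so a mid-window MCP refresh changes the key and naturally invalidates the stale routing decision.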
bf0d26e to 0bea96b
…validation
Extends the reply-engine scratch cache lifetime from 5-minute hot-window
to the full active conversation, with proper invalidation hooks so the
cached warm profile and tool router output stay correct as the
conversation grows.
DialogueMemory:
- hot_cache_get/put no longer expire entries by RECENT_WINDOW_SEC age.
Entries persist until clear_hot_cache(), invalidate_warm_profile(),
or new-conversation reset in the engine.
- _tool_turns no longer pruned by age either; bounded by hard storage
cap (16 turns) and cleared on new-conversation entry / stop.
- Adds WARM_PROFILE_CACHE_KEY constant so the engine and the
graph-mutation invalidator share a single key.
- Adds invalidate_warm_profile() that drops only that one entry.
- Scrubs tool_calls.arguments (dict and JSON-string forms) so re-
injection of the assistant tool_calls row on the next turn cannot
leak PII captured in query args.
- _next_ts() helper guarantees strictly-monotonic timestamps for
_messages and _tool_turns, fixing interleave ordering on Windows
where time.time() has ~16ms granularity.
graph.py:
- Adds register_graph_mutation_listener / unregister registry plus
_resolve_branch helper (parent-walk to FIXED_BRANCH ancestor with
depth cap). create_node / update_node / delete_node now notify
listeners with action + node_id + branch attribution. delete
resolves branch BEFORE the row is gone.
reply/engine.py:
- New-conversation entry (has_recent_messages was False) now wipes
hot_cache and tool carryover before the turn runs, so stale state
from a lapsed session can't leak into a fresh one.
- Warm-profile cache lookup uses the shared cache key constant.
daemon.py:
- After DialogueMemory init, registers a graph mutation listener
filtering on {BRANCH_USER, BRANCH_DIRECTIVES} that calls
invalidate_warm_profile() on the global DialogueMemory. World-
branch writes are noise for the warm profile and are ignored.
Specs (reply, graph, llm_contexts) updated to reflect conversation-
scoped cache lifetime and the new invalidation hooks.
Tests: new tests/test_graph_mutation_listener.py covers the registry
+ branch attribution + warm-profile invalidation hook end-to-end.
test_dialogue_memory_hot_cache, test_dialogue_memory_tool_carryover,
and test_engine_hot_window_caches updated for new lifetime contract,
tool_call arg scrubbing, and new-conversation cache wipe.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Addresses issues raised in review of a48be9e:
- _scrub_tool_call now also scrubs list/tuple-form arguments via a recursive _scrub_args helper. Closes the privacy gap where a non-standard provider emitting positional args could leak secrets on tool re-injection.
- daemon.py reads _global_dialogue_memory through the module global at fire time instead of capturing it by closure, so a future singleton swap (tests, hot reload) routes invalidation to the current instance. Tracks the registered listener so it can be unregistered on shutdown and on re-entry, preventing stale closures from accumulating in the module-level registry.
- _next_ts() docstring now explains the Windows ~16ms granularity motivation and the lock requirement.
- Tests:
  * direct unit tests for _next_ts() (consecutive monotonic, advance past artificially high _last_ts).
  * _resolve_branch returns None past MAX_TRAVERSAL_DEPTH and on unknown node ids.
  * mutation listener is NOT called when create_node raises (no spurious events on failed writes).
  * tool_call arguments scrubbed when passed as list of mixed scalars and dicts.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Python requires all global statements to precede any use of the variable in the same function scope. The two inner global _warm_profile_graph_listener declarations inside try/finally blocks were triggering a SyntaxError on startup. Consolidate the declaration at the top of main() alongside the other module globals. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
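A minimal reproduction of the bug class and the fix, with illustrative names (the real function is main() in daemon.py; _listener here is a stand-in):

```python
# Broken shape: the name is used before the (late) global declaration in
# the same function scope, which is a compile-time SyntaxError in Python:
#
#   def main():
#       if _listener is not None:    # name used here...
#           pass
#       try:
#           global _listener         # ...declared global here -> SyntaxError
#       finally:
#           pass

_listener = None


def main():
    # Fix: one global declaration at the top of the function, before any
    # use of the name, shared by every code path that assigns it.
    global _listener
    if _listener is None:
        _listener = object()
    return _listener
```

Because the error is raised at compile time, the module fails to import at all, which is why the broken form surfaced as a crash on startup rather than at the point the try/finally block ran.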
isair added a commit that referenced this pull request on May 2, 2026
Audit triggered by the hot-cache surface introduced in #281: does the deflection scrub need a listener-style invalidation hook similar to the graph's invalidate_warm_profile? Verified the answer is no, and pinned the contract so a future change that starts caching diary content has to opt in deliberately:
- DialogueMemory's hot cache holds three values: warm_profile_block (graph-derived, not diary), the per-query router decision, and the per-query memory-extractor parameters. None contain diary text.
- search_conversation_memory_by_keywords reads SQLite live on every enrichment-bearing turn; the engine never stashes diary entries across turns.
- Concurrency between the sweep and an in-flight reply is handled by SQLite WAL: separate Database instances on different connections; WAL serialises writes against reads at the file level.
Spec gains a "Cache invariant" paragraph documenting the rule and the counterpart relationship to the graph's listener pattern. Regression test seeds a row, populates the engine's hot cache with extractor params, runs the scrub without invalidating the cache, and asserts the follow-up search returns cleaned content. If a future change adds a diary-content cache without an invalidation path, this test still passes while the user observes stale results, so the spec note is the load-bearing piece; the test catches the obvious regression.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
isair added a commit that referenced this pull request on May 2, 2026
fix(memory): deterministic deflection scrub for diary writes + bulk clean button (#323)

* fix(memory): deterministic deflection scrub for diary writes + bulk clean button

The summariser prompt rule against narrating assistant failures (added in #232) reduces but does not eliminate the leak on small models. Field measurement on a real diary showed roughly 40% of post-rule writes on gemma4:e2b still contained banned phrasing, and pre-rule rows remained poisoned with no way to clean them.

Adds a deterministic safety net mirroring the knowledge graph's two-layer defence (extractor BANNED FACT FORMS at write-time + merge-time rewrite for historical data):
- `scrub_deflection_sentences()`: drops whole sentences matching narrow regex patterns ("the assistant could not...", "offered to search...", "lacks access...", etc.). Keeps the original if scrubbing would empty the row; an empty diary entry is worse than a slightly-leaky one. Idempotent.
- Wired into `update_daily_conversation_summary` so every new write is cleaned before it lands in `conversation_summaries`.
- `scrub_all_diary_summaries()`: bulk sweep over every row; same semantics, fail-open per row, streams privacy-safe events (counts only, never raw summary text).
- `POST /api/diary/scrub-deflections`: NDJSON-streaming endpoint backed by the bulk sweep.
- Memory viewer: "Clean up deflection narration" button under a new Maintenance section in the diary sidebar. Modal copy is explicit about what is removed and what stays, so users do not worry about silent data loss.

Spec updates: `summariser.spec.md` documents the post-process layer + UI; `docs/llm_contexts.md` notes the scrub in the summariser data flow; CLAUDE.md spec registry refreshed.

Tests: 30 unit tests covering scrub patterns, edge cases (empty input, total-deflection rows, idempotence), DB integration of the sweep, streaming endpoint contract, and a privacy regression that asserts diary content cannot leak through the streaming UI. Eval extended with a defence-in-depth check that asserts post-scrub output is clean even when the live LLM leaks.

Closes #259

* fix(memory): wire diary clean button on page load + harden sweep semantics

Field bug: clicking "Clean up deflection narration" did nothing. The handler was wired inside `initGraph()`, which only runs when the user opens the Knowledge tab. Diary is the default tab, so a user who never visited Knowledge first never had the click handler attached. Wires the handler in the always-run page setup section. Adds a regression test that asserts the wiring is structurally outside `initGraph()` so a future refactor cannot reintroduce the bug.

While here, addresses the verified findings from the multi-agent review:
- Preserve `ts_utc` on bulk-sweep writes. `db.upsert_conversation_summary` gains an optional `ts_utc` parameter; the sweep passes through the row's original write time so a maintenance pass cannot stomp the audit trail (every cleaned row would otherwise look as though it had been written today).
- Refresh the vector embedding inline when the bulk sweep rewrites a row. Without this the embedding stays anchored to the pre-scrub text and vector search drifts from FTS results. Best-effort: an embedding service failure is logged but does not roll back the summary write. When the caller has no embed model configured (offline-only setup), the sweep skips re-embedding and reports `embedding_refreshed=False`.
- Surface only the exception class name through the streaming UI's `error` field (and the modal's status copy on terminal errors). A stringified Python exception can echo offending input back to the caller; class names cannot.
- Construct modal log entries with `textContent` on a real DOM node rather than `innerHTML +=`. Defence-in-depth: even with the privacy whitelist, no path should be able to inject markup into the modal log.
- Close the database connection in `finally` so a mid-iteration exception cannot orphan it until GC.
- Lock the streaming progress event shape behind a whitelist test (`test_progress_event_keys_are_a_known_whitelist`). Any future field added to the sweep's event dict that could carry summary text now trips this test, forcing deliberate review.

Adds tests for the `stated…` and `indicated…` regex branches that had zero coverage, multi-line summaries with `\n` between sentences, and summaries without terminal punctuation. README troubleshooting entry for users hitting poisoned diary rows; spec updated with the audit-trail and embedding-refresh contracts.

* docs(memory): pin diary-scrub cache invariant + regression test

Audit triggered by the hot-cache surface introduced in #281: does the deflection scrub need a listener-style invalidation hook similar to the graph's invalidate_warm_profile? Verified the answer is no, and pinned the contract so a future change that starts caching diary content has to opt in deliberately: the hot cache holds only graph-derived and per-query values (no diary text), diary search reads SQLite live on every enrichment-bearing turn, and sweep/reply concurrency is handled by SQLite WAL. Spec gains a "Cache invariant" paragraph; a regression test seeds a row, populates the engine's hot cache with extractor params, runs the scrub without invalidating the cache, and asserts the follow-up search returns cleaned content.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
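The sentence-level scrub described above can be sketched as follows. The patterns and the sentence splitter here are illustrative stand-ins; the real list of deflection regexes is broader, and the splitting rules handle cases (newlines between sentences, missing terminal punctuation) not shown in this minimal version.

```python
import re

# Narrow, illustrative deflection patterns; the real set is broader.
_DEFLECTION_RES = [
    re.compile(r"\bthe assistant (?:could not|was unable to)\b", re.IGNORECASE),
    re.compile(r"\boffered to search\b", re.IGNORECASE),
    re.compile(r"\blacks access\b", re.IGNORECASE),
]

# Split after sentence-ending punctuation followed by whitespace.
_SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def scrub_deflection_sentences(summary: str) -> str:
    """Drop whole sentences that match a deflection pattern.

    Keeps the original text if scrubbing would empty the row (an empty
    diary entry is worse than a slightly-leaky one). Idempotent: running
    it on already-clean text returns the text unchanged.
    """
    sentences = _SENTENCE_SPLIT.split(summary.strip())
    kept = [s for s in sentences
            if not any(p.search(s) for p in _DEFLECTION_RES)]
    cleaned = " ".join(kept).strip()
    return cleaned if cleaned else summary
```

Dropping whole sentences rather than rewriting them keeps the operation deterministic and trivially idempotent, which is what lets the same function back both the write-time hook and the bulk sweep.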
isair added a commit that referenced this pull request on May 3, 2026
* fix(planner): allow hyphens in tool-name head regex for MCP tools

`_TOOL_NAME_HEAD_RE` only matched `[A-Za-z_][A-Za-z0-9_]*`, so MCP tool names like `chrome-devtools__navigate_page` were truncated at the hyphen to `chrome`, which is not in the catalogue. `tool_names_in_plan` returned no known names, `plan_has_unresolved_tool_steps` flipped to True, and the engine skipped the direct-exec path entirely. The small chat model then took the turn with a 3-tool catalogue, flaked on the dunder MCP name, and emitted nothing, producing the "Sorry, I had trouble processing that" fallback for any "navigate to <site>" command. Widening the character class to include `-` lets the regex extract the full hyphenated name so direct-exec fires as designed.

* fix(router): don't cache fall-open fallback; use keyword scoring

PR #281 added a conversation-scoped cache for the tool router output so identical follow-ups skip the LLM call. The cache stores whatever `select_tools` returns, including the "fall open to all tools" fallback that fires when the router LLM times out, returns empty, or emits text with no recognised tool name. One small-model fluke now pins `allowed_tools` to the full ~41-tool catalogue for the rest of the session, the planner gets overwhelmed and emits prose instead of naming a tool, and the small chat model produces an empty reply ("Sorry, I had trouble processing that") on every subsequent turn.

Two changes:
1. Engine refuses to cache the fall-open. When `routed_tools` equals the full catalogue (set equality, not just length), skip the cache write so the next turn re-rolls the router. Re-rolling is cheap and almost always recovers.
2. `_select_llm` now falls back to the keyword strategy on timeout, empty response, or parse failure, instead of returning every tool. Keyword scoring matches "navigate" to `chrome-devtools__navigate_page` deterministically, and the engine's `toolSearchTool` escape hatch still lets the chat model widen mid-loop if keyword missed.

Specs (`selection.spec.md`, `reply.spec.md`) updated to match.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Cuts wasted LLM and SQLite work when a conversation stays in the same hot window. Targets the current architecture (warm profile + planner-first routing + optional memory recall), not the pre-rebase one.
- `DialogueMemory` now persists in-loop assistant-with-tool_calls + tool-role messages; the engine re-injects them on the next turn so follow-ups synthesise from prior results instead of re-fetching. Capped via `cfg.tool_carryover_max_turns` / `tool_carryover_per_entry_chars`, scrubbed of secrets, fence markers preserved on truncation, cleared on stop, excluded from diary input. Hard storage backstop (`_tool_turns_max_storage = 16`) so a pathological session cannot grow the list unbounded.
- Recall gate (`\w{3,}` + `re.UNICODE`, no LLM, language-agnostic) that skips diary + graph + memory-digest enrichment when the hot window already grounds a follow-up. Fail-open. Bypassed when the planner explicitly emits `searchMemory` so planner intent always wins (C1 fix). Spec at `src/jarvis/memory/recall_gate.spec.md`.
- `DialogueMemory` gains one primitive (`hot_cache_get/put/clear`, auto-expires with `RECENT_WINDOW_SEC`) that memoises three idempotent per-turn computations within one hot window: the warm profile block, the memory-extractor LLM call, and the tool-router LLM call. Cleared on stop alongside tool carryover.
- Extended `redact()` patterns so debug logs and persisted tool payloads cannot leak credentials: AWS access keys (AKIA / ASIA), Stripe `sk/pk/rk_(live|test)_…`, GitHub PATs, OpenAI `sk-…`, Google `AIza…`, `Authorization: Bearer/Basic …`, plus refresh / access / id / oauth tokens and session ids.

Why this shape
The earlier architecture threw away tool messages at every turn-end and ran diary/graph recall unconditionally. The current architecture (warm profile + planner gating recall + warm chains for small models) closed the second hole; the first was still there, and three caches were paying for the same value on every follow-up. This PR closes the carryover hole and adds the cache primitive so the planner / router / extractor / warm-profile paths each pay at most once per hot window.
Test plan
- `tests/test_dialogue_memory_tool_carryover.py` (10 tests): record/retrieve, interleaving, truncation, max-turn cap, clear, expiry, secret scrub, fence-end preserved on truncation, diary isolation
- `tests/test_dialogue_memory_hot_cache.py` (new): get/put/expiry/clear; `_tool_turns` storage cap; `is_tool_message` helper across native, text-tool, and assistant-tool_calls shapes
- `tests/test_recall_gate.py` (existing) + `tests/test_engine_hot_window_caches.py` (new): planner precedence over the gate (C1 fix) and one-call-per-conversation behaviour for the warm profile, extractor, and router caches
- `tests/test_redact_extended.py` (new): vendor access keys, Authorization headers, keyword-anchored credential variants
- `tests/test_engine_tool_carryover.py` (existing): end-to-end two-turn integration, native + text-tool paths, stop-signal clears carryover
- `pytest tests/test_engine_tool_carryover.py tests/test_recall_gate.py tests/test_dialogue_memory_tool_carryover.py tests/test_dialogue_memory_hot_cache.py tests/test_redact_extended.py tests/test_engine_hot_window_caches.py tests/test_engine_planner_integration.py` → 55 passed
- Manual: `webSearch` fires across a 3-turn entity follow-up sequence
- Docs: `src/jarvis/reply/reply.spec.md`, `docs/llm_contexts.md`, `src/jarvis/memory/recall_gate.spec.md` (new), `CLAUDE.md` registry

🤖 Generated with Claude Code