
perf(reply): hot-window carryover, gate, scratch cache #281

Merged
isair merged 5 commits into develop from claude/peaceful-agnesi-6ff1d7
May 2, 2026

Conversation

@isair (Owner) commented Apr 24, 2026

Summary

Cuts wasted LLM and SQLite work when a conversation stays in the same hot window. Targets the current architecture (warm profile + planner-first routing + optional memory recall), not the pre-rebase one.

  • Tool-result carryover: DialogueMemory now persists in-loop assistant-with-tool_calls + tool-role messages; the engine re-injects them on the next turn so follow-ups synthesise from prior results instead of re-fetching. Capped via cfg.tool_carryover_max_turns / tool_carryover_per_entry_chars, scrubbed of secrets, fence markers preserved on truncation, cleared on stop, excluded from diary input. Hard storage backstop (_tool_turns_max_storage = 16) so a pathological session cannot grow the list unbounded.
  • Recall gate: deterministic keyword-coverage heuristic (\w{3,} + re.UNICODE, no LLM, language-agnostic) that skips diary + graph + memory-digest enrichment when the hot window already grounds a follow-up. Fail-open. Bypassed when the planner explicitly emits searchMemory so planner intent always wins (C1 fix). Spec at src/jarvis/memory/recall_gate.spec.md.
  • Hot-window scratch cache on DialogueMemory — one primitive (hot_cache_get/put/clear, auto-expires with RECENT_WINDOW_SEC) memoises three idempotent per-turn computations within one hot window (a minimal sketch follows this list):
    • Warm profile block (skips the User + Directives SQLite traversal on every follow-up).
    • Memory enrichment extractor (skips the small-model LLM call when the same query repeats).
    • Tool router output (skips the router LLM call when the query and tool catalogue match; catalogue signature invalidates on MCP refresh).
      Cleared on stop alongside tool carryover.
  • Extended redact() patterns so debug logs and persisted tool payloads cannot leak credentials: AWS access keys (AKIA / ASIA), Stripe sk/pk/rk_(live|test)_…, GitHub PATs, OpenAI sk-…, Google AIza…, Authorization: Bearer/Basic …, plus refresh / access / id / oauth tokens and session ids.
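
For concreteness, here is a minimal sketch of the scratch-cache primitive, assuming a lock-guarded dict keyed by string. The real DialogueMemory internals are not shown in this PR, and a later commit in this very PR replaces the age-based expiry with conversation-scoped invalidation:

```python
import threading
import time

RECENT_WINDOW_SEC = 300  # assumed value; the real constant lives in DialogueMemory


class DialogueMemory:  # excerpt: only the cache primitive is sketched
    def __init__(self) -> None:
        self._hot_cache: dict[str, tuple[float, object]] = {}
        self._lock = threading.Lock()

    def hot_cache_get(self, key: str):
        """Return the cached value, or None if absent or past the hot window."""
        with self._lock:
            entry = self._hot_cache.get(key)
            if entry is None:
                return None
            ts, value = entry
            if time.time() - ts > RECENT_WINDOW_SEC:
                del self._hot_cache[key]  # expired with the hot window
                return None
            return value

    def hot_cache_put(self, key: str, value: object) -> None:
        with self._lock:
            self._hot_cache[key] = (time.time(), value)

    def clear_hot_cache(self) -> None:
        """Called on the stop signal, alongside clearing tool carryover."""
        with self._lock:
            self._hot_cache.clear()
```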

Why this shape

The earlier architecture threw away tool messages at every turn-end and ran diary/graph recall unconditionally. The current architecture (warm profile + planner gating recall + warm chains for small models) closed the second hole; the first was still there, and three per-turn computations (the warm profile block, the enrichment extractor, and the tool router) were recomputing the same values on every follow-up. This PR closes the carryover hole and adds the cache primitive so the planner / router / extractor / warm-profile paths each pay at most once per hot window.

Test plan

  • tests/test_dialogue_memory_tool_carryover.py (10 tests): record/retrieve, interleaving, truncation, max-turn cap, clear, expiry, secret scrub, fence-end preserved on truncation, diary isolation
  • tests/test_dialogue_memory_hot_cache.py (new): get/put/expiry/clear; _tool_turns storage cap; is_tool_message helper across native, text-tool, and assistant-tool_calls shapes
  • tests/test_recall_gate.py (existing) + tests/test_engine_hot_window_caches.py (new) — cover planner precedence over the gate (C1 fix) and one-call-per-conversation behaviour for the warm profile, extractor, and router caches
  • tests/test_redact_extended.py (new): vendor access keys, Authorization headers, keyword-anchored credential variants
  • tests/test_engine_tool_carryover.py (existing): end-to-end two-turn integration, native + text-tool paths, stop-signal clears carryover
  • All focused tests pass: pytest tests/test_engine_tool_carryover.py tests/test_recall_gate.py tests/test_dialogue_memory_tool_carryover.py tests/test_dialogue_memory_hot_cache.py tests/test_redact_extended.py tests/test_engine_hot_window_caches.py tests/test_engine_planner_integration.py → 55 passed
  • Manual voice-mode smoke test: confirm only one webSearch fires across a 3-turn entity follow-up sequence
  • Specs updated: src/jarvis/reply/reply.spec.md, docs/llm_contexts.md, src/jarvis/memory/recall_gate.spec.md (new), CLAUDE.md registry

🤖 Generated with Claude Code

@isair isair force-pushed the claude/peaceful-agnesi-6ff1d7 branch 2 times, most recently from f06546c to efc3477 on April 26, 2026 16:46
@isair isair changed the title from "perf(reply): carry tool results across turns, gate redundant recall" to "perf(reply): hot-window carryover, gate, scratch cache" on Apr 26, 2026
isair and others added 2 commits April 27, 2026 00:25
When a follow-up turn lands within the hot window and the prior reply
already pulled fresh data via a tool, re-running webSearch / diary /
graph wastes calls. Two cheap, deterministic mechanisms address this:

- DialogueMemory.record_tool_turn / get_recent_turns_with_tools persist
  in-loop tool-call + tool-result messages so the next turn sees them
  as prior context. Capped by tool_carryover_max_turns and
  tool_carryover_per_entry_chars; secrets scrubbed; UNTRUSTED WEB
  EXTRACT fence markers preserved on truncation; cleared on stop.
- recall_gate.should_recall: language-agnostic content-word coverage
  heuristic (\w{3,} with re.UNICODE). Skips diary/graph enrichment
  only when the hot window already contains a fresh tool-result row
  covering >=50% of the query. Fail-open on any error.

Wires into engine.run_reply_engine via a hasattr-guarded fetch, an
idempotent _maybe_record_tool_carryover helper called on success and
error paths, and a clear_tool_carryover on the stop-signal path.

Adds cfg fields, settings UI entries, spec + llm_contexts updates,
and 19 new tests covering carryover, scrubbing, fence preservation,
stop-clear, text-tool fallback, and Cyrillic recall coverage.
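
A minimal sketch of the coverage heuristic, assuming the hot-window tool results arrive as a single text blob; the real should_recall signature and per-row freshness check in src/jarvis/memory/recall_gate.py may differ:

```python
import re

_WORD_RE = re.compile(r"\w{3,}", re.UNICODE)  # content words, language-agnostic
_COVERAGE_THRESHOLD = 0.5  # >=50% coverage lets the gate skip recall


def should_recall(query: str, hot_window_text: str) -> bool:
    """True means run diary/graph recall; False means the hot window suffices."""
    try:
        query_words = {w.lower() for w in _WORD_RE.findall(query)}
        if not query_words:
            return True  # nothing to measure, recall as usual
        window_words = {w.lower() for w in _WORD_RE.findall(hot_window_text)}
        coverage = len(query_words & window_words) / len(query_words)
        return coverage < _COVERAGE_THRESHOLD
    except Exception:
        return True  # fail open: any error means do the recall
```
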
Layer three optimisations on top of the carryover + recall gate so the
warm-profile, extractor, and router paths each pay at most once per hot
window, and tighten privacy + bounds along the way.

- DialogueMemory.hot_cache_get/put/clear: per-conversation scratch cache
  bounded by RECENT_WINDOW_SEC, used by the engine to memoise:
    - warm profile block (query-agnostic SQLite traversal)
    - memory enrichment extractor LLM call
    - tool router LLM call (catalogue signature in key invalidates on
      mid-window MCP refresh)
  Cleared on the stop signal alongside tool carryover; auto-expires with
  the hot window.

- Recall gate is now bypassed when the planner explicitly emits a
  searchMemory step. Planner intent always wins over coverage heuristics
  (C1 fix, was silently dropping memory work the planner asked for).

- record_tool_turn scrubs outside the lock and bounds storage to
  _tool_turns_max_storage = 16 as a backstop against unbounded growth in
  pathological sessions.

- Extended redact() vendor and keyword patterns: AWS AKIA/ASIA, Stripe
  sk/pk/rk_(live|test)_, GitHub PATs, OpenAI sk-, Google AIza,
  Authorization Bearer/Basic, plus refresh/access/id/oauth tokens and
  session ids. Closes the gap where carryover payloads or recall-gate
  debug logs could leak structurally-detectable credentials.

- Adds recall_gate.spec.md (scope, heuristic, language-agnostic design,
  privacy, fail-open) and registers it in CLAUDE.md.

- Tests: tests/test_dialogue_memory_hot_cache.py (cache primitives,
  is_tool_message, _tool_turns cap), tests/test_redact_extended.py
  (vendor + Authorization + keyword patterns),
  tests/test_engine_hot_window_caches.py (warm profile / extractor /
  router cache hits across two turns; planner overrides recall gate).

- Updates docs/llm_contexts.md and reply.spec.md to document the cache
  primitive, planner precedence over the gate, and new redact coverage.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
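
An illustrative subset of the vendor patterns named above, assuming a simple substitute-with-token design; the exact regexes and replacement token used by redact() are not shown in this PR:

```python
import re

_CREDENTIAL_PATTERNS = [
    re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),                  # AWS access key ids
    re.compile(r"\b[spr]k_(?:live|test)_[0-9A-Za-z]+\b"),          # Stripe sk/pk/rk keys
    re.compile(r"\bghp_[0-9A-Za-z]{36}\b"),                        # GitHub classic PATs
    re.compile(r"\bsk-[0-9A-Za-z_-]{20,}\b"),                      # OpenAI-style keys
    re.compile(r"\bAIza[0-9A-Za-z_-]{35}\b"),                      # Google API keys
    re.compile(r"Authorization:\s*(?:Bearer|Basic)\s+\S+", re.I),  # auth headers
]


def redact(text: str) -> str:
    for pattern in _CREDENTIAL_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```
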
@isair isair force-pushed the claude/peaceful-agnesi-6ff1d7 branch from bf0d26e to 0bea96b on April 26, 2026 23:25
isair and others added 3 commits April 27, 2026 21:17
…validation

Extends the reply-engine scratch cache lifetime from 5-minute hot-window
to the full active conversation, with proper invalidation hooks so the
cached warm profile and tool router output stay correct as the
conversation grows.

DialogueMemory:
- hot_cache_get/put no longer expire entries by RECENT_WINDOW_SEC age.
  Entries persist until clear_hot_cache(), invalidate_warm_profile(),
  or new-conversation reset in the engine.
- _tool_turns no longer pruned by age either; bounded by hard storage
  cap (16 turns) and cleared on new-conversation entry / stop.
- Adds WARM_PROFILE_CACHE_KEY constant so the engine and the
  graph-mutation invalidator share a single key.
- Adds invalidate_warm_profile() that drops only that one entry.
- Scrubs tool_calls.arguments (dict and JSON-string forms) so
  re-injection of the assistant tool_calls row on the next turn cannot
  leak PII captured in query args.
- _next_ts() helper guarantees strictly-monotonic timestamps for
  _messages and _tool_turns, fixing interleave ordering on Windows
  where time.time() has ~16ms granularity.
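
A sketch of the timestamp helper under those constraints; the surrounding class, attribute name, and lock discipline are assumptions:

```python
import time


class DialogueMemory:  # excerpt: only the helper is sketched
    _last_ts: float = 0.0

    def _next_ts(self) -> float:
        """Strictly increasing timestamp for _messages/_tool_turns ordering.

        time.time() ticks at ~16ms on Windows, so two appends in the same
        tick would otherwise compare equal. Caller must hold the instance
        lock.
        """
        now = time.time()
        if now <= self._last_ts:
            now = self._last_ts + 1e-6  # nudge just past the previous stamp
        self._last_ts = now
        return now
```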

graph.py:
- Adds register_graph_mutation_listener / unregister registry plus
  _resolve_branch helper (parent-walk to FIXED_BRANCH ancestor with
  depth cap). create_node / update_node / delete_node now notify
  listeners with action + node_id + branch attribution. delete
  resolves branch BEFORE the row is gone.
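
The registry might look roughly like this sketch; the (action, node_id, branch) callback shape follows the commit text, everything else is an assumption:

```python
from typing import Callable, Optional

GraphMutationListener = Callable[[str, str, Optional[str]], None]
_graph_mutation_listeners: list[GraphMutationListener] = []


def register_graph_mutation_listener(listener: GraphMutationListener) -> None:
    _graph_mutation_listeners.append(listener)


def unregister_graph_mutation_listener(listener: GraphMutationListener) -> None:
    if listener in _graph_mutation_listeners:
        _graph_mutation_listeners.remove(listener)


def _notify_listeners(action: str, node_id: str, branch: Optional[str]) -> None:
    # Called from create_node / update_node / delete_node; for deletes the
    # branch is resolved before the row is removed, as noted above.
    for listener in list(_graph_mutation_listeners):
        try:
            listener(action, node_id, branch)
        except Exception:
            pass  # a faulty listener must not fail the graph write
```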

reply/engine.py:
- New-conversation entry (has_recent_messages was False) now wipes
  hot_cache and tool carryover before the turn runs, so stale state
  from a lapsed session can't leak into a fresh one.
- Warm-profile cache lookup uses the shared cache key constant.

daemon.py:
- After DialogueMemory init, registers a graph mutation listener
  filtering on {BRANCH_USER, BRANCH_DIRECTIVES} that calls
  invalidate_warm_profile() on the global DialogueMemory. World-
  branch writes are noise for the warm profile and are ignored.

Specs (reply, graph, llm_contexts) updated to reflect conversation-
scoped cache lifetime and the new invalidation hooks.

Tests: new tests/test_graph_mutation_listener.py covers the registry
+ branch attribution + warm-profile invalidation hook end-to-end.
test_dialogue_memory_hot_cache, test_dialogue_memory_tool_carryover,
and test_engine_hot_window_caches updated for new lifetime contract,
tool_call arg scrubbing, and new-conversation cache wipe.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Addresses issues raised in review of a48be9e:

- _scrub_tool_call now also scrubs list/tuple-form arguments via a
  recursive _scrub_args helper (sketched after this commit message).
  Closes the privacy gap where a non-standard provider emitting
  positional args could leak secrets on tool re-injection.

- daemon.py reads _global_dialogue_memory through the module global
  at fire time instead of capturing it by closure, so a future
  singleton swap (tests, hot reload) routes invalidation to the
  current instance. Tracks the registered listener so it can be
  unregistered on shutdown and on re-entry, preventing stale
  closures from accumulating in the module-level registry.

- _next_ts() docstring now explains the Windows ~16ms granularity
  motivation and the lock requirement.

- Tests:
  * direct unit tests for _next_ts() (consecutive monotonic, advance
    past artificially high _last_ts).
  * _resolve_branch returns None past MAX_TRAVERSAL_DEPTH and on
    unknown node ids.
  * mutation listener is NOT called when create_node raises (no
    spurious events on failed writes).
  * tool_call arguments scrubbed when passed as list of mixed
    scalars and dicts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
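
A self-contained sketch of the recursive helper; passing the scrubber in is an illustration choice, where the real _scrub_args presumably calls this PR's redact() directly:

```python
from typing import Any, Callable


def _scrub_args(value: Any, scrub: Callable[[str], str]) -> Any:
    """Recursively scrub tool-call arguments of any shape."""
    if isinstance(value, str):
        return scrub(value)
    if isinstance(value, dict):
        return {k: _scrub_args(v, scrub) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        # list/tuple-form (positional) args from non-standard providers
        return type(value)(_scrub_args(v, scrub) for v in value)
    return value  # ints, floats, bools, None pass through untouched
```
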
Python requires all global statements to precede any use of the
variable in the same function scope. The two inner global
_warm_profile_graph_listener declarations inside try/finally blocks
were triggering a SyntaxError on startup. Consolidate the declaration
at the top of main() alongside the other module globals.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
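
A minimal reproduction of the rule with placeholder names:

```python
listener = None

# Broken shape: SyntaxError at compile time, i.e. the moment the module is
# imported on startup, regardless of whether the try block ever executes.
#
#   def main():
#       if listener is None:      # the name is used here...
#           ...
#       try:
#           global listener       # ...but declared global only here
#           listener = object()
#       finally:
#           ...

# Fixed shape: one declaration at the top of the function, before any use.
def main() -> None:
    global listener
    if listener is None:
        listener = object()
```
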
@isair isair merged commit 94a16ec into develop May 2, 2026
2 checks passed
@isair isair deleted the claude/peaceful-agnesi-6ff1d7 branch May 2, 2026 17:40
isair added a commit that referenced this pull request May 2, 2026
Audit triggered by the hot-cache surface introduced in #281: does the
deflection scrub need a listener-style invalidation hook similar to the
graph's invalidate_warm_profile?

Verified the answer is no, and pinned the contract so a future change
that starts caching diary content has to opt in deliberately:

- DialogueMemory's hot cache holds three values: warm_profile_block
  (graph-derived, not diary), the per-query router decision, and the
  per-query memory-extractor parameters. None contain diary text.
- search_conversation_memory_by_keywords reads SQLite live on every
  enrichment-bearing turn; the engine never stashes diary entries
  across turns.
- Concurrency between the sweep and an in-flight reply is handled by
  SQLite WAL: separate Database instances on different connections,
  WAL serialises writes against reads at the file level.

Spec gains a "Cache invariant" paragraph documenting the rule and the
counterpart relationship to the graph's listener pattern. Regression
test seeds a row, populates the engine's hot cache with extractor
params, runs the scrub without invalidating the cache, and asserts the
follow-up search returns cleaned content. If a future change adds a
diary-content cache without an invalidation path, this test still
passes while the user observes stale results, so the spec note is the
load-bearing piece; the test catches the obvious regression.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
isair added a commit that referenced this pull request May 2, 2026
…lean button (#323)

* fix(memory): deterministic deflection scrub for diary writes + bulk clean button

The summariser prompt rule against narrating assistant failures (added in
#232) reduces but does not eliminate the leak on small models. Field
measurement on a real diary showed roughly 40% of post-rule writes on
gemma4:e2b still contained banned phrasing, and pre-rule rows remained
poisoned with no way to clean them.

Adds a deterministic safety net mirroring the knowledge graph's two-layer
defence (extractor BANNED FACT FORMS at write-time + merge-time rewrite
for historical data):

- `scrub_deflection_sentences()`: drops whole sentences matching narrow
  regex patterns ("the assistant could not...", "offered to search...",
  "lacks access...", etc.). Keeps the original if scrubbing would empty
  the row; an empty diary entry is worse than a slightly-leaky one.
  Idempotent. A sketch follows this list.
- Wired into `update_daily_conversation_summary` so every new write is
  cleaned before it lands in `conversation_summaries`.
- `scrub_all_diary_summaries()`: bulk sweep over every row; same
  semantics, fail-open per row, streams privacy-safe events (counts only,
  never raw summary text).
- `POST /api/diary/scrub-deflections`: NDJSON-streaming endpoint backed
  by the bulk sweep.
- Memory viewer: "Clean up deflection narration" button under a new
  Maintenance section in the diary sidebar. Modal copy is explicit about
  what is removed and what stays, so users do not worry about silent
  data loss.
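
A sketch under the stated semantics; the pattern list is an illustrative subset and the sentence splitter is an assumption (the real scrub also handles newline-separated and unpunctuated rows, per the follow-up commit below):

```python
import re

_DEFLECTION_RES = [
    re.compile(r"\bthe assistant (?:could|was) not\b", re.I),
    re.compile(r"\boffered to search\b", re.I),
    re.compile(r"\blacks access\b", re.I),
]
_SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def scrub_deflection_sentences(summary: str) -> str:
    """Drop whole deflection sentences; idempotent by construction."""
    sentences = _SENTENCE_SPLIT.split(summary)
    kept = [s for s in sentences if not any(p.search(s) for p in _DEFLECTION_RES)]
    if not kept:
        return summary  # an empty diary entry is worse than a slightly-leaky one
    return " ".join(kept)
```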

Spec updates: `summariser.spec.md` documents the post-process layer + UI;
`docs/llm_contexts.md` notes the scrub in the summariser data flow;
CLAUDE.md spec registry refreshed.

Tests: 30 unit tests covering scrub patterns, edge cases (empty input,
total-deflection rows, idempotence), DB integration of the sweep,
streaming endpoint contract, and a privacy regression that asserts
diary content cannot leak through the streaming UI. Eval extended with
a defence-in-depth check that asserts post-scrub output is clean even
when the live LLM leaks.

Closes #259

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(memory): wire diary clean button on page load + harden sweep semantics

Field bug: clicking "Clean up deflection narration" did nothing. The
handler was wired inside `initGraph()`, which only runs when the user
opens the Knowledge tab. Diary is the default tab, so a user who never
visited Knowledge first never had the click handler attached.

Wires the handler in the always-run page setup section. Adds a
regression test that asserts the wiring is structurally outside
`initGraph()` so a future refactor cannot reintroduce the bug.

While here, addresses the verified findings from the multi-agent review:

- Preserve `ts_utc` on bulk-sweep writes. `db.upsert_conversation_summary`
  gains an optional `ts_utc` parameter; the sweep passes through the
  row's original write time so a maintenance pass cannot stomp the
  audit trail (every cleaned row would otherwise look as though it had
  been written today).
- Refresh the vector embedding inline when the bulk sweep rewrites a
  row. Without this the embedding stays anchored to the pre-scrub text
  and vector search drifts from FTS results. Best-effort: an embedding
  service failure is logged but does not roll back the summary write.
  When the caller has no embed model configured (offline-only setup),
  the sweep skips re-embedding and reports `embedding_refreshed=False`.
- Surface only the exception class name through the streaming UI's
  `error` field (and the modal's status copy on terminal errors). A
  stringified Python exception can echo offending input back to the
  caller; class names cannot.
- Construct modal log entries with `textContent` on a real DOM node
  rather than `innerHTML +=`. Defence-in-depth: even with the privacy
  whitelist, no path should be able to inject markup into the modal
  log.
- Close the database connection in `finally` so a mid-iteration
  exception cannot orphan it until GC.
- Lock the streaming progress event shape behind a whitelist test
  (`test_progress_event_keys_are_a_known_whitelist`). Any future field
  added to the sweep's event dict that could carry summary text now
  trips this test, forcing deliberate review.
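
The shape-locking check might reduce to something like this; `embedding_refreshed` and `error` come from the text above, while the remaining keys are placeholders:

```python
ALLOWED_KEYS = {"type", "scanned", "rewritten", "embedding_refreshed", "error"}


def assert_progress_events_within_whitelist(events: list[dict]) -> None:
    for event in events:
        unexpected = set(event) - ALLOWED_KEYS
        assert not unexpected, (
            f"unexpected progress-event field(s) {sorted(unexpected)}: any new "
            "field could carry summary text and needs deliberate review"
        )


# A counts-only event passes; an event that grew a text field would trip it.
assert_progress_events_within_whitelist([{"type": "progress", "scanned": 10}])
```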

Adds tests for the `stated…` and `indicated…` regex branches that had
zero coverage, multi-line summaries with `\n` between sentences, and
summaries without terminal punctuation.

README troubleshooting entry for users hitting poisoned diary rows;
spec updated with the audit-trail and embedding-refresh contracts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(memory): pin diary-scrub cache invariant + regression test

Audit triggered by the hot-cache surface introduced in #281: does the
deflection scrub need a listener-style invalidation hook similar to the
graph's invalidate_warm_profile?

Verified the answer is no, and pinned the contract so a future change
that starts caching diary content has to opt in deliberately:

- DialogueMemory's hot cache holds three values: warm_profile_block
  (graph-derived, not diary), the per-query router decision, and the
  per-query memory-extractor parameters. None contain diary text.
- search_conversation_memory_by_keywords reads SQLite live on every
  enrichment-bearing turn; the engine never stashes diary entries
  across turns.
- Concurrency between the sweep and an in-flight reply is handled by
  SQLite WAL: separate Database instances on different connections,
  WAL serialises writes against reads at the file level.

Spec gains a "Cache invariant" paragraph documenting the rule and the
counterpart relationship to the graph's listener pattern. Regression
test seeds a row, populates the engine's hot cache with extractor
params, runs the scrub without invalidating the cache, and asserts the
follow-up search returns cleaned content. If a future change adds a
diary-content cache without an invalidation path, this test still
passes while the user observes stale results, so the spec note is the
load-bearing piece; the test catches the obvious regression.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
isair added a commit that referenced this pull request May 3, 2026
* fix(planner): allow hyphens in tool-name head regex for MCP tools

`_TOOL_NAME_HEAD_RE` only matched `[A-Za-z_][A-Za-z0-9_]*`, so MCP tool
names like `chrome-devtools__navigate_page` were truncated at the hyphen
to `chrome` — not in the catalogue. `tool_names_in_plan` returned no
known names, `plan_has_unresolved_tool_steps` flipped to True, and the
engine skipped the direct-exec path entirely. The small chat model then
took the turn with a 3-tool catalogue, flaked on the dunder MCP name,
and emitted nothing — producing the "Sorry, I had trouble processing
that" fallback for any "navigate to <site>" command.

Widening the character class to include `-` lets the regex extract the
full hyphenated name so direct-exec fires as designed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
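
The before/after is small enough to show inline; the widened pattern here is the minimal fix and may not match the PR's exact regex:

```python
import re

# Before: the tail class stops at the first hyphen, truncating MCP names.
OLD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# After: hyphen added to the tail class, so the full name is extracted.
NEW = re.compile(r"[A-Za-z_][A-Za-z0-9_-]*")

assert OLD.match("chrome-devtools__navigate_page").group() == "chrome"
assert NEW.match("chrome-devtools__navigate_page").group() == "chrome-devtools__navigate_page"
```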

* fix(router): don't cache fall-open fallback; use keyword scoring

PR #281 added a conversation-scoped cache for the tool router output
so identical follow-ups skip the LLM call. The cache stores whatever
`select_tools` returns — including the "fall open to all tools"
fallback that fires when the router LLM times out, returns empty, or
emits text with no recognised tool name. One small-model fluke now
pins `allowed_tools` to the full ~41-tool catalogue for the rest of
the session, the planner gets overwhelmed and emits prose instead of
naming a tool, and the small chat model produces an empty reply
("Sorry, I had trouble processing that") on every subsequent turn.

Two changes:

1. Engine refuses to cache the fall-open. When `routed_tools` equals
   the full catalogue (set equality, not just length), skip the cache
   write so the next turn re-rolls the router. Re-rolling is cheap
   and almost always recovers.

2. `_select_llm` now falls back to the keyword strategy on timeout,
   empty response, or parse failure — instead of returning every
   tool. Keyword scoring matches "navigate" → `chrome-devtools__navigate_page`
   deterministically, and the engine's `toolSearchTool`
   escape hatch still lets the chat model widen mid-loop if keyword
   missed.
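
Change 1 reduces to a guard along these lines; names other than the cached router output and the set-equality rule are assumptions:

```python
def maybe_cache_router_output(memory, cache_key, routed_tools, full_catalogue) -> None:
    if set(routed_tools) == set(full_catalogue):
        # Fall-open result (timeout / empty / unparseable router output):
        # do not pin it for the conversation; re-rolling next turn is cheap.
        return
    memory.hot_cache_put(cache_key, routed_tools)
```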

Specs (`selection.spec.md`, `reply.spec.md`) updated to match.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>