Merged
1 change: 1 addition & 0 deletions CLAUDE.md
@@ -24,6 +24,7 @@ Any code change must either adhere to our spec files perfectly or you should ask
| `src/jarvis/utils/location.spec.md` | GeoIP location detection | Privacy-first; local GeoLite2 DB only |
| `src/jarvis/memory/graph.spec.md` | Node graph memory (v2), self-organising tree, UI explorer | Dynamic structure; access-aware; auto-split/merge (future) |
| `src/jarvis/memory/summariser.spec.md` | Diary summariser prompt contract and hygiene rules (deflection, attribution, topic separation) | Summariser is the source; corrupted summaries poison every downstream consumer |
| `src/jarvis/memory/recall_gate.spec.md` | Deterministic skip-enrichment heuristic when the hot window covers a follow-up | Fail-open; language-agnostic via `\w{3,}` + `re.UNICODE`; planner intent always wins |

The LLM contexts graph at `docs/llm_contexts.md` maps every LLM call in the app (model, gating, inputs, outputs, limits, flow). Keep it up-to-date at all times: any change that adds, removes, or alters an LLM context (model resolution, timeout, cap, prompt source, gating flag, data-flow edge) must update `docs/llm_contexts.md` in the same PR.

16 changes: 14 additions & 2 deletions docs/llm_contexts.md
@@ -11,9 +11,9 @@ Every distinct LLM call in Jarvis, what feeds it, what consumes it, and how it i
- **Model / gating**: `cfg.ollama_chat_model` (the big model). Not optional. No size branching on the loop itself — size branching affects the digests/evaluator around it.
- **Inputs**:
- Redacted user query
- Recent dialogue (last 5 minutes)
- Recent dialogue (last 5 minutes), including in-loop tool-call + tool-role messages from prior replies within the active conversation (tool carryover, `DialogueMemory.record_tool_turn` / `get_recent_turns_with_tools` in [src/jarvis/memory/conversation.py](src/jarvis/memory/conversation.py); per-prompt cap via `cfg.tool_carryover_max_turns` / `tool_carryover_per_entry_chars`; storage cap `_tool_turns_max_storage = 16`; cleared on `stop` signal AND on new-conversation entry; UNTRUSTED WEB EXTRACT fence markers preserved on truncation; both `content` and `tool_calls[*].function.arguments` scrubbed on write)
- Unified system prompt from [src/jarvis/system_prompt.py](src/jarvis/system_prompt.py) + ASR note + tool-protocol guidance
- **Warm profile block** (query-agnostic User + Directives excerpt from the knowledge graph, composed by `build_warm_profile()` / `format_warm_profile_block()` in [src/jarvis/memory/graph_ops.py](src/jarvis/memory/graph_ops.py) at Step 3.5 of `reply()`; no LLM call, pure SQLite read; injected unconditionally so personalisation is the default)
- **Warm profile block** (query-agnostic User + Directives excerpt from the knowledge graph, composed by `build_warm_profile()` / `format_warm_profile_block()` in [src/jarvis/memory/graph_ops.py](src/jarvis/memory/graph_ops.py) at Step 3.5 of `reply()`; no LLM call, pure SQLite read; injected unconditionally so personalisation is the default; result cached in `DialogueMemory._hot_cache` under `DialogueMemory.WARM_PROFILE_CACHE_KEY` for the lifetime of the active conversation. Invalidated on `stop`, on new-conversation entry, AND on User/Directives graph mutations via the listener registered in [src/jarvis/daemon.py](src/jarvis/daemon.py) against `register_graph_mutation_listener` in [src/jarvis/memory/graph.py](src/jarvis/memory/graph.py); World-branch writes are ignored)
- Digested memory enrichment (optional, see #4)
- Time + location context (re-injected each turn)
- Tool schema: native via `generate_tools_json_schema()` ([src/jarvis/tools/registry.py](src/jarvis/tools/registry.py)) or text fallback via `_text_tool_call_guidance()` ([engine.py:68](src/jarvis/reply/engine.py:68))
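
The per-entry truncation described for tool carryover (fence markers preserved) could be sketched as follows — a minimal illustration only. `truncate_tool_entry`, its signature, and the marker strings are assumptions; the real logic lives inside `DialogueMemory` in [src/jarvis/memory/conversation.py](src/jarvis/memory/conversation.py):

```python
def truncate_tool_entry(text: str, max_chars: int,
                        open_marker: str, close_marker: str) -> str:
    """Trim a carried-over tool payload to max_chars while keeping the
    fence markers of an UNTRUSTED WEB EXTRACT block intact, so the model
    still sees where untrusted content begins and ends."""
    if len(text) <= max_chars:
        return text
    stripped = text.rstrip()
    if text.startswith(open_marker) and stripped.endswith(close_marker):
        # Trim only the body between the fences; reserve room for both
        # markers plus the newline re-joining them.
        body = text[len(open_marker):stripped.rfind(close_marker)]
        keep = max(0, max_chars - len(open_marker) - len(close_marker) - 1)
        return open_marker + body[:keep] + "\n" + close_marker
    return text[:max_chars]
```

A sketch like this degenerates when `max_chars` is smaller than the two markers combined (it keeps the fences and drops the whole body); the config floor of 200 chars makes that case unreachable in practice.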
@@ -44,6 +44,17 @@ Every distinct LLM call in Jarvis, what feeds it, what consumes it, and how it i
- **System prompt**: inline at [enrichment.py:35-63](src/jarvis/reply/enrichment.py:35).
- **Output**: `{keywords, from?, to?, questions?}`. Consumed by memory search in the reply engine.
- **Limits**: up to 2 retries; timeout from `llm_tools_timeout_sec`.
- **Caching**: result cached in `DialogueMemory._hot_cache` under key `enrichment:{redacted_query[+topic_hint]}` for the lifetime of the active conversation. Identical follow-ups within the same conversation reuse the dict and skip the LLM hop. Cleared by `clear_hot_cache()` on the `stop` signal and on new-conversation entry.

## 3b. Recall Gate (pre-enrichment short-circuit)

- **File**: [src/jarvis/memory/recall_gate.py](src/jarvis/memory/recall_gate.py) — `should_recall()`.
- **Trigger**: once per reply, before diary/graph/digest enrichment runs (after the planner has decided memory is potentially needed).
- **Model / gating**: NO LLM — deterministic keyword-coverage heuristic. Cheap.
- **Inputs**: query, recent dialogue (incl. tool carryover rows).
- **Output**: `False` only if the hot window contains a fresh tool result AND ≥50% of the query's content words appear in the hot-window transcript → skips diary, graph, and memory digest for this reply. Otherwise `True`. Fail-open on any exception. Content-word extraction uses `\w{3,}` with `re.UNICODE`, so the gate works for Latin, Cyrillic, CJK, Arabic, Hebrew, etc. (per CLAUDE.md "no hardcoded language patterns"). Overlap words are run through `redact()` before being written to debug logs.
- **Planner precedence**: when the planner explicitly emitted a `searchMemory` step, the gate is bypassed — the planner has more signal than coverage and overriding it would silently drop intent. The gate only short-circuits the fail-open empty-plan path.
- **Rationale**: prevents re-running diary/graph lookups when the hot window already grounds the follow-up (e.g. "his most famous song" after a Bieber webSearch).
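
The coverage heuristic can be sketched in a few lines — an illustrative simplification, not the real implementation: the parameter names are assumptions (the actual `should_recall()` in recall_gate.py reads the dialogue rows itself), and plain substring containment stands in for whatever matching the real gate does:

```python
import re

_WORD_RE = re.compile(r"\w{3,}", re.UNICODE)  # language-agnostic content words


def should_recall(query: str, hot_window: str, fresh_tool_result: bool) -> bool:
    """Return False (skip enrichment) only when the hot window already
    grounds the follow-up; True means run diary/graph/digest as usual."""
    try:
        if not fresh_tool_result:
            return True
        words = {w.casefold() for w in _WORD_RE.findall(query)}
        if not words:
            return True
        window = hot_window.casefold()
        covered = sum(1 for w in words if w in window)
        # Skip (False) only at >= 50% coverage; otherwise enrich.
        return covered / len(words) < 0.5
    except Exception:
        return True  # fail-open: when in doubt, run enrichment
```

For the Bieber example: after a webSearch whose result sits in the hot window, a follow-up like "what is his most famous song" has most of its content words covered, so the gate skips enrichment.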

## 4. Memory Digest (optional, SMALL models)

@@ -84,6 +95,7 @@ Every distinct LLM call in Jarvis, what feeds it, what consumes it, and how it i
- **System prompt**: inline (~lines 260-315). Teaches the model to pick up to 5 tools or `none`.
- **Output**: comma-separated tool names or `none`. Capped at `_LLM_MAX_SELECTED` (5). Always-included tools (`stop`, `toolSearchTool`) are unioned in regardless.
- **Limits**: `llm_timeout_sec`. On failure → all tools.
- **Caching**: `routed_tools` cached in `DialogueMemory._hot_cache` under key `router:{redacted_query}|{strategy}|{builtin-names}|{mcp-names}` for the lifetime of the active conversation. The catalogue signature lets a mid-conversation MCP refresh invalidate the cache; `context_hint` is intentionally excluded so time/location drift inside one conversation doesn't bust it. Cleared by `clear_hot_cache()` on the `stop` signal and on new-conversation entry.
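
Key construction could look like this — a hedged sketch of the documented key shape; the `router_cache_key` helper and its signature are hypothetical, and sorting the names is one plausible way to make the catalogue signature deterministic:

```python
def router_cache_key(redacted_query: str, strategy: str,
                     builtin_names: list[str], mcp_names: list[str]) -> str:
    # Documented shape: "router:{query}|{strategy}|{builtin-names}|{mcp-names}".
    # A sorted, comma-joined signature is order-insensitive, so only a real
    # catalogue change (e.g. an MCP refresh adding a tool) busts the cache.
    return "router:{}|{}|{}|{}".format(
        redacted_query, strategy,
        ",".join(sorted(builtin_names)),
        ",".join(sorted(mcp_names)),
    )
```

Note what is deliberately absent: `context_hint` never enters the key, so time/location drift within one conversation cannot invalidate the entry.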

## 8. Tool Searcher (mid-loop escape hatch)

6 changes: 6 additions & 0 deletions src/desktop_app/settings_window.py
@@ -264,6 +264,12 @@ def f(key, label, desc, cat, ftype, **kw):
f("memory_enrichment_source", "Enrichment Source",
"Which memory system enriches replies: all (diary + graph), diary only, or graph only",
"memory", "choice", choices=[("diary", "Diary only"), ("graph", "Graph only"), ("all", "All (diary + graph)")])
f("tool_carryover_max_turns", "Tool Carryover Turns",
"How many prior replies' tool results to keep visible for follow-up questions",
"memory", "int", min_val=0, max_val=10)
f("tool_carryover_per_entry_chars", "Tool Carryover Length",
"Chars kept per carried-over tool result (UNTRUSTED fence markers preserved)",
"memory", "int", min_val=200, max_val=8000, step=100)
f("agentic_max_turns", "Agentic Max Turns",
"Maximum turns in agentic tool-use loops",
"memory", "int", min_val=1, max_val=30)
14 changes: 14 additions & 0 deletions src/jarvis/config.py
@@ -169,6 +169,13 @@ class Settings:
dialogue_memory_timeout: float
memory_enrichment_max_results: int
memory_enrichment_source: str # "all", "diary", or "graph"
# Tool-call + tool-result messages from prior replies in the hot window
# are re-injected into the next turn so follow-ups can reuse them instead
# of re-fetching. These knobs cap how many prior tool turns survive and
# how much of each tool payload is retained (the fence markers of
# UNTRUSTED WEB EXTRACT blocks are preserved on truncation).
tool_carryover_max_turns: int
tool_carryover_per_entry_chars: int
# Distil diary + graph into a short relevance-filtered note via a cheap
# LLM pass before injecting into the reply system prompt. When None
# (the default), it auto-enables for SMALL models (≤7B) and stays off
@@ -470,6 +477,9 @@ def get_default_config() -> Dict[str, Any]:
"dialogue_memory_timeout": 300.0,
"memory_enrichment_max_results": 3,
"memory_enrichment_source": "all", # "all", "diary", or "graph"
# Tool carryover: cap re-injected prior tool turns + chars per entry.
"tool_carryover_max_turns": 2,
"tool_carryover_per_entry_chars": 1200,
# None = auto (on for small models ≤7B, off for large). Set true/false to force.
"memory_digest_enabled": None,
# Distil raw tool results (e.g. webSearch extracts) into a short
@@ -658,6 +668,8 @@ def load_settings() -> Settings:
memory_enrichment_source = str(merged.get("memory_enrichment_source", "all")).lower()
if memory_enrichment_source not in ("all", "diary", "graph"):
memory_enrichment_source = "all"
tool_carryover_max_turns = max(0, int(merged.get("tool_carryover_max_turns", 2)))
tool_carryover_per_entry_chars = max(200, int(merged.get("tool_carryover_per_entry_chars", 1200)))
_digest_raw = merged.get("memory_digest_enabled", None)
memory_digest_enabled: Optional[bool]
if _digest_raw is None:
@@ -818,6 +830,8 @@ def load_settings() -> Settings:
dialogue_memory_timeout=dialogue_memory_timeout,
memory_enrichment_max_results=memory_enrichment_max_results,
memory_enrichment_source=memory_enrichment_source,
tool_carryover_max_turns=tool_carryover_max_turns,
tool_carryover_per_entry_chars=tool_carryover_per_entry_chars,
memory_digest_enabled=memory_digest_enabled,
tool_result_digest_enabled=tool_result_digest_enabled,
agentic_max_turns=agentic_max_turns,
69 changes: 69 additions & 0 deletions src/jarvis/daemon.py
@@ -44,6 +44,7 @@
# Global instances for coordination between modules
_global_dialogue_memory: Optional[DialogueMemory] = None
_global_stop_requested: bool = False
_warm_profile_graph_listener = None # registered callback, kept for shutdown unregister
_global_tts_engine = None # TTS engine reference for face animation polling
_global_dictation_engine = None # Dictation engine reference for history UI

@@ -294,6 +295,7 @@ def on_token_handler(token: str):
def main() -> None:
"""Main daemon entry point."""
global _global_dialogue_memory, _global_stop_requested, _global_tts_engine, _global_dictation_engine
global _warm_profile_graph_listener

# Reset stop flag at start (in case of restart)
_global_stop_requested = False
@@ -348,6 +350,60 @@ def main() -> None:
)
print("✓ Dialogue memory initialized", flush=True)

# Wire the conversation-scoped warm-profile cache to graph mutations.
# When the User or Directives branch is mutated mid-conversation, the
# cached warm profile is dropped so the next reply rebuilds it from
# the current graph state. World-branch writes (typical webSearch
# extractions) do not touch warm profile, so they are ignored.
try:
from .memory.graph import (
BRANCH_DIRECTIVES,
BRANCH_USER,
register_graph_mutation_listener,
)

_wp_relevant_branches = {BRANCH_USER, BRANCH_DIRECTIVES}

# Read the DialogueMemory ref through the module global at fire
# time, not via closure capture, so a future singleton swap (tests
# or hot-reload) routes invalidation to the live instance instead
# of the freed one.
def _invalidate_wp_on_graph_mutation(*, action, node_id, branch):
del action, node_id # Only the branch matters for warm-profile filtering.
if branch not in _wp_relevant_branches:
return
dm = _global_dialogue_memory
if dm is None:
return
try:
dm.invalidate_warm_profile()
debug_log(
f"warm profile invalidated by {branch} graph mutation",
"memory",
)
except Exception as exc:
debug_log(
f"warm profile invalidation failed (non-fatal): {exc}",
"memory",
)

# If a previous run left a listener registered (re-entry without
# full process restart), drop it before installing the new one so
# the registry never accumulates stale closures.
if _warm_profile_graph_listener is not None:
try:
from .memory.graph import unregister_graph_mutation_listener
unregister_graph_mutation_listener(_warm_profile_graph_listener)
except Exception:
pass
register_graph_mutation_listener(_invalidate_wp_on_graph_mutation)
_warm_profile_graph_listener = _invalidate_wp_on_graph_mutation
except Exception as exc:
debug_log(
f"warm profile mutation listener wiring failed (non-fatal): {exc}",
"memory",
)

# Knowledge graph: wipe + re-seed if the on-disk shape predates the
# User/Directives/World taxonomy. Non-destructive to the diary —
# users can re-import via the memory viewer.
@@ -567,6 +623,19 @@ def stdin_monitor():
if tts is not None:
tts.stop()
db.close()

# Drop the warm-profile graph listener so the module registry does
# not retain a closure pointing at this run's DialogueMemory after
# shutdown — relevant for tests and any embedder that re-runs the
# daemon in-process.
if _warm_profile_graph_listener is not None:
try:
from .memory.graph import unregister_graph_mutation_listener
unregister_graph_mutation_listener(_warm_profile_graph_listener)
except Exception:
pass
_warm_profile_graph_listener = None

debug_log("daemon stopped", "jarvis")
print("👋 Daemon stopped", flush=True)
