fix(memory): deterministic deflection scrub for diary writes + bulk clean button#323
Merged
fix(memory): deterministic deflection scrub for diary writes + bulk clean button#323
Conversation
…lean button The summariser prompt rule against narrating assistant failures (added in #232) reduces but does not eliminate the leak on small models. Field measurement on a real diary showed roughly 40% of post-rule writes on gemma4:e2b still contained banned phrasing, and pre-rule rows remained poisoned with no way to clean them. Adds a deterministic safety net mirroring the knowledge graph's two-layer defence (extractor BANNED FACT FORMS at write-time + merge-time rewrite for historical data): - `scrub_deflection_sentences()`: drops whole sentences matching narrow regex patterns ("the assistant could not...", "offered to search...", "lacks access...", etc.). Keeps the original if scrubbing would empty the row; an empty diary entry is worse than a slightly-leaky one. Idempotent. - Wired into `update_daily_conversation_summary` so every new write is cleaned before it lands in `conversation_summaries`. - `scrub_all_diary_summaries()`: bulk sweep over every row; same semantics, fail-open per row, streams privacy-safe events (counts only, never raw summary text). - `POST /api/diary/scrub-deflections`: NDJSON-streaming endpoint backed by the bulk sweep. - Memory viewer: "Clean up deflection narration" button under a new Maintenance section in the diary sidebar. Modal copy is explicit about what is removed and what stays, so users do not worry about silent data loss. Spec updates: `summariser.spec.md` documents the post-process layer + UI; `docs/llm_contexts.md` notes the scrub in the summariser data flow; CLAUDE.md spec registry refreshed. Tests: 30 unit tests covering scrub patterns, edge cases (empty input, total-deflection rows, idempotence), DB integration of the sweep, streaming endpoint contract, and a privacy regression that asserts diary content cannot leak through the streaming UI. Eval extended with a defence-in-depth check that asserts post-scrub output is clean even when the live LLM leaks. Closes #259 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ntics Field bug: clicking "Clean up deflection narration" did nothing. The handler was wired inside `initGraph()`, which only runs when the user opens the Knowledge tab. Diary is the default tab, so a user who never visited Knowledge first never had the click handler attached. Wires the handler in the always-run page setup section. Adds a regression test that asserts the wiring is structurally outside `initGraph()` so a future refactor cannot reintroduce the bug. While here, addresses the verified findings from the multi-agent review: - Preserve `ts_utc` on bulk-sweep writes. `db.upsert_conversation_summary` gains an optional `ts_utc` parameter; the sweep passes through the row's original write time so a maintenance pass cannot stomp the audit trail (every cleaned row would otherwise look as though it had been written today). - Refresh the vector embedding inline when the bulk sweep rewrites a row. Without this the embedding stays anchored to the pre-scrub text and vector search drifts from FTS results. Best-effort: an embedding service failure is logged but does not roll back the summary write. When the caller has no embed model configured (offline-only setup), the sweep skips re-embedding and reports `embedding_refreshed=False`. - Surface only the exception class name through the streaming UI's `error` field (and the modal's status copy on terminal errors). A stringified Python exception can echo offending input back to the caller; class names cannot. - Construct modal log entries with `textContent` on a real DOM node rather than `innerHTML +=`. Defence-in-depth: even with the privacy whitelist, no path should be able to inject markup into the modal log. - Close the database connection in `finally` so a mid-iteration exception cannot orphan it until GC. - Lock the streaming progress event shape behind a whitelist test (`test_progress_event_keys_are_a_known_whitelist`). Any future field added to the sweep's event dict that could carry summary text now trips this test, forcing deliberate review. Adds tests for the `stated…` and `indicated…` regex branches that had zero coverage, multi-line summaries with `\n` between sentences, and summaries without terminal punctuation. README troubleshooting entry for users hitting poisoned diary rows; spec updated with the audit-trail and embedding-refresh contracts. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Audit triggered by the hot-cache surface introduced in #281: does the deflection scrub need a listener-style invalidation hook similar to the graph's invalidate_warm_profile? Verified the answer is no, and pinned the contract so a future change that starts caching diary content has to opt in deliberately: - DialogueMemory's hot cache holds three values: warm_profile_block (graph-derived, not diary), the per-query router decision, and the per-query memory-extractor parameters. None contain diary text. - search_conversation_memory_by_keywords reads SQLite live on every enrichment-bearing turn; the engine never stashes diary entries across turns. - Concurrency between the sweep and an in-flight reply is handled by SQLite WAL: separate Database instances on different connections, WAL serialises writes against reads at the file level. Spec gains a "Cache invariant" paragraph documenting the rule and the counterpart relationship to the graph's listener pattern. Regression test seeds a row, populates the engine's hot cache with extractor params, runs the scrub without invalidating the cache, and asserts the follow-up search returns cleaned content. If a future change adds a diary-content cache without an invalidation path, this test still passes while the user observes stale results, so the spec note is the load-bearing piece; the test catches the obvious regression. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #259.
The summariser prompt's deflection rule (added in #232) reduces but does not eliminate "the assistant could not / offered to search / did not have…" leaks on small models. Field measurement on a real diary showed roughly 40% of post-rule writes on gemma4:e2b still contained banned phrasing, and pre-rule rows had no way to be cleaned. Once recalled by enrichment, those entries prime the next reply to repeat the deflection.
This PR brings the diary up to parity with the knowledge graph's two-layer defence (extractor BANNED FACT FORMS at write-time, deterministic merge-time rewrite for historical data) by:
scrub_deflection_sentences()runs on every diary write and drops whole sentences matching narrow patterns. Keeps the original if scrubbing would empty the row; an empty diary entry is worse than a slightly-leaky one. Idempotent.scrub_all_diary_summaries()walks every row inconversation_summaries. Surfaced viaPOST /api/diary/scrub-deflections(NDJSON streaming) and a "Clean up deflection narration" button under a new Maintenance section in the diary sidebar. Preserves the row's originalts_utcso the audit trail survives the sweep, and refreshes the vector embedding inline if an embed model is configured.errorvalue is the exception class name only (e.g."RuntimeError"), never the stringified message, because Python exception messages can echo offending input back to the caller. The progress event shape is locked behind a whitelist regression test so any future field addition forces deliberate review.textContenton real DOM nodes so a malformeddate_utcor error class name cannot inject markup.PR #281's caching scope is intentionally left untouched per the original brief.
Why this approach (vs. enrichment-side filter)
#259 originally proposed (A) summariser hygiene + (B) enrichment-side filter. With (A) hardened to a two-layer defence (prompt + scrub) and the bulk sweep cleaning historical poisoning, (B) becomes unnecessary by construction: there is no poisoned content left at recall time.
Field-driven design
The pattern set is intentionally narrow (English-first, "the assistant
<failure verb>" canonical shape). False positives erase real content, which poisons the diary in a different way (silent fact loss). Precision over recall: a few residual leaks are survivable, an erased preference is not. The summariser prompt rule itself remains language-agnostic.Field bug caught during review
Initial commit wired the diary maintenance click handler inside
initGraph(), which only runs when the user opens the Knowledge tab. The diary tab is the default — meaning a user who clicked the button without ever visiting Knowledge first got no response. Fixed in the second commit (handler now lives on the always-run page setup path) and a regression test asserts the wiring is structurally outsideinitGraph()so a future refactor cannot reintroduce the bug.Test plan
tests/test_diary_deflection_scrub.py): pattern coverage across 13 phrase variants including thestated…andindicated…branches, edge cases (empty, idempotence, total-deflection, multi-line summaries with\n, summaries without terminal punctuation), and a write-path integration test that proves the scrub fires before the row is persisted.tests/test_diary_scrub_sweep.py): every row visited, idempotent, fail-open per row, no diary content in event payloads, total-deflection rows kept rather than emptied,ts_utcpreservation, error field contains exception class name only, embedding refresh behaviour without an embed model.tests/test_memory_viewer_diary_scrub_api.py): NDJSON streaming, write-back, privacy regression asserting diary content cannot leak via the streaming UI, progress-event key whitelist locking the shape, aggregate-count contract, regression test that the click handler is wired outsideinitGraph().evals/test_diary_summariser_hygiene.py): defence-in-depth check that the post-scrub output is clean even when the live LLM leaks.tests/test_dialogue_memory.py::test_update_diary_preserves_new_messages_during_slow_llmreproduces ondevelopwith no changes — not introduced by this PR.initGraph(), XSS-hardened log construction).Spec / docs
src/jarvis/memory/summariser.spec.md: post-process scrub, bulk-sweep UI, privacy contract, audit-trail and embedding-refresh contracts documented alongside the existing rules; eval/regression-guard table extended.docs/llm_contexts.md: summariser data flow updated to note the scrub between LLM output and DB.CLAUDE.mdspec registry: summariser entry refreshed to reflect the two-layer defence.README.md: troubleshooting note for users hitting poisoned diary rows pointing at the new clean button.