
fix(memory): deterministic deflection scrub for diary writes + bulk clean button #323

Merged
isair merged 3 commits into develop from fix/diary-deflection-scrub
May 2, 2026

Conversation


@isair isair commented May 2, 2026

Summary

Closes #259.

The summariser prompt's deflection rule (added in #232) reduces but does not eliminate "the assistant could not / offered to search / did not have…" leaks on small models. Field measurement on a real diary showed roughly 40% of post-rule writes on gemma4:e2b still contained banned phrasing, and pre-rule rows had no way to be cleaned. Once recalled by enrichment, those entries prime the next reply to repeat the deflection.

This PR brings the diary up to parity with the knowledge graph's two-layer defence (extractor BANNED FACT FORMS at write-time, deterministic merge-time rewrite for historical data) by:

  1. Write-time scrub: scrub_deflection_sentences() runs on every diary write and drops whole sentences matching narrow patterns. Keeps the original if scrubbing would empty the row; an empty diary entry is worse than a slightly-leaky one. Idempotent.
  2. Bulk sweep clean button: scrub_all_diary_summaries() walks every row in conversation_summaries. Surfaced via POST /api/diary/scrub-deflections (NDJSON streaming) and a "Clean up deflection narration" button under a new Maintenance section in the diary sidebar. Preserves the row's original ts_utc so the audit trail survives the sweep, and refreshes the vector embedding inline if an embed model is configured.
  3. Privacy contract — the streaming progress endpoint emits only counts and dates, never raw summary text. The error value is the exception class name only (e.g. "RuntimeError"), never the stringified message, because Python exception messages can echo offending input back to the caller. The progress event shape is locked behind a whitelist regression test so any future field addition forces deliberate review.
  4. Modal copy is explicit — tells users exactly what is removed (deflection sentences) and what stays (everything else, including the row itself if it would otherwise be emptied), so the clean button does not feel like a silent data-loss action. Modal log entries are constructed with textContent on real DOM nodes so a malformed date_utc or error class name cannot inject markup.
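To make the write-time layer (item 1) concrete, here is a minimal sketch of the scrub. The pattern set below is hypothetical and far smaller than the repo's 13 phrase variants; only the contract (sentence-level drop, keep-if-empty, idempotent) follows the description above.

```python
import re

# Hypothetical pattern set for illustration; the repo's actual regexes
# cover more phrase variants and differ from these.
_DEFLECTION_PATTERNS = [
    re.compile(r"^the assistant (could not|was unable to|did not have)\b", re.IGNORECASE),
    re.compile(r"^the assistant offered to search\b", re.IGNORECASE),
    re.compile(r"^the assistant (lacks|lacked) access\b", re.IGNORECASE),
]
_SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")

def scrub_deflection_sentences(summary: str) -> str:
    """Drop whole sentences that match a deflection pattern.

    Keeps the original text if scrubbing would empty the row, and is
    idempotent: running it on already-clean text is a no-op.
    """
    kept_lines = []
    for line in summary.splitlines():
        sentences = _SENTENCE_SPLIT.split(line)
        clean = [s for s in sentences
                 if not any(p.match(s.strip()) for p in _DEFLECTION_PATTERNS)]
        if clean:
            kept_lines.append(" ".join(clean))
    result = "\n".join(kept_lines).strip()
    return result if result else summary
```

Note the final line: a total-deflection summary is returned unchanged rather than emptied, which is the "keep the original" rule from item 1.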

PR #281's caching scope is intentionally left untouched per the original brief.

Why this approach (vs. enrichment-side filter)

#259 originally proposed (A) summariser hygiene + (B) enrichment-side filter. With (A) hardened to a two-layer defence (prompt + scrub) and the bulk sweep cleaning historical poisoning, (B) becomes unnecessary by construction: there is no poisoned content left at recall time.

Field-driven design

The pattern set is intentionally narrow (English-first, "the assistant <failure verb>" canonical shape). False positives erase real content, which poisons the diary in a different way (silent fact loss). Precision over recall: a few residual leaks are survivable, an erased preference is not. The summariser prompt rule itself remains language-agnostic.

Field bug caught during review

Initial commit wired the diary maintenance click handler inside initGraph(), which only runs when the user opens the Knowledge tab. The diary tab is the default — meaning a user who clicked the button without ever visiting Knowledge first got no response. Fixed in the second commit (handler now lives on the always-run page setup path) and a regression test asserts the wiring is structurally outside initGraph() so a future refactor cannot reintroduce the bug.
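The structural guard can be pictured as a plain source-text check: find the body of initGraph() by brace counting and assert the wiring marker falls outside it. Function and marker names here are illustrative, not the repo's actual test code.

```python
def handler_wired_outside_init_graph(js_source: str, marker: str) -> bool:
    """Return True if `marker` (the click-handler wiring) appears outside
    the body of initGraph() in the given JavaScript source."""
    start = js_source.index("function initGraph()")
    depth, end = 0, len(js_source)
    for i in range(start, len(js_source)):
        if js_source[i] == "{":
            depth += 1
        elif js_source[i] == "}":
            depth -= 1
            if depth == 0:
                end = i  # closing brace of initGraph's body
                break
    wiring = js_source.index(marker)
    return not (start < wiring < end)
```

A check like this fails for source where the wiring sits inside initGraph's braces and passes when it lives on the always-run setup path.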

Test plan

  • 30 unit tests for the scrub function (tests/test_diary_deflection_scrub.py): pattern coverage across 13 phrase variants including the stated… and indicated… branches, edge cases (empty, idempotence, total-deflection, multi-line summaries with \n, summaries without terminal punctuation), and a write-path integration test that proves the scrub fires before the row is persisted.
  • 9 DB-integration tests for the bulk sweep (tests/test_diary_scrub_sweep.py): every row visited, idempotent, fail-open per row, no diary content in event payloads, total-deflection rows kept rather than emptied, ts_utc preservation, error field contains exception class name only, embedding refresh behaviour without an embed model.
  • 5 endpoint tests (tests/test_memory_viewer_diary_scrub_api.py): NDJSON streaming, write-back, privacy regression asserting diary content cannot leak via the streaming UI, progress-event key whitelist locking the shape, aggregate-count contract, regression test that the click handler is wired outside initGraph().
  • Live eval (evals/test_diary_summariser_hygiene.py): defence-in-depth check that the post-scrub output is clean even when the live LLM leaks.
  • 373 tests pass across the diary/memory/graph/engine surface. Pre-existing flake on tests/test_dialogue_memory.py::test_update_diary_preserves_new_messages_during_slow_llm reproduces on develop with no changes — not introduced by this PR.
  • Memory viewer HTML structure verified end-to-end via Flask test client (button id, label, modal copy, endpoint wiring, CSS class, handler placement outside initGraph(), XSS-hardened log construction).
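The progress-event key whitelist from the test plan can be sketched roughly as below. The key names are assumptions for illustration; the real whitelist lives in tests/test_memory_viewer_diary_scrub_api.py and may name fields differently.

```python
# Assumed event keys for illustration only.
ALLOWED_PROGRESS_KEYS = {"date_utc", "changed", "error"}

def assert_progress_event_is_privacy_safe(event: dict) -> None:
    unexpected = set(event) - ALLOWED_PROGRESS_KEYS
    # Any new field must be deliberately reviewed before it can reach
    # the streaming UI -- this is what locks the event shape.
    assert not unexpected, f"unreviewed progress-event fields: {unexpected}"
    if event.get("error") is not None:
        # Class name only: a stringified message could echo diary text.
        assert event["error"].isidentifier(), "error must be a bare class name"
```

Adding any field to the sweep's event dict then trips the assertion, forcing the review step described above.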

Spec / docs

  • src/jarvis/memory/summariser.spec.md: post-process scrub, bulk-sweep UI, privacy contract, audit-trail and embedding-refresh contracts documented alongside the existing rules; eval/regression-guard table extended.
  • docs/llm_contexts.md: summariser data flow updated to note the scrub between LLM output and DB.
  • CLAUDE.md spec registry: summariser entry refreshed to reflect the two-layer defence.
  • README.md: troubleshooting note for users hitting poisoned diary rows pointing at the new clean button.

isair and others added 3 commits May 2, 2026 19:12
…lean button

The summariser prompt rule against narrating assistant failures (added in
#232) reduces but does not eliminate the leak on small models. Field
measurement on a real diary showed roughly 40% of post-rule writes on
gemma4:e2b still contained banned phrasing, and pre-rule rows remained
poisoned with no way to clean them.

Adds a deterministic safety net mirroring the knowledge graph's two-layer
defence (extractor BANNED FACT FORMS at write-time + merge-time rewrite
for historical data):

- `scrub_deflection_sentences()`: drops whole sentences matching narrow
  regex patterns ("the assistant could not...", "offered to search...",
  "lacks access...", etc.). Keeps the original if scrubbing would empty
  the row; an empty diary entry is worse than a slightly-leaky one.
  Idempotent.
- Wired into `update_daily_conversation_summary` so every new write is
  cleaned before it lands in `conversation_summaries`.
- `scrub_all_diary_summaries()`: bulk sweep over every row; same
  semantics, fail-open per row, streams privacy-safe events (counts only,
  never raw summary text).
- `POST /api/diary/scrub-deflections`: NDJSON-streaming endpoint backed
  by the bulk sweep.
- Memory viewer: "Clean up deflection narration" button under a new
  Maintenance section in the diary sidebar. Modal copy is explicit about
  what is removed and what stays, so users do not worry about silent
  data loss.

Spec updates: `summariser.spec.md` documents the post-process layer + UI;
`docs/llm_contexts.md` notes the scrub in the summariser data flow;
CLAUDE.md spec registry refreshed.

Tests: 30 unit tests covering scrub patterns, edge cases (empty input,
total-deflection rows, idempotence), DB integration of the sweep,
streaming endpoint contract, and a privacy regression that asserts
diary content cannot leak through the streaming UI. Eval extended with
a defence-in-depth check that asserts post-scrub output is clean even
when the live LLM leaks.

Closes #259

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ntics

Field bug: clicking "Clean up deflection narration" did nothing. The
handler was wired inside `initGraph()`, which only runs when the user
opens the Knowledge tab. Diary is the default tab, so a user who never
visited Knowledge first never had the click handler attached.

Wires the handler in the always-run page setup section. Adds a
regression test that asserts the wiring is structurally outside
`initGraph()` so a future refactor cannot reintroduce the bug.

While here, addresses the verified findings from the multi-agent review:

- Preserve `ts_utc` on bulk-sweep writes. `db.upsert_conversation_summary`
  gains an optional `ts_utc` parameter; the sweep passes through the
  row's original write time so a maintenance pass cannot stomp the
  audit trail (every cleaned row would otherwise look as though it had
  been written today).
- Refresh the vector embedding inline when the bulk sweep rewrites a
  row. Without this the embedding stays anchored to the pre-scrub text
  and vector search drifts from FTS results. Best-effort: an embedding
  service failure is logged but does not roll back the summary write.
  When the caller has no embed model configured (offline-only setup),
  the sweep skips re-embedding and reports `embedding_refreshed=False`.
- Surface only the exception class name through the streaming UI's
  `error` field (and the modal's status copy on terminal errors). A
  stringified Python exception can echo offending input back to the
  caller; class names cannot.
- Construct modal log entries with `textContent` on a real DOM node
  rather than `innerHTML +=`. Defence-in-depth: even with the privacy
  whitelist, no path should be able to inject markup into the modal
  log.
- Close the database connection in `finally` so a mid-iteration
  exception cannot orphan it until GC.
- Lock the streaming progress event shape behind a whitelist test
  (`test_progress_event_keys_are_a_known_whitelist`). Any future field
  added to the sweep's event dict that could carry summary text now
  trips this test, forcing deliberate review.

Adds tests for the `stated…` and `indicated…` regex branches that had
zero coverage, multi-line summaries with `\n` between sentences, and
summaries without terminal punctuation.

README troubleshooting entry for users hitting poisoned diary rows;
spec updated with the audit-trail and embedding-refresh contracts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Audit triggered by the hot-cache surface introduced in #281: does the
deflection scrub need a listener-style invalidation hook similar to the
graph's invalidate_warm_profile?

Verified the answer is no, and pinned the contract so a future change
that starts caching diary content has to opt in deliberately:

- DialogueMemory's hot cache holds three values: warm_profile_block
  (graph-derived, not diary), the per-query router decision, and the
  per-query memory-extractor parameters. None contain diary text.
- search_conversation_memory_by_keywords reads SQLite live on every
  enrichment-bearing turn; the engine never stashes diary entries
  across turns.
- Concurrency between the sweep and an in-flight reply is handled by
  SQLite WAL: separate Database instances on different connections,
  WAL serialises writes against reads at the file level.

Spec gains a "Cache invariant" paragraph documenting the rule and the
counterpart relationship to the graph's listener pattern. Regression
test seeds a row, populates the engine's hot cache with extractor
params, runs the scrub without invalidating the cache, and asserts the
follow-up search returns cleaned content. If a future change adds a
diary-content cache without an invalidation path, this test still
passes while the user observes stale results, so the spec note is the
load-bearing piece; the test catches the obvious regression.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
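The no-invalidation argument in that commit reduces to one property that can be demonstrated in isolation: because search reads SQLite live on every call, a sweep's rewrite is visible to the very next search with no cache hook at all. Schema and function names below are illustrative, not the repo's.

```python
import os
import sqlite3
import tempfile

def search_summaries(db_path, keyword):
    # Reads SQLite live on every call; no diary text is cached across turns.
    conn = sqlite3.connect(db_path)
    try:
        return [row[0] for row in conn.execute(
            "SELECT summary FROM conversation_summaries WHERE summary LIKE ?",
            (f"%{keyword}%",))]
    finally:
        conn.close()

# Seed a poisoned row, search once (a prior turn), rewrite the row as the
# sweep would, then search again without any invalidation step.
path = os.path.join(tempfile.mkdtemp(), "diary.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE conversation_summaries (date_utc TEXT, summary TEXT)")
conn.execute("INSERT INTO conversation_summaries VALUES (?, ?)",
             ("2026-05-01", "User likes tea. The assistant could not find the recipe."))
conn.commit(); conn.close()

before = search_summaries(path, "tea")
conn = sqlite3.connect(path)
conn.execute("UPDATE conversation_summaries SET summary = ?", ("User likes tea.",))
conn.commit(); conn.close()
after = search_summaries(path, "tea")
```

The second search sees the cleaned text immediately, which is why only a future diary-content cache (the case the spec's "Cache invariant" paragraph guards) would need a listener-style hook.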
@isair isair merged commit 04ec7d5 into develop May 2, 2026
2 checks passed
@isair isair deleted the fix/diary-deflection-scrub branch May 2, 2026 18:53
