Staging by pnewsam · Pull Request #192 · mindsdb/anton

pnewsam · 2026-06-18T19:59:14Z

No description provided.

* ENG-350: cap scratchpad inactivity window regardless of estimate

  The inactivity timer scaled unbounded with the cell's estimate (inactivity = est*0.5), so an over-estimated cell (e.g. est=600) allowed 5 minutes of *silent* execution before being killed, a core cause of cells appearing to "run forever" with no output. Clamp the silence window to cell_inactivity_max (default 60s, tunable via ANTON_CELL_INACTIVITY_MAX). stdout/progress() still reset the window, so legitimate long-but-active cells (e.g. a batch loop pinging progress()) are unaffected. The total timeout is deliberately left scaling so those active cells can run to completion; only genuinely stuck/silent cells die fast now.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* ENG-350: self-correcting empty-code failure + name-agnostic kill-loop detector

  Two related fixes for the large-output retry loop:

  1. Empty `code` on an exec is the large-payload drop (oversized arg truncated to "" in transit), not a no-op. Replace the bare "No code provided." with actionable recovery guidance (write to disk in small append steps, or generate in-cell) and phrase it as a failure so the per-tool error streak in _apply_error_tracking counts it toward the circuit breaker instead of silently resetting on every retry.
  2. detect_kill_loop now fires on >= N kills across the turn regardless of scratchpad name, not just N kills on one name. Renaming the scratchpad between failed attempts used to split the count across buckets and hide the loop. Updated the corresponding test to assert the new behavior.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* ENG-350: make the resilience nudge failure-type-aware

  The generic RESILIENCE_NUDGE is scrape/fetch advice ("try a public API / archive.org / different headers"). Appended to a repeated *scratchpad* failure it misdirects: a cell that's too big or too slow doesn't need a different data source, it needs to be chunked or scoped down, and the misdirection is what pushed the model toward rename-and-retry churn. Add SCRATCHPAD_SIZE_NUDGE and SCRATCHPAD_TIMEOUT_NUDGE and route by failure type in _select_resilience_nudge:

  - scratchpad timeout -> "make the cell smaller / split the loop / use progress()";
  - scratchpad empty-code/too-big -> "write incrementally or generate in-cell";
  - generic scratchpad errors get no (misleading) nudge;
  - all other tools keep the generic nudge.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* ENG-350: challenge a second scratchpad per task (enforce single-scratchpad)

  The agent frequently spins up multiple scratchpads for one task (build_pres -> write_html -> pres1 ...). Each name is a separate isolated process, so state from one isn't visible in another; the model re-imports, re-fetches, and shuffles state across pads, burning rounds. The prompt already says to use ONE scratchpad; this enforces it.

  handle_scratchpad now challenges an exec on a NEW scratchpad name when the agent already has one in use this task, returning guidance to reuse the existing pad. An explicit confirm_new_scratchpad=true (new optional schema field) bypasses it for the rare genuine-isolation case. Names are tracked in session._agent_scratchpad_names (only names the agent exec'd), NOT _scratchpads.pads, so system-created pads (e.g. the artifact backend launcher's slug pad) never count against the agent. The challenge carries no error marker so it doesn't trip the circuit breaker.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* ENG-350: adversarial-review hardening

  Fixes found by red-teaming the PR:

  1. Single-scratchpad guard could induce its own loop. The challenge returns a non-error string, so a model that keeps requesting new names without confirming would be re-challenged every round with nothing to stop it (the challenge resets no streak; circuit breaker never fires). Now challenge AT MOST ONCE per session (session._scratchpad_challenged), then respect the model's choice; one firm nudge is the enforcement.
  2. Failure-type nudge over-matched. Keying the size nudge on "too large"/"truncated" would misfire on unrelated errors (e.g. a MySQL "Data truncated for column" warning). Match the empty-code message phrase ("argument was empty") specifically.
  3. No total backstop for an actively-printing runaway. The inactivity cap can't catch a cell that keeps producing output (while True: print(...)). Add optional cell_total_max (default 0 = off) so operators can bound total runtime without clipping legit long batch loops; apply the inactivity cap consistently to both estimate branches.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* ENG-350: make the guard + scratchpad ACC events fire on the streaming path too

  Review finding: anton has two exec paths. turn() (CLI) routes through handle_scratchpad; turn_stream() (what cowork/cowork-server uses) handles exec inline and bypasses handle_scratchpad. So the single-scratchpad guard and the scratchpad ACC events (scratchpad_call/killed/empty_code), all of which lived in handle_scratchpad, never fired in the streaming product, leaving detect_kill_loop / detect_oversized_cell / detect_name_switch blind there and the guard inert.

  Fix by centralizing on the shared entry points both paths already call:

  - Move the single-scratchpad guard and the pre-execute ACC events (scratchpad_empty_code, scratchpad_call) into prepare_scratchpad_exec.
  - Add observe_scratchpad_cell() for the post-execute event (killed vs result) and call it from BOTH handle_scratchpad and the streaming exec block.

  handle_scratchpad now just delegates; net ACC events emitted on the CLI path are unchanged.

  Incidental fix: a failed package-install string no longer mis-emits scratchpad_empty_code (the emit is now scoped to the empty-code branch inside prepare).

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* ENG-350: default ANTON_ACC_MODE=active (mid-turn self-correction on by default)

  Team decision: ship the ACC mid-turn nudge on by default rather than gate it behind an eval. An off-by-default flag rots into forgotten noise, and the self-correction is most of the value. Revert path stays one env var (ANTON_ACC_MODE=passive to learn-next-turn, =off to disable).

  Running the e2e suite under the new default surfaced an interaction from the earlier guardrail work (not the flip itself):

  - _select_resilience_nudge returned "" for a generic scratchpad error, suppressing the nudge entirely. The ticket only called for avoiding scraping advice on size/timeout failures; a generic syntax/runtime error should still get the generic "failed twice, change approach" nudge. Fixed.
  - The two loop-safety e2e scenarios queued errors across DISTINCT scratchpad names, which now trips the single-scratchpad guard (challenge resets the streak) and conflates with the breaker/nudge being tested. Reused one name so they exercise the consecutive-error path they're actually about; the guard has its own tests.

  Full suite green: 35 e2e + 181 unit.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
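The inactivity clamp and the name-agnostic kill count from the ENG-350 commits can be sketched as below. This is a minimal illustration, not the code in this PR: the function signatures, the `DEFAULT_CELL_INACTIVITY_MAX` constant, and the threshold value of 3 are assumptions; only `cell_inactivity_max`, `ANTON_CELL_INACTIVITY_MAX`, the `est*0.5` scaling, the 60s default, and the name `detect_kill_loop` come from the commit messages.

```python
import os

# Assumed constant name; the 60s default is stated in the commit message.
DEFAULT_CELL_INACTIVITY_MAX = 60.0


def inactivity_window(estimate_s, cell_inactivity_max=None):
    """Return the allowed silence window in seconds.

    The window still scales with the estimate (est * 0.5), but is clamped
    so an over-estimated cell cannot sit silent for minutes.
    """
    if cell_inactivity_max is None:
        # Fall back to the environment override, then to the default.
        cell_inactivity_max = float(
            os.environ.get("ANTON_CELL_INACTIVITY_MAX", DEFAULT_CELL_INACTIVITY_MAX)
        )
    return min(estimate_s * 0.5, cell_inactivity_max)


def detect_kill_loop(killed_pad_names, threshold=3):
    """Fire on >= threshold kills in the turn, whatever the scratchpad names.

    killed_pad_names holds one entry per killed cell. Counting the whole
    list (not per-name buckets) means renaming the pad cannot hide a loop.
    The threshold of 3 is an assumed value for N.
    """
    return len(killed_pad_names) >= threshold
```

With these assumptions, a cell estimated at 600 seconds gets a 60-second silence window instead of 300, and three kills spread over three different pad names still count as one loop.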

Adds an act_first config flag (AntonSettings.act_first / ChatSessionConfig.act_first, default True) that selects the conversation-discipline posture:

• act_first=True → bias toward action; act on reasonable defaults and STATE each assumption inline as it's made so the user can redirect mid-flight; only stop to ask when a wrong guess is costly/irreversible or unknowable.
• act_first=False → the previous cautious ask-first discipline.

prompts.py exposes both blocks (CONVERSATION_DISCIPLINE_ACT_FIRST/ASK_FIRST) and a {conversation_discipline} slot; the builder picks one from the flag. Wired through session + the chat_session/chat/runtime entry points.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
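The flag-to-block selection described above might look roughly like this. It is a sketch under stated assumptions: the two discipline strings and the template are placeholder text (the real blocks live in prompts.py and are not shown on this page), and `ConfigSketch` and `build_system_prompt` are hypothetical stand-ins for ChatSessionConfig and the actual builder.

```python
from dataclasses import dataclass

# Placeholder wording; the real block contents are not part of this page.
CONVERSATION_DISCIPLINE_ACT_FIRST = (
    "Act on reasonable defaults and state each assumption inline."
)
CONVERSATION_DISCIPLINE_ASK_FIRST = (
    "Ask clarifying questions before acting."
)

# Hypothetical template with the {conversation_discipline} slot.
PROMPT_TEMPLATE = "System rules.\n\n{conversation_discipline}\n"


@dataclass
class ConfigSketch:
    """Stand-in for ChatSessionConfig; only act_first is modeled."""
    act_first: bool = True


def build_system_prompt(config):
    """Pick one discipline block from the flag and fill the slot."""
    block = (
        CONVERSATION_DISCIPLINE_ACT_FIRST
        if config.act_first
        else CONVERSATION_DISCIPLINE_ASK_FIRST
    )
    return PROMPT_TEMPLATE.format(conversation_discipline=block)
```

Keeping both blocks as constants and choosing at build time means the ask-first behavior stays reachable by flipping one setting.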

- if a scratchpad cell errors the same way twice, change strategy (don't re-run the same code)
- validate output before claiming a task is done; report what was verified

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Make the system-prompt prefix byte-stable across a task's turns so providers can prefix-cache it (and so behavior is deterministic):

- date is task-anchored + date-only (ChatSessionConfig.clock, e.g. the conversation's created_at) instead of a per-turn minute clock;
- the relevance-filtered memory snapshot moves to the very end (volatile tail) so it never invalidates the stable content above it.

Stacked on the act_first branch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Rework _summarize_history:

- 3b-light: frame the summary as REFERENCE ONLY (latest user message wins; don't resume superseded/cancelled work). This protects Anton's auto-continue verifier from resurrecting stale tasks after a compaction.
- 3b-full: emit a structured STATE RECORD (Goal/Constraints/Completed/Active state/Blocked/Decisions/Remaining) instead of freeform bullets, and UPDATE a prior summary in place (via a sentinel marker) rather than summarizing a summary, so 'Remaining' work survives across compactions.

Stacked on the cache-stable-prompt chain.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
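The "update in place via a sentinel marker" idea can be illustrated with a small deterministic sketch. The real _summarize_history presumably has a model produce the record; here the sentinel string, the one-line-per-field layout, and all function names are assumptions made only to show why 'Remaining' survives a second compaction.

```python
# Hypothetical sentinel; the actual marker text is not given in the commit.
SUMMARY_SENTINEL = "[[STATE RECORD]]"

# Field names as listed in the commit message.
FIELDS = (
    "Goal", "Constraints", "Completed", "Active state",
    "Blocked", "Decisions", "Remaining",
)


def render_state_record(state):
    """Render the record as the sentinel line plus one line per field."""
    lines = [SUMMARY_SENTINEL]
    for field in FIELDS:
        lines.append(f"{field}: {state.get(field, '')}")
    return "\n".join(lines)


def parse_state_record(text):
    """Parse a rendered record back into a dict (single-line values only)."""
    state = {}
    for line in text.splitlines()[1:]:
        key, sep, value = line.partition(": ")
        if sep and key in FIELDS:
            state[key] = value
    return state


def update_summary(prior, changes):
    """Update a prior record in place instead of summarizing a summary."""
    if prior and prior.startswith(SUMMARY_SENTINEL):
        state = parse_state_record(prior)
    else:
        state = {}
    state.update(changes)
    return render_state_record(state)
```

For example, `update_summary(update_summary(None, {"Remaining": "slides 4-6"}), {"Completed": "slides 1-3"})` still carries the `Remaining` line, because the second call edits the parsed record rather than re-summarizing its text.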

compaction: structured, reference-only, in-place summary (Phase 3b)

…mmary keeps dates

PR feedback: anchoring the date to created_at but labeling it 'current' meant a conversation resumed days/weeks later reported the wrong current time. Fix:

- prefix carries a FIXED 'Conversation started: {date}' line (cache-stable);
- the real wall clock is emitted in the volatile tail ('Current date and time: …'), recomputed each turn, so it's always accurate and never busts the cached prefix;
- rename ChatSessionConfig.clock → started_at to match;
- the 3b summarizer now preserves key event dates so the timeline survives compaction (per-message timestamps come from the harness embedding each message's created_at).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
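A rough sketch of the stable-prefix / volatile-tail split follows. The function name, argument names, and the ordering of items inside the tail are assumptions; the two label strings ('Conversation started:' and 'Current date and time:') and the idea that started_at is fixed while the wall clock is recomputed each turn come from the commit message.

```python
from datetime import datetime, timezone


def assemble_prompt(stable_body, started_at, now, memory_snapshot):
    """Return (prefix, tail) for one turn.

    prefix depends only on task-level inputs, so it is byte-identical
    across turns and can be prefix-cached. tail holds everything that
    changes per turn: the wall clock and the memory snapshot.
    """
    prefix = (
        f"{stable_body}\n"
        f"Conversation started: {started_at.date().isoformat()}\n"
    )
    tail = (
        f"Current date and time: {now.isoformat(timespec='minutes')}\n"
        f"{memory_snapshot}\n"
    )
    return prefix, tail


# Two turns of the same conversation, several days apart.
started = datetime(2026, 6, 18, 19, 59, tzinfo=timezone.utc)
turn_1 = assemble_prompt("RULES", started, started, "memory: A")
turn_2 = assemble_prompt(
    "RULES", started, datetime(2026, 6, 25, 8, 0, tzinfo=timezone.utc), "memory: B"
)
```

Here `turn_1[0] == turn_2[0]` holds even though the tails differ, which is the property a provider-side prefix cache relies on.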

prompt: cache-stable assembly (2a) — task-anchored date + memory at tail

prompt: act-first posture (do first, surface assumptions) + execution discipline

tino097 and others added 7 commits June 18, 2026 12:57

return friendly message when tokens are exceeded

f0f5708

return friendly message when tokens are exceeded

ea1cd93

Add tests for checking the messages

97e04ae

Merge pull request #189 from mindsdb/fix/eng-344-error-message-tokens…

cdd0ce2

…-exceeded [ENG-344] Error message for tokens exceeded

prompt: add two execution-discipline rules

aed3456

- if a scratchpad cell errors the same way twice, change strategy (don't re-run the same code)
- validate output before claiming a task is done; report what was verified

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

torrmal requested review from alecantu7 and Copilot and removed request for Copilot June 18, 2026 22:50

Copilot started reviewing on behalf of torrmal June 18, 2026 22:51

torrmal and others added 2 commits June 18, 2026 16:03

tino097 approved these changes Jun 19, 2026

torrmal and others added 4 commits June 19, 2026 12:55

Merge pull request #195 from mindsdb/feat/compaction-quality

a3ae216

compaction: structured, reference-only, in-place summary (Phase 3b)

Merge pull request #194 from mindsdb/feat/cache-stable-prompt

1cf63e1

prompt: cache-stable assembly (2a) — task-anchored date + memory at tail

Merge pull request #193 from mindsdb/feat/act-and-surface-assumptions

d490ae7

prompt: act-first posture (do first, surface assumptions) + execution discipline

torrmal added this pull request to the merge queue Jun 20, 2026

Merged via the queue into main with commit c140a8a Jun 20, 2026
4 of 5 checks passed

github-actions Bot locked and limited conversation to collaborators Jun 20, 2026

torrmal merged 13 commits into main from staging

pnewsam commented Jun 18, 2026


4 participants