…-exceeded [ENG-344] Error message for tokens exceeded
* ENG-350: cap scratchpad inactivity window regardless of estimate
The inactivity timer scaled unbounded with the cell's estimate
(inactivity = est*0.5), so an over-estimated cell (e.g. est=600) allowed
5 minutes of *silent* execution before being killed — a core cause of
cells appearing to "run forever" with no output.
Clamp the silence window to cell_inactivity_max (default 60s, tunable via
ANTON_CELL_INACTIVITY_MAX). stdout/progress() still reset the window, so
legitimate long-but-active cells (e.g. a batch loop pinging progress())
are unaffected. The total timeout is deliberately left scaling so those
active cells can run to completion; only genuinely stuck/silent cells die
fast now.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
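The clamp described in this commit can be sketched as below. This is a minimal illustration, not the project's code: the function name `inactivity_window` and the way the environment variable is read are assumptions; only the `est*0.5` scaling, the 60s default, and the `ANTON_CELL_INACTIVITY_MAX` name come from the commit message.

```python
import os
from typing import Optional

# Default taken from the commit message (cell_inactivity_max = 60s).
DEFAULT_CELL_INACTIVITY_MAX = 60.0


def inactivity_window(estimate_s: float, inactivity_max: Optional[float] = None) -> float:
    """Return the allowed silence window in seconds (hypothetical helper)."""
    if inactivity_max is None:
        # Assumption: the override is read from the environment as a float.
        inactivity_max = float(
            os.environ.get("ANTON_CELL_INACTIVITY_MAX", DEFAULT_CELL_INACTIVITY_MAX)
        )
    # Before: estimate_s * 0.5 with no upper bound. After: clamped to the cap.
    return min(estimate_s * 0.5, inactivity_max)
```

With a cap of 60, an over-estimated cell (est=600) gets a 60s silence window instead of 300s, while a small cell (est=40) keeps its 20s window.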
* ENG-350: self-correcting empty-code failure + name-agnostic kill-loop detector
Two related fixes for the large-output retry loop:
1. Empty `code` on an exec is the large-payload drop (oversized arg
truncated to "" in transit), not a no-op. Replace the bare
"No code provided." with actionable recovery guidance (write to disk
in small append steps, or generate in-cell) and phrase it as a failure
so the per-tool error streak in _apply_error_tracking counts it toward
the circuit breaker instead of silently resetting on every retry.
2. detect_kill_loop now fires on >= N kills across the turn regardless of
scratchpad name, not just N kills on one name. Renaming the scratchpad
between failed attempts used to split the count across buckets and hide
the loop. Updated the corresponding test to assert the new behavior.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
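To illustrate why counting kills per name hides the loop, here is a small sketch comparing the two counting strategies. The function names, the event shape `(scratchpad_name, was_killed)`, and the threshold of 3 are assumptions for illustration; the real `detect_kill_loop` signature is not shown in the commit.

```python
from collections import Counter
from typing import Iterable, Tuple

Event = Tuple[str, bool]  # (scratchpad_name, was_killed), hypothetical shape


def kill_loop_per_name(events: Iterable[Event], threshold: int = 3) -> bool:
    """Old behaviour: fire only when one name accumulates `threshold` kills."""
    counts = Counter(name for name, killed in events if killed)
    return any(count >= threshold for count in counts.values())


def kill_loop_any_name(events: Iterable[Event], threshold: int = 3) -> bool:
    """New behaviour: fire on `threshold` kills across the turn, any name."""
    return sum(1 for _name, killed in events if killed) >= threshold
```

Three kills spread over `build_pres`, `write_html`, and `pres1` stay under the per-name threshold but trip the name-agnostic one.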
* ENG-350: make the resilience nudge failure-type-aware
The generic RESILIENCE_NUDGE is scrape/fetch advice ("try a public API /
archive.org / different headers"). Appended to a repeated *scratchpad*
failure it misdirects — a cell that's too big or too slow doesn't need a
different data source, it needs to be chunked or scoped down, which is what
pushed the model toward rename-and-retry churn.
Add SCRATCHPAD_SIZE_NUDGE and SCRATCHPAD_TIMEOUT_NUDGE and route by failure
type in _select_resilience_nudge: scratchpad timeout -> "make the cell
smaller / split the loop / use progress()"; scratchpad empty-code/too-big ->
"write incrementally or generate in-cell"; generic scratchpad errors get no
(misleading) nudge; all other tools keep the generic nudge.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
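The routing described in this commit might look roughly like the sketch below. The `Failure` enum, the function name `select_nudge`, and the nudge texts are placeholders; the commit names the constants and `_select_resilience_nudge` but does not show their wording or signature.

```python
from enum import Enum


class Failure(Enum):
    # Hypothetical classification of a repeated failure.
    SCRATCHPAD_TIMEOUT = "scratchpad_timeout"
    SCRATCHPAD_EMPTY_CODE = "scratchpad_empty_code"
    SCRATCHPAD_OTHER = "scratchpad_other"
    OTHER_TOOL = "other_tool"


# Placeholder texts paraphrasing the commit message, not the real constants.
RESILIENCE_NUDGE = "Try a public API, archive.org, or different headers."
SCRATCHPAD_SIZE_NUDGE = "Write incrementally or generate the content in-cell."
SCRATCHPAD_TIMEOUT_NUDGE = "Make the cell smaller, split the loop, or use progress()."


def select_nudge(failure: Failure) -> str:
    """Pick a nudge that matches the failure type, as of this commit."""
    if failure is Failure.SCRATCHPAD_TIMEOUT:
        return SCRATCHPAD_TIMEOUT_NUDGE
    if failure is Failure.SCRATCHPAD_EMPTY_CODE:
        return SCRATCHPAD_SIZE_NUDGE
    if failure is Failure.SCRATCHPAD_OTHER:
        return ""  # no scrape/fetch advice for generic scratchpad errors
    return RESILIENCE_NUDGE
```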
* ENG-350: challenge a second scratchpad per task (enforce single-scratchpad)
The agent frequently spins up multiple scratchpads for one task
(build_pres -> write_html -> pres1 ...). Each name is a separate isolated
process, so state from one isn't visible in another — the model re-imports,
re-fetches, and shuffles state across pads, burning rounds. The prompt
already says to use ONE scratchpad; this enforces it.
handle_scratchpad now challenges an exec on a NEW scratchpad name when the
agent already has one in use this task, returning guidance to reuse the
existing pad. An explicit confirm_new_scratchpad=true (new optional schema
field) bypasses it for the rare genuine-isolation case.
Names are tracked in session._agent_scratchpad_names (only names the agent
exec'd), NOT _scratchpads.pads — so system-created pads (e.g. the artifact
backend launcher's slug pad) never count against the agent. The challenge
carries no error marker so it doesn't trip the circuit breaker.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
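A rough sketch of the guard, under stated assumptions: the `Session` class, the function name `challenge_new_scratchpad`, and the challenge wording are hypothetical; only the tracked-names idea and the `confirm_new_scratchpad` bypass come from the commit message.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Session:
    # Stand-in for session._agent_scratchpad_names: only names the agent exec'd.
    agent_scratchpad_names: Set[str] = field(default_factory=set)


def challenge_new_scratchpad(
    session: Session, name: str, confirm_new_scratchpad: bool = False
) -> Optional[str]:
    """Return a challenge string, or None when the exec may proceed."""
    known = session.agent_scratchpad_names
    if known and name not in known and not confirm_new_scratchpad:
        existing = ", ".join(sorted(known))
        # Plain guidance with no error marker, so no error streak is touched.
        return (
            f"A scratchpad is already in use for this task ({existing}). "
            "Reuse it, or pass confirm_new_scratchpad=true if isolation is needed."
        )
    known.add(name)
    return None
```

The first name always passes; a second name is challenged once the set is non-empty, unless the caller confirms explicitly.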
* ENG-350: adversarial-review hardening
Fixes found by red-teaming the PR:
1. Single-scratchpad guard could induce its own loop. The challenge returns
a non-error string, so a model that keeps requesting new names without
confirming would be re-challenged every round with nothing to stop it
(a non-error reply never adds to the error streak, so the circuit breaker
never fires). Now challenge AT MOST ONCE per session
(session._scratchpad_challenged), then respect the model's choice; one firm
nudge is the enforcement.
2. Failure-type nudge over-matched. Keying the size nudge on "too large"/
"truncated" would misfire on unrelated errors (e.g. a MySQL "Data
truncated for column" warning). Match the empty-code message phrase
("argument was empty") specifically.
3. No total backstop for an actively-printing runaway. The inactivity cap
can't catch a cell that keeps producing output (while True: print(...)).
Add optional cell_total_max (default 0 = off) so operators can bound
total runtime without clipping legit long batch loops; apply the
inactivity cap consistently to both estimate branches.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
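Item 2 can be illustrated with a short comparison of the loose and specific matchers. Both function names and the sample messages are made up for illustration; only the keywords and the "argument was empty" phrase come from the commit message.

```python
EMPTY_CODE_PHRASE = "argument was empty"


def is_size_failure_loose(message: str) -> bool:
    """Earlier matching: any mention of 'too large' or 'truncated'."""
    lowered = message.lower()
    return "too large" in lowered or "truncated" in lowered


def is_size_failure_specific(message: str) -> bool:
    """Hardened matching: only the empty-code failure phrase."""
    return EMPTY_CODE_PHRASE in message.lower()
```

A MySQL-style warning such as "Data truncated for column 'price' at row 1" matches the loose check but not the specific one, so it no longer receives the size nudge.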
* ENG-350: make the guard + scratchpad ACC events fire on the streaming path too
Review finding: anton has two exec paths. turn() (CLI) routes through
handle_scratchpad; turn_stream() (what cowork/cowork-server uses) handles
exec inline and bypasses handle_scratchpad. So the single-scratchpad guard
and the scratchpad ACC events (scratchpad_call/killed/empty_code) — all of
which lived in handle_scratchpad — never fired in the streaming product,
leaving detect_kill_loop / detect_oversized_cell / detect_name_switch blind
there and the guard inert.
Fix by centralizing on the shared entry points both paths already call:
- Move the single-scratchpad guard and the pre-execute ACC events
(scratchpad_empty_code, scratchpad_call) into prepare_scratchpad_exec.
- Add observe_scratchpad_cell() for the post-execute event (killed vs result)
and call it from BOTH handle_scratchpad and the streaming exec block.
handle_scratchpad now just delegates; net ACC events emitted on the CLI path
are unchanged. Incidental fix: a failed package-install string no longer
mis-emits scratchpad_empty_code (the emit is now scoped to the empty-code
branch inside prepare).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
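The centralization can be sketched as two shared helpers that both exec paths call. This is an illustrative reduction: the signatures, the `emit`/`execute` callables, the `stream_exec` name, the `scratchpad_result` event name, and the failure text are assumptions; the commit only names `prepare_scratchpad_exec`, `observe_scratchpad_cell`, `handle_scratchpad`, and the call/killed/empty_code events.

```python
from typing import Callable, Iterator, Optional, Tuple

Emit = Callable[[str], None]
Execute = Callable[[str], Tuple[str, bool]]  # returns (output, was_killed)


def prepare_scratchpad_exec(code: str, emit: Emit) -> Optional[str]:
    """Shared pre-execute step: returns an early reply, or None to proceed."""
    if not code.strip():
        emit("scratchpad_empty_code")  # scoped to the empty-code branch only
        return "Scratchpad exec failed: the code argument was empty."
    emit("scratchpad_call")
    return None


def observe_scratchpad_cell(killed: bool, emit: Emit) -> None:
    """Shared post-execute step: record killed vs. normal result."""
    emit("scratchpad_killed" if killed else "scratchpad_result")


def handle_scratchpad(code: str, execute: Execute, emit: Emit) -> str:
    """CLI-style path: delegates to the shared helpers."""
    early = prepare_scratchpad_exec(code, emit)
    if early is not None:
        return early
    output, killed = execute(code)
    observe_scratchpad_cell(killed, emit)
    return output


def stream_exec(code: str, execute: Execute, emit: Emit) -> Iterator[str]:
    """Streaming-style path: same helpers, results yielded instead of returned."""
    early = prepare_scratchpad_exec(code, emit)
    if early is not None:
        yield early
        return
    output, killed = execute(code)
    observe_scratchpad_cell(killed, emit)
    yield output
```

Because both paths go through the same two helpers, they emit the same event sequence for the same input.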
* ENG-350: default ANTON_ACC_MODE=active (mid-turn self-correction on by default)
Team decision: ship the ACC mid-turn nudge on by default rather than gate it
behind an eval — an off-by-default flag rots into forgotten noise, and the
self-correction is most of the value. Revert path stays one env var
(ANTON_ACC_MODE=passive to learn-next-turn, =off to disable).
Running the e2e suite under the new default surfaced an interaction from the
earlier guardrail work (not the flip itself):
- _select_resilience_nudge returned "" for a generic scratchpad error,
suppressing the nudge entirely. The ticket only called for avoiding
scraping advice on size/timeout failures — a generic syntax/runtime error
should still get the generic "failed twice, change approach" nudge. Fixed.
- The two loop-safety e2e scenarios queued errors across DISTINCT scratchpad
names, which now trips the single-scratchpad guard (challenge resets the
streak) and conflates with the breaker/nudge being tested. Reused one name
so they exercise the consecutive-error path they're actually about; the
guard has its own tests.
Full suite green: 35 e2e + 181 unit.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
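Reading the mode with the new default could look like the sketch below. The function name `acc_mode` and the fallback for unrecognized values are assumptions; the three values and the `active` default come from the commit message.

```python
import os
from typing import Mapping, Optional

VALID_ACC_MODES = ("active", "passive", "off")


def acc_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve ANTON_ACC_MODE, defaulting to 'active' (hypothetical helper)."""
    env = os.environ if env is None else env
    raw = env.get("ANTON_ACC_MODE", "active").strip().lower()
    # Assumption: unknown values fall back to the default rather than raising.
    return raw if raw in VALID_ACC_MODES else "active"
```

Passing a plain dict instead of `os.environ` keeps the helper easy to test; `ANTON_ACC_MODE=passive` or `=off` is then the single-variable revert path.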
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds an act_first config flag (AntonSettings.act_first / ChatSessionConfig.act_first,
default True) that selects the conversation-discipline posture:
• act_first=True → bias toward action; act on reasonable defaults and STATE
each assumption inline as it's made so the user can redirect mid-flight;
only stop to ask when a wrong guess is costly/irreversible or unknowable.
• act_first=False → the previous cautious ask-first discipline.
prompts.py exposes both blocks (CONVERSATION_DISCIPLINE_ACT_FIRST/ASK_FIRST) and a
{conversation_discipline} slot; the builder picks one from the flag. Wired through
session + the chat_session/chat/runtime entry points.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
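A minimal sketch of the slot-and-flag wiring, assuming a simple `str.format` template. The block texts, the template, and the `build_system_prompt` name are placeholders; only the two constant names, the `{conversation_discipline}` slot, and the `act_first` default come from the commit message.

```python
# Placeholder block texts; the real prompt wording is not shown in the commit.
CONVERSATION_DISCIPLINE_ACT_FIRST = (
    "Act on reasonable defaults and state each assumption inline. "
    "Ask only when a wrong guess is costly, irreversible, or unknowable."
)
CONVERSATION_DISCIPLINE_ASK_FIRST = "Ask clarifying questions before acting."

# Hypothetical template with the slot named in the commit message.
SYSTEM_PROMPT_TEMPLATE = "<system preamble>\n\n{conversation_discipline}\n"


def build_system_prompt(act_first: bool = True) -> str:
    """Fill the discipline slot from the flag (default: act first)."""
    block = (
        CONVERSATION_DISCIPLINE_ACT_FIRST
        if act_first
        else CONVERSATION_DISCIPLINE_ASK_FIRST
    )
    return SYSTEM_PROMPT_TEMPLATE.format(conversation_discipline=block)
```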
- if a scratchpad cell errors the same way twice, change strategy (don't
  re-run the same code)
- validate output before claiming a task is done; report what was verified
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make the system-prompt prefix byte-stable across a task's turns so providers
can prefix-cache it (and so behavior is deterministic):
- date is task-anchored + date-only (ChatSessionConfig.clock, e.g. the
  conversation's created_at) instead of a per-turn minute clock;
- the relevance-filtered memory snapshot moves to the very end (volatile tail)
  so it never invalidates the stable content above it.
Stacked on the act_first branch.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
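The prefix/tail split can be sketched as follows. The function name `assemble_prompt`, the section labels, and the return shape are assumptions for illustration; the commit only states that the date is task-anchored and date-only and that the memory snapshot moves to the end.

```python
from datetime import datetime
from typing import Tuple


def assemble_prompt(
    base_prompt: str, anchor: datetime, memory_snapshot: str
) -> Tuple[str, str]:
    """Return (stable_prefix, volatile_tail) for one turn (hypothetical helper)."""
    # Date-only and anchored to the task, so it does not change between turns.
    stable_prefix = f"{base_prompt}\nDate: {anchor.date().isoformat()}\n"
    # Relevance-filtered memory varies per turn, so it goes last.
    volatile_tail = f"Relevant memory:\n{memory_snapshot}\n"
    return stable_prefix, volatile_tail
```

Two turns of the same task with different memory snapshots then share an identical prefix, which is the property a provider-side prefix cache depends on.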
Rework _summarize_history:
- 3b-light: frame the summary as REFERENCE ONLY (latest user message wins;
  don't resume superseded/cancelled work); this protects Anton's auto-continue
  verifier from resurrecting stale tasks after a compaction.
- 3b-full: emit a structured STATE RECORD (Goal/Constraints/Completed/Active
  state/Blocked/Decisions/Remaining) instead of freeform bullets, and UPDATE a
  prior summary in place (via a sentinel marker) rather than summarizing a
  summary, so 'Remaining' work survives across compactions.
Stacked on the cache-stable-prompt chain.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
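The in-place update can be illustrated with a toy version in which a dict merge stands in for the model-written summary. Everything here is an assumption except the field names: the sentinel string, the `Field: value` layout, and the `compact` function are hypothetical, and the real summarizer presumably delegates the update to the model rather than merging dicts.

```python
from typing import Dict, Optional

SENTINEL = "<<STATE RECORD>>"  # hypothetical marker; the real one is not shown
FIELDS = (
    "Goal", "Constraints", "Completed", "Active state",
    "Blocked", "Decisions", "Remaining",
)


def render(record: Dict[str, str]) -> str:
    """Serialize the record with the sentinel on the first line."""
    return "\n".join([SENTINEL] + [f"{k}: {record.get(k, '')}" for k in FIELDS])


def parse(text: str) -> Dict[str, str]:
    """Recover the fields from a previously rendered record."""
    record: Dict[str, str] = {}
    for line in text.splitlines()[1:]:
        key, _sep, value = line.partition(": ")
        if key in FIELDS:
            record[key] = value
    return record


def compact(prior_summary: Optional[str], updates: Dict[str, str]) -> str:
    """Update a prior record in place instead of summarizing the summary."""
    has_prior = bool(prior_summary) and prior_summary.startswith(SENTINEL)
    record = parse(prior_summary) if has_prior else {}
    record.update(updates)
    return render(record)
```

A second compaction that only reports new completed work leaves the earlier 'Remaining' entry intact, and the output still carries exactly one sentinel.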
tino097 approved these changes on Jun 19, 2026
compaction: structured, reference-only, in-place summary (Phase 3b)
…mmary keeps dates
PR feedback: anchoring the date to created_at but labeling it 'current' meant a
conversation resumed days/weeks later reported the wrong current time. Fix:
- prefix carries a FIXED 'Conversation started: {date}' line (cache-stable);
- the real wall clock is emitted in the volatile tail ('Current date and time: …'),
recomputed each turn, so it's always accurate and never busts the cached prefix;
- rename ChatSessionConfig.clock → started_at to match;
- the 3b summarizer now preserves key event dates so the timeline survives compaction
(per-message timestamps come from the harness embedding each message's created_at).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
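The fix separates a fixed start line from a per-turn clock line, roughly as sketched below. The function names and the exact date/time formats are assumptions; the two line labels and the `started_at` name come from the commit message.

```python
from datetime import datetime


def started_line(started_at: datetime) -> str:
    """Cache-stable prefix line: fixed for the life of the conversation."""
    return f"Conversation started: {started_at.date().isoformat()}"


def current_time_line(now: datetime) -> str:
    """Volatile tail line: recomputed every turn from the real wall clock."""
    return f"Current date and time: {now.strftime('%Y-%m-%d %H:%M')}"
```

A conversation resumed weeks later keeps the same start line in its prefix while the tail reports the actual current time.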
prompt: cache-stable assembly (2a) — task-anchored date + memory at tail
prompt: act-first posture (do first, surface assumptions) + execution discipline