bug: loop terminates immediately even when the pre-exit repair turn response appends new pending plan items, so Phase2 B1 closes as partial #327

@Kewton

Summary

Phase2 retest after #325 on HEAD bd1255a still leaves the same pack red, but the failure shape has shifted again.

In autonomy_v2_p2_fix_slice_v1_20260410_105239:

  • A1 finishes as complete_unverified
  • B1 now reaches runner-level session_completed / command_return_code=0
  • pre_exit_repair_injected_count=1 and pre_exit_repair_consumed_count=1 are both observed

However, B1 still ends with:

  • internal completion_kind=partial
  • plan_items=26 plan_finished=24
  • fixslice_escalation_count=0
  • changed_files=0
  • diff_stat=no changes

So #325 fixed the "repair turn never executes" bug, but the repair-turn path can still terminate with unfinished plan items and no worker adoption.

Reproduction / observed behavior

  • HEAD under test: bd1255a (Merge pull request #326 from Kewton/feature/issue-325-repair-turn-continuation)
  • result dir: commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_105239/
  • run: B1
  • provider/model: Ollama qwen3.5:122b + sidecar qwen3.5:9b
  • context/max output: 65536

External result shape

B1_result.json now shows:

  • valid_run=true
  • exit_class=session_completed
  • command_return_code=0
  • changed_files=0

Source: commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_105239/B1_result.json

What the log shows

  1. The repair turn is actually injected and consumed.
  • pre-exit repair turn injected; continuing for one more LLM turn
  • pre-exit repair turn consumed; terminating loop

Source: commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_105239/B1.log:543,569

  2. On that consumed repair turn, the model emits another ANVIL_PLAN_UPDATE instead of simply closing the remaining work.

Source: commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_105239/B1.log:551-558

  3. Runtime applies part of that update, but also appends fresh unchecked items:
  • retires src/lib/auto-yes-poller.ts
  • supersedes old src/lib/detection/status-detector.ts and src/app/api/worktrees/[id]/current-output/route.ts items
  • plan update pipeline: appending items new_items=2

Source: commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_105239/B1.log:565-568

  4. Immediately after appending those two new pending items, the loop terminates anyway:
  • completion_kind=partial plan_items=26 plan_finished=24
  • telemetry fixslice_escalation_count=0
  • telemetry pre_exit_repair_injected_count=1 pre_exit_repair_consumed_count=1

Source: commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_105239/B1.log:569-574

Why this is distinct from #325

#325 was specifically:

  • inject repair message
  • but break before the model can ever consume it

That part is now fixed.

The new blocker is:

  • the repair turn does run
  • but its response can still expand / refresh the unfinished plan
  • and the loop still terminates immediately after consuming that one response

So the runtime now has a narrower residual bug: the repair-turn response is processed, but any newly appended actionable items created by that response are never given another execution chance.

Source-backed root cause

1. pre_exit_repair_injected currently means "always break after the next response", regardless of what that response did

Main loop:

  • src/app/agentic.rs:1086-1092

Once pre_exit_repair_injected is true, the next response is processed and then the loop unconditionally does:

  • record_pre_exit_repair_consumed()
  • pre-exit repair turn consumed; terminating loop
  • break

There is no post-repair decision branch such as:

  • "the plan is now cleanly finished, so exit"
  • "the plan is still incomplete but repair made progress, so continue boundedly"
  • "the repair response introduced fresh work, so reject that shape explicitly"

The current meaning is just "one repair response was seen, therefore stop."

2. The repair prompt is closure-oriented, but it does not constrain the response shape tightly enough

The injected repair message says:

  • mutate the remaining file, or
  • if no change is needed, retire it with ANVIL_PLAN_UPDATE [x], then
  • emit ANVIL_FINAL
  • shell.exec is not needed

Source:

  • src/app/mod.rs:850-879

That instruction is helpful, but it does not forbid the model from emitting a mixed ANVIL_PLAN_UPDATE that contains:

  • one checked item
  • plus fresh unchecked items

So the runtime enters a closure-only phase, but the protocol still accepts plan-expanding responses.

3. The normal plan-update pipeline still appends unchecked items during the repair turn

Plan update handling:

  • src/app/execution_plan.rs:125-203

The pipeline does:

  1. checked-first retire
  2. filter_unchecked_items(...)
  3. supersede_stale_items(&new_items)
  4. append_items(deduped) when unchecked items remain

That behavior is reasonable during ordinary replanning, but on the pre-exit repair turn it creates a contradiction:

  • runtime says "last chance: close remaining work"
  • runtime still accepts "here are two more unchecked actionable items"
  • runtime then terminates immediately because the repair response has been consumed

In other words, repair mode currently reuses the same replan semantics as normal exploration mode, even though the loop is about to stop.
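
The four pipeline steps listed above can be sketched with a toy plan model. This is an illustration under stated assumptions, not the code in src/app/execution_plan.rs: `Status`, `PlanItem`, `UpdateLine`, `Plan`, and `apply_update` are hypothetical names, the supersede rule (a pending item is superseded by a new unchecked item with the same path) is an assumed simplification, and deduplication is omitted.

```rust
// Toy model of the plan-update pipeline; all names are hypothetical.
#[derive(Debug, Clone, PartialEq)]
enum Status {
    Pending,
    Done,
    Superseded,
}

#[derive(Debug, Clone)]
struct PlanItem {
    path: String,
    status: Status,
}

/// One line of an ANVIL_PLAN_UPDATE: a path plus whether it was marked [x].
struct UpdateLine {
    path: String,
    checked: bool,
}

#[derive(Default)]
struct Plan {
    items: Vec<PlanItem>,
}

impl Plan {
    /// Number of items that still need work.
    fn pending(&self) -> usize {
        self.items.iter().filter(|i| i.status == Status::Pending).count()
    }

    /// Applies an update in the order described above and returns the
    /// number of freshly appended pending items.
    fn apply_update(&mut self, update: &[UpdateLine]) -> usize {
        // 1. checked-first retire
        for line in update.iter().filter(|l| l.checked) {
            for item in self.items.iter_mut() {
                if item.path == line.path && item.status == Status::Pending {
                    item.status = Status::Done;
                }
            }
        }
        // 2. keep only the unchecked lines
        let new_items: Vec<&UpdateLine> = update.iter().filter(|l| !l.checked).collect();
        // 3. supersede stale pending items that a new item replaces
        for line in &new_items {
            for item in self.items.iter_mut() {
                if item.path == line.path && item.status == Status::Pending {
                    item.status = Status::Superseded;
                }
            }
        }
        // 4. append the unchecked items as new pending work
        for line in &new_items {
            self.items.push(PlanItem { path: line.path.clone(), status: Status::Pending });
        }
        new_items.len()
    }
}
```

Nothing in this sketch knows whether the loop is about to stop, which is the point: a mixed update on the repair turn leaves `pending()` above zero even though one item was retired.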

4. This makes partial inevitable whenever the repair response expands the plan

The latest B1 proves the sequence end-to-end:

  1. repair turn injected
  2. repair-turn response arrives
  3. one item retired, two items appended
  4. loop terminates on the consumed-repair branch
  5. completion_kind=partial

That is not just "the model needed one more chance."

It is a control-flow / protocol mismatch:

  • control flow: consumed repair response => stop
  • protocol: consumed repair response may still create new pending work

As long as both remain true, this residual failure family will survive.

Impact

  • Phase2 B1 can look externally healthier (session_completed, exit code 0) while still failing the correctness gate internally (completion_kind=partial)
  • The repair-turn mechanism is now live but still not safe as a closure protocol
  • The latest same-pack suite can be misread as "fixed enough" unless the telemetry is inspected
  • Worker-on runs still produce no live agent.fix_slice / file.rewrite evidence

Fix direction

1. Make pre-exit repair a real closure mode, not ordinary replan mode

When the session enters the pre-exit repair turn, the accepted response space should be narrowed.

Options:

  1. Reject unchecked plan expansion during repair mode

    • If ANVIL_PLAN_UPDATE on the repair turn contains unchecked items, do not append them
    • Instead inject a strict system correction and continue boundedly, or mark the run unresolved explicitly
  2. Allow progress but require another bounded turn when fresh work is introduced

    • If the repair-turn response retires some items but also appends new ones, do not break immediately
    • Continue with a small, explicit "repair-follow-up" budget

Either is better than the current "append and immediately terminate partial."
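
Option 1 could look roughly like the following. It is a minimal sketch, assuming the runtime can tell the update handler which mode it is in; `PlanUpdateMode`, `UpdateOutcome`, `partition_update`, and `needs_correction` are hypothetical names that do not exist in the codebase.

```rust
// Hypothetical sketch of option 1: repair mode accepts [x] retirements only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PlanUpdateMode {
    Normal,
    PreExitRepair,
}

#[derive(Debug, Default, PartialEq)]
struct UpdateOutcome {
    retire: Vec<String>,   // paths marked [x], always honoured
    append: Vec<String>,   // unchecked paths accepted as new work
    rejected: Vec<String>, // unchecked paths refused in repair mode
}

/// Splits an update into what the runtime would apply and what it refuses.
/// Each entry is (path, checked).
fn partition_update(mode: PlanUpdateMode, update: &[(&str, bool)]) -> UpdateOutcome {
    let mut out = UpdateOutcome::default();
    for &(path, checked) in update {
        let bucket = match (checked, mode) {
            (true, _) => &mut out.retire,
            (false, PlanUpdateMode::Normal) => &mut out.append,
            (false, PlanUpdateMode::PreExitRepair) => &mut out.rejected,
        };
        bucket.push(path.to_string());
    }
    out
}

/// True when the caller should inject a strict correction (or mark the run
/// unresolved) instead of treating the repair turn as a clean closure.
fn needs_correction(outcome: &UpdateOutcome) -> bool {
    !outcome.rejected.is_empty()
}
```

Keeping the rejected items in the outcome, rather than silently dropping them, lets the runtime quote them back to the model in the correction message and count them in telemetry.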

2. Replace unconditional consumed-repair break with a post-repair decision

After the repair-turn response is parsed and plan updates are applied:

  • if the plan is cleanly finished: allow exit
  • if the response introduced new pending items: do not exit as though closure succeeded
  • if the response made no closure progress: terminate with an explicit unresolved reason

The key change is that "repair turn consumed" must stop being synonymous with "session should now end."
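
One possible shape for that decision is sketched below. The enum, the snapshot struct, the reason strings, and the follow-up budget are all hypothetical; the real branch in src/app/agentic.rs may need different inputs.

```rust
// Hypothetical post-repair decision; none of these names exist in the source.
#[derive(Debug, PartialEq)]
enum PostRepairDecision {
    ExitComplete,
    ContinueBounded { turns_left: u32 },
    ExitUnresolved(&'static str),
}

/// What the repair-turn response did to the plan.
struct RepairSnapshot {
    retired: usize,
    appended: usize,
    pending_after: usize,
}

/// Decides from the resulting plan state, not from the mere fact that one
/// repair response was consumed.
fn decide_after_repair(s: &RepairSnapshot, follow_up_budget: u32) -> PostRepairDecision {
    if s.pending_after == 0 {
        // Plan is cleanly finished: allow exit.
        return PostRepairDecision::ExitComplete;
    }
    if s.retired == 0 && s.appended == 0 {
        // Nothing changed: stop, but say why.
        return PostRepairDecision::ExitUnresolved("repair turn made no closure progress");
    }
    if follow_up_budget == 0 {
        return PostRepairDecision::ExitUnresolved("repair follow-up budget exhausted");
    }
    // Progress was made or new work was introduced: give it a bounded chance.
    PostRepairDecision::ContinueBounded { turns_left: follow_up_budget - 1 }
}
```

With this shape, the B1 pattern (some items retired, two appended, pending work left) maps to `ContinueBounded` while budget remains, and to an explicit unresolved reason once it runs out.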

3. Add repair-turn-specific telemetry

Current counters tell us only that the repair turn was injected and consumed.

We also need to know:

  • repair turn retired item count
  • repair turn appended item count
  • remaining items before / after repair turn
  • whether exit happened with pending items newly introduced by the repair response

That would make the current failure shape a first-class telemetry signal instead of something reconstructed from logs.
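
As a rough illustration, the counters listed above could be grouped as follows. The struct, its field names, and the `repair_turn_*` log keys are assumptions modelled on the existing `pre_exit_repair_*` telemetry line; they are not existing telemetry keys.

```rust
// Hypothetical counters; field and key names are illustrative only.
#[derive(Debug, Default, Clone, PartialEq)]
struct RepairTurnTelemetry {
    retired_count: u32,
    appended_count: u32,
    remaining_before: u32,
    remaining_after: u32,
}

impl RepairTurnTelemetry {
    /// True when the loop exited while work introduced by the repair
    /// response was still pending.
    fn exited_with_new_pending(&self, loop_exited: bool) -> bool {
        loop_exited && self.appended_count > 0 && self.remaining_after > 0
    }

    /// Renders a single telemetry line in a key=value style.
    fn to_log_line(&self, loop_exited: bool) -> String {
        format!(
            "telemetry repair_turn_retired_count={} repair_turn_appended_count={} \
             repair_turn_remaining_before={} repair_turn_remaining_after={} \
             repair_turn_exit_with_new_pending={}",
            self.retired_count,
            self.appended_count,
            self.remaining_before,
            self.remaining_after,
            self.exited_with_new_pending(loop_exited),
        )
    }
}
```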

4. Add regression coverage for the exact residual branch

Test shape:

  1. incomplete plan near termination
  2. escape hatch injects repair turn
  3. repair-turn response contains:
    • at least one [x] item
    • at least one unchecked item
  4. runtime applies the update

Expected:

  • runtime does not append fresh unchecked items and then immediately terminate as partial
  • either the unchecked items are rejected in repair mode, or the loop continues under a bounded follow-up policy
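
The test shape above can be prototyped against a toy harness before wiring it to the real loop. This sketch assumes the bounded follow-up policy; `Completion`, `Response`, and `run_repair_phase` are hypothetical stand-ins, and a real regression test would drive the actual loop with a scripted provider instead.

```rust
// Toy harness for the regression shape; every name here is hypothetical.
#[derive(Debug, PartialEq)]
enum Completion {
    Complete,
    Unresolved,
}

/// A scripted model response: how many pending items it retires with [x]
/// and how many unchecked items it adds.
struct Response {
    retires: usize,
    adds: usize,
}

/// Runs the repair phase. The first response is the repair turn; later ones
/// are follow-ups, capped by `follow_up_budget`. Returns the completion kind
/// and the number of turns consumed.
fn run_repair_phase(
    mut pending: usize,
    script: &[Response],
    follow_up_budget: usize,
) -> (Completion, usize) {
    let mut turns = 0;
    for response in script.iter().take(1 + follow_up_budget) {
        turns += 1;
        pending = pending.saturating_sub(response.retires) + response.adds;
        if pending == 0 {
            return (Completion::Complete, turns);
        }
        if response.retires == 0 && response.adds == 0 {
            break; // no closure progress: stop instead of spinning
        }
    }
    (Completion::Unresolved, turns)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn repair_turn_that_appends_work_gets_a_follow_up() {
        // Repair turn retires 1 item but appends 2; the follow-up closes both.
        let script = [Response { retires: 1, adds: 2 }, Response { retires: 2, adds: 0 }];
        assert_eq!(run_repair_phase(1, &script, 2), (Completion::Complete, 2));
    }
}
```

The assertion that matters is the turn count: the phase must not stop after turn 1 when the repair response left newly appended items pending.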

Acceptance criteria

  • A consumed pre-exit repair turn cannot append new unfinished plan items and then immediately terminate as partial
  • Repair mode has an explicit policy for unchecked ANVIL_PLAN_UPDATE items (reject, bounded continue, or equivalent)
  • Post-repair termination is decided from the resulting plan state, not just from the fact that one repair response was consumed
  • Telemetry exposes whether the repair turn appended new work and how many items remained before/after it
  • A regression test covers the "repair turn appends 2 new items then exits partial" path
  • Re-running autonomy_v2_p2_fix_slice_v1_20260410_105239 no longer fails for this specific residual repair-turn behavior

Desk-check: if this issue is cleared, does it achieve the original goal?

Short answer: no, not by itself. It is necessary, but not sufficient.

What clearing this issue should achieve

If the above is fixed, the latest B1 family should stop failing in this exact way:

  • repair turn runs
  • repair response appends fresh pending items
  • loop exits immediately as partial

That would remove the residual correctness bug left after #325 and should improve the latest Phase2 red point materially.

Why it is still not enough for the original Phase2 objective

The original Phase2 objective is not only "avoid partial closure drift."
It also requires bench-visible proof that the microtask worker path is actually reachable and integrated.

That is still not true in the latest suite:

  • fixslice_escalation_count=0 on both A1 and B1
  • no live agent.fix_slice hit observed
  • no file.rewrite application observed
  • diff artifact remains empty

Sources:

  • commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_105239/A1.log:355-366
  • commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_105239/B1.log:569-574

So the desk-check conclusion is that this issue should be treated as:

  • a correctness residual required to stabilize Phase2 closure behavior
  • but not the only remaining gate for the original Phase2 goal
