Phase2: late-stage closure loop keeps appending last plan item after Issue334 #336

@Kewton

Summary

In the Phase2 short smoke run after Issue334, the worker path itself was observed live.

  • worker_observed=true
  • mutation_observed=true
  • fixslice_escalation_stagnation_count=1

However, A1 could not close the last remaining item and stopped as partial, still inside the late-stage closure loop.

  • completion_kind=partial
  • accepted_final_count=0
  • final_suppressed_with_remaining_targets_count=4
  • plan_update_count=9
  • sync_from_touched_files_count=0

In other words, Issue334 achieved read_heavy -> worker escalation -> mutation, but the runtime closure / termination path that should cleanly close the last unfinished item is still broken.

The blocking defect this time is in the Anvil runtime, not in the benchmark harness.

Reproduction / observed behavior

Run:

  • pack: autonomy-v2-p2-fix-slice-v1
  • expectation: requires_worker_observation
  • results dir: commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_173634/
  • provider/model: Ollama qwen3.5:122b + sidecar qwen3.5:9b

A1 telemetry:

  • worker_observed=true
  • mutation_observed=true
  • repair_turn_observed=false
  • completion_kind=partial
  • plan_update_count=9
  • sync_from_touched_files_count=0

A1 log highlights:

  1. fixslice_escalation_triggered (read_heavy)
  2. agent.fix_slice invoked multiple times
  3. file.edit success on src/lib/detection/prompt-detector.ts
  4. plan item completed (all target_files mutated) for the prompt-detector.ts item
  5. plan-aware final gate: suppressing ANVIL_FINAL (premature)
  6. repeated late-stage closure mode: remaining=1, injecting closure hint
  7. repeated follow-up ANVIL_PLAN / ANVIL_PLAN_UPDATE that keep reintroducing unchecked tests/unit/prompt-detector.test.ts work
  8. no accepted final before run interruption

Observed runtime path:

  1. read-heavy stagnation
  2. worker escalation succeeds
  3. first mutation succeeds
  4. one plan item is completed
  5. remaining becomes 1
  6. closure mode injects hints
  7. follow-up plan updates append a fresh unchecked last item again
  8. final gate keeps suppressing ANVIL_FINAL
  9. session remains partial

Root cause

1. Late-stage closure mode is advisory only; it does not block unchecked plan expansion

remaining==1 currently activates build_late_stage_closure_hint(), but that path only injects stronger guidance.
It does not enforce any structural restriction on follow-up ANVIL_PLAN / ANVIL_PLAN_UPDATE blocks.

By contrast, pre-exit repair closure mode (repair_closure_active) explicitly rejects unchecked appended items.

This asymmetry matters:

  • normal late-stage closure mode: unchecked replan/update expansion is still accepted
  • repair closure mode: unchecked replan/update expansion is rejected

So once the session reaches remaining=1, the model can keep proposing a corrected unchecked final item, and runtime keeps accepting it.
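
The asymmetry can be written down as a small decision function. This is only a sketch of the behavior described above: `ClosureMode` and `accepts_unchecked_append` are hypothetical names, not identifiers from the Anvil codebase.

```rust
/// Hypothetical model of the closure modes discussed in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ClosureMode {
    /// Ordinary execution, more than one item remaining.
    Normal,
    /// remaining == 1: only a closure hint is injected.
    LateStage,
    /// Pre-exit repair closure (repair_closure_active).
    RepairClosure,
}

/// Mirrors the behavior reported in this issue: only repair closure
/// rejects unchecked appended items; late-stage closure still accepts them.
pub fn accepts_unchecked_append(mode: ClosureMode) -> bool {
    !matches!(mode, ClosureMode::RepairClosure)
}
```

Under this model `accepts_unchecked_append(ClosureMode::LateStage)` is still `true`, which is the gap the rest of this issue is about.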

2. The plan update pipeline can churn the same last target indefinitely

The shared update pipeline does:

  • checked-first retire
  • supersede stale items
  • deduplicate new items
  • append deduped items

In the current run, this interacts badly with the final item churn:

  • the existing unfinished last item is superseded
  • the replacement item survives dedup
  • the new unchecked item is appended
  • remaining stays 1, but the item identity is refreshed

This is enabled by the current semantics where Superseded items are excluded from dedup matching so corrected replacements can survive. That behavior is correct in general, but it becomes pathological in late-stage closure mode when the replacement is effectively the same last target again.
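
A toy model of the churn, under simplifying assumptions: items are matched by a single target path, and the supersede and dedup steps are reduced to that comparison. `Plan`, `PlanItem`, `Status`, and `apply_update` are illustrative stand-ins, not the real types or signatures in `src/contracts/mod.rs`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Pending,
    Done,
    Superseded,
}

#[derive(Debug, Clone)]
pub struct PlanItem {
    pub id: u32,
    pub target: String,
    pub status: Status,
}

#[derive(Debug, Default)]
pub struct Plan {
    pub items: Vec<PlanItem>,
    next_id: u32,
}

impl Plan {
    /// Append an item with a fresh id and return that id.
    pub fn push(&mut self, target: &str, status: Status) -> u32 {
        let id = self.next_id;
        self.next_id += 1;
        self.items.push(PlanItem { id, target: target.to_string(), status });
        id
    }

    /// Number of items that still block the final gate.
    pub fn remaining(&self) -> usize {
        self.items.iter().filter(|i| i.status == Status::Pending).count()
    }

    /// Simplified update: supersede stale items, dedup, then append.
    pub fn apply_update(&mut self, proposed: &[&str]) {
        for &target in proposed {
            // Supersede the pending item that the proposal replaces.
            for item in self.items.iter_mut() {
                if item.status == Status::Pending && item.target == target {
                    item.status = Status::Superseded;
                }
            }
            // Dedup skips Superseded items, so the replacement survives.
            let duplicate = self
                .items
                .iter()
                .any(|i| i.status != Status::Superseded && i.target == target);
            if !duplicate {
                self.push(target, Status::Pending);
            }
        }
    }
}
```

Applying the same update for `tests/unit/prompt-detector.test.ts` repeatedly leaves `remaining()` at 1 while the pending item's `id` changes on every call, matching the identity refresh described above.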

3. The touched-file rescue path is not closing late-appended final items

check_plan_final_gate_inner() calls sync_from_touched_files() before evaluating the final gate.
That rescue path exists specifically to mark items done when the file was already touched but the plan state lagged behind.

However, in the failing A1 run:

  • sync_from_touched_files_count=0
  • repeated final suppressions continue

So the late-appended final item is not being structurally reconciled against already-observed file changes.

Whether the exact local cause is:

  • the last item truly was never mutated, or
  • the item was reintroduced after mutation and should have been retired/deduped/reconciled,

the runtime defect is the same: late-stage closure allows endless last-item churn without a hard closure rule.
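
For reference, the rescue path's intent can be sketched as follows. The assumption here is that an item counts as done once every one of its target files has been touched; `sync_from_touched_files_sketch` is a hypothetical stand-in and does not reproduce the real `sync_from_touched_files()` signature.

```rust
use std::collections::HashSet;

#[derive(Debug)]
pub struct PlanItem {
    pub target_files: Vec<String>,
    pub done: bool,
}

/// Marks pending items done when all of their target files were touched.
/// Returns how many items were advanced (the analogue of
/// sync_from_touched_files_count in the telemetry).
pub fn sync_from_touched_files_sketch(
    items: &mut [PlanItem],
    touched: &HashSet<String>,
) -> usize {
    let mut synced = 0;
    for item in items.iter_mut().filter(|i| !i.done) {
        let all_touched = !item.target_files.is_empty()
            && item.target_files.iter().all(|f| touched.contains(f));
        if all_touched {
            item.done = true;
            synced += 1;
        }
    }
    synced
}
```

In this model a count of 0 is expected when the last item's file was never touched, so the counter alone does not distinguish the two cases listed above; either way nothing else stops the churn.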

Why this belongs to Anvil

localllm-test is not creating this loop.
The benchmark runner only:

  • launches the suite
  • reads telemetry/logs/artifacts
  • classifies the run result

The loop is already visible inside the raw Anvil runtime behavior:

  • repeated late-stage closure mode: remaining=1
  • repeated ANVIL_FINAL suppression
  • repeated plan updates
  • no accepted final

So this issue belongs to Anvil runtime plan/closure logic, not harness-side wiring.

Concrete problem areas

Primary hotspots:

  • src/app/execution_plan.rs
    • apply_plan_update_pipeline()
    • check_plan_final_gate_inner()
    • inject_plan_turn_guidance()
  • src/contracts/mod.rs
    • deduplicate_new_items()
    • supersede_stale_items()
    • sync_from_touched_files()
  • src/app/agentic.rs
    • follow-up ANVIL_PLAN / ANVIL_PLAN_UPDATE handling around replan / update application
    • pre-exit repair closure handling, for comparison with late-stage closure semantics

Fix direction

1. Add a structural closure guard for remaining==1

When late-stage closure mode is active, runtime should stop treating unchecked follow-up plan expansion as a normal replan path.

Candidate policy:

  • if remaining==1 and the new unchecked items only restate / refine the current last target, reject them instead of appending
  • or, activate the same unchecked-item rejection policy used by repair_closure_active
  • or, auto-upgrade normal closure mode into a stricter closure state after the first final suppression at remaining==1

The important property is: late-stage closure must become structurally narrowing, not advisory-only.
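
A minimal sketch of the first candidate policy, assuming plan items can be compared by a single target path. `ProposedItem`, `UpdateDecision`, and `guard_update` are hypothetical names used only for illustration.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum UpdateDecision {
    Accept,
    RejectRestatesLastTarget,
}

pub struct ProposedItem<'a> {
    pub target: &'a str,
    pub checked: bool,
}

/// Rejects a follow-up update when remaining == 1 and every unchecked
/// proposed item merely restates the current last target.
pub fn guard_update(
    remaining: usize,
    last_target: &str,
    proposed: &[ProposedItem<'_>],
) -> UpdateDecision {
    if remaining != 1 {
        return UpdateDecision::Accept;
    }
    let unchecked: Vec<&ProposedItem<'_>> =
        proposed.iter().filter(|p| !p.checked).collect();
    if !unchecked.is_empty() && unchecked.iter().all(|p| p.target == last_target) {
        UpdateDecision::RejectRestatesLastTarget
    } else {
        UpdateDecision::Accept
    }
}
```

Exact path equality is the narrowest possible reading of "restate / refine"; a real guard would likely need a looser notion of "same effective target".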

2. Reconcile late-appended items against known file state before append/final-gate churn

If a newly proposed last item matches:

  • an already touched file, or
  • an already mutated target in the current plan lineage,

it should be retired or deduped instead of appended as a fresh actionable blocker.

Possible implementations:

  • before append_items(), reconcile deduped items against working_memory.touched_files
  • treat same-target replacements in closure mode as already satisfied when the target was already mutated
  • add a closure-mode-specific dedup rule that does not ignore superseded ancestry for the final remaining target
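
The first option could look roughly like this. It is a sketch that assumes proposed items and touched files are both plain path strings; `reconcile_before_append` is a hypothetical helper, and the real shape of `working_memory.touched_files` may differ.

```rust
use std::collections::HashSet;

/// Splits deduped proposals into (to_append, already_satisfied).
/// Items whose target was already touched are not appended as fresh
/// blockers; the caller can retire them instead.
pub fn reconcile_before_append(
    proposed: Vec<String>,
    touched: &HashSet<String>,
) -> (Vec<String>, Vec<String>) {
    proposed.into_iter().partition(|target| !touched.contains(target))
}
```

Only the first vector would then be passed on to `append_items()`.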

3. Keep Superseded-skip dedup for general replans, but narrow it in closure mode

The current Superseded skip exists for a good reason, so removing it globally is risky.

A safer approach is to scope the special handling:

  • preserve current semantics during ordinary replans
  • tighten semantics only when remaining==1 or closure mode is active

That avoids regressing the earlier fixes while stopping the last-item churn.
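
One way to scope the rule is to pass the closure state into the dedup predicate. The sketch below assumes target-path matching; `Item`, `Status`, and `is_duplicate` are illustrative and do not mirror the actual `deduplicate_new_items()` signature.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Pending,
    Done,
    Superseded,
}

pub struct Item {
    pub target: String,
    pub status: Status,
}

/// Ordinary replans ignore Superseded items so corrected replacements
/// survive; in closure mode, superseded ancestry also counts as a match.
pub fn is_duplicate(existing: &[Item], target: &str, closure_active: bool) -> bool {
    existing
        .iter()
        .any(|i| i.target == target && (closure_active || i.status != Status::Superseded))
}
```

A real implementation would also need to avoid superseding the existing last item when its replacement is dropped as a duplicate, so that the original blocker stays in place; that step is left out here.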

4. Add regression coverage for the exact A1 failure shape

Needed regression:

  • read-heavy stagnation reaches worker escalation
  • one target mutates successfully
  • plan reaches remaining=1
  • follow-up ANVIL_PLAN / ANVIL_PLAN_UPDATE restates the same final target
  • runtime does not append indefinite fresh blockers
  • session can either:
    • finish cleanly with accepted final, or
    • enter strict repair closure that rejects unchecked expansion and terminates deterministically

Candidate acceptance criteria

  • A remaining==1 late-stage closure path cannot append the same effective final target indefinitely
  • unchecked follow-up ANVIL_PLAN / ANVIL_PLAN_UPDATE items are rejected or structurally reconciled when closure mode is active
  • same-target final-item replacements are deduped/retired when prior touched/mutated evidence already exists
  • sync_from_touched_files() or equivalent closure reconciliation advances the final item in the late-stage churn case
  • a regression test reproduces the A1 pattern and proves deterministic closure
  • Phase2 retest no longer stalls in completion_kind=partial after worker-mediated mutation

Desk-check: if this issue is cleared, does it achieve the original goal?

Likely yes, with rerun confirmation still required.

Issue334 already moved the system past the original worker-path blocker:

  • worker path is now live-observed
  • mutation is now live-observed

The remaining blocker seen in the latest Phase2 run is closure stability.
Therefore, if this issue is fixed as described, the previously missing piece is removed:

  • the session should be able to convert worker-mediated progress into a clean completion rather than looping at remaining=1

That means clearing this issue should make the original Phase2 objective practically reachable:

  • live worker observation
  • real mutation evidence
  • valid completion instead of partial

Strictly speaking, final proof still requires retest evidence. But unlike the earlier issues, this now appears to be the last major runtime blocker on the critical path.

Issue quality check / brush-up

The main risk in this issue is blaming only prompt/model behavior. That would be too weak.
The stronger and more actionable framing is:

  • the model may emit noisy follow-up replans,
  • but runtime late-stage closure currently has no structural rule to stop that noise from re-creating the last blocker,
  • therefore the bug is in runtime closure semantics, not merely in model quality.

This issue is intentionally framed around that runtime contract so the fix can be validated by regression tests and not by hoping for a luckier model sample.
