Summary
In the Phase2 short smoke run after Issue334, the worker path itself was observed live.
worker_observed=true
mutation_observed=true
fixslice_escalation_stagnation_count=1
However, A1 could not close the last remaining item and stopped as partial, still inside the late-stage closure loop.
completion_kind=partial
accepted_final_count=0
final_suppressed_with_remaining_targets_count=4
plan_update_count=9
sync_from_touched_files_count=0
In other words, Issue334 achieved read_heavy -> worker escalation -> mutation, but the runtime closure / termination path that should cleanly close the last unfinished item is still broken.
The blocking defect this time is in the Anvil runtime, not in the benchmark harness.
Reproduction / observed behavior
Run:
- pack: autonomy-v2-p2-fix-slice-v1
- expectation: requires_worker_observation
- results dir: commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_173634/
- provider/model: Ollama qwen3.5:122b + sidecar qwen3.5:9b
A1 telemetry:
worker_observed=true
mutation_observed=true
repair_turn_observed=false
completion_kind=partial
plan_update_count=9
sync_from_touched_files_count=0
A1 log highlights:
- fixslice_escalation_triggered (read_heavy)
- agent.fix_slice invoked multiple times
- file.edit success on src/lib/detection/prompt-detector.ts
- plan item completed (all target_files mutated) for the prompt-detector.ts item
- plan-aware final gate: suppressing ANVIL_FINAL (premature)
- repeated late-stage closure mode: remaining=1, injecting closure hint
- repeated follow-up ANVIL_PLAN / ANVIL_PLAN_UPDATE that keep reintroducing unchecked tests/unit/prompt-detector.test.ts work
- no accepted final before run interruption
Observed runtime path:
- read-heavy stagnation
- worker escalation succeeds
- first mutation succeeds
- one plan item is completed
- remaining becomes 1
- closure mode injects hints
- follow-up plan updates append a fresh unchecked last item again
- final gate keeps suppressing ANVIL_FINAL
- session remains partial
Root cause
1. Late-stage closure mode is advisory only; it does not block unchecked plan expansion
remaining==1 currently activates build_late_stage_closure_hint(), but that path only injects stronger guidance.
It does not enforce any structural restriction on follow-up ANVIL_PLAN / ANVIL_PLAN_UPDATE blocks.
By contrast, pre-exit repair closure mode (repair_closure_active) explicitly rejects unchecked appended items.
This asymmetry matters:
- normal late-stage closure mode: unchecked replan/update expansion is still accepted
- repair closure mode: unchecked replan/update expansion is rejected
So once the session reaches remaining=1, the model can keep proposing a corrected unchecked final item, and runtime keeps accepting it.
2. The plan update pipeline can churn the same last target indefinitely
The shared update pipeline does:
- checked-first retire
- supersede stale items
- deduplicate new items
- append deduped items
In the current run, this interacts badly with the final item churn:
- the existing unfinished last item is superseded
- the replacement item survives dedup
- the new unchecked item is appended
- remaining stays 1, but the item identity is refreshed
This is enabled by the current semantics where Superseded items are excluded from dedup matching so corrected replacements can survive. That behavior is correct in general, but it becomes pathological in late-stage closure mode when the replacement is effectively the same last target again.
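The churn described above can be reproduced in a toy model. The sketch below is a simplified stand-in, not Anvil's actual code: `Status`, `PlanItem`, `supersede_stale`, `dedup_new`, `apply_update`, and `simulate_churn` are hypothetical names, and the checked-first retire step is left out for brevity. It assumes items are matched purely by target path.

```rust
// Hypothetical, simplified model of the update pipeline; not Anvil's real types.
#[derive(Clone, Debug, PartialEq)]
enum Status {
    Unchecked,
    Done,
    Superseded,
}

#[derive(Clone, Debug)]
struct PlanItem {
    target: String,
    status: Status,
}

fn item(target: &str, status: Status) -> PlanItem {
    PlanItem { target: target.to_string(), status }
}

/// Marks every unchecked item that an incoming item re-targets as Superseded.
fn supersede_stale(plan: &mut [PlanItem], incoming: &[PlanItem]) {
    for existing in plan.iter_mut() {
        if existing.status == Status::Unchecked
            && incoming.iter().any(|n| n.target == existing.target)
        {
            existing.status = Status::Superseded;
        }
    }
}

/// Drops incoming items that match an existing item, ignoring Superseded ones.
fn dedup_new(plan: &[PlanItem], incoming: Vec<PlanItem>) -> Vec<PlanItem> {
    incoming
        .into_iter()
        .filter(|n| {
            !plan
                .iter()
                .any(|p| p.status != Status::Superseded && p.target == n.target)
        })
        .collect()
}

/// One pipeline pass: supersede, dedup, append.
fn apply_update(plan: &mut Vec<PlanItem>, incoming: Vec<PlanItem>) {
    supersede_stale(plan, &incoming);
    let fresh = dedup_new(plan.as_slice(), incoming);
    plan.extend(fresh);
}

/// Replays `rounds` updates that all restate the same last target.
/// Returns (remaining unchecked items, total plan length).
fn simulate_churn(rounds: usize) -> (usize, usize) {
    let mut plan = vec![
        item("src/lib/detection/prompt-detector.ts", Status::Done),
        item("tests/unit/prompt-detector.test.ts", Status::Unchecked),
    ];
    for _ in 0..rounds {
        let restated = vec![item("tests/unit/prompt-detector.test.ts", Status::Unchecked)];
        apply_update(&mut plan, restated);
    }
    let remaining = plan.iter().filter(|p| p.status == Status::Unchecked).count();
    (remaining, plan.len())
}
```

In this model, every restated update supersedes the open item and appends a fresh one, so `remaining` never leaves 1 while the plan grows by one entry per round.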
3. The touched-file rescue path is not closing late-appended final items
check_plan_final_gate_inner() calls sync_from_touched_files() before evaluating the final gate.
That rescue path exists specifically to mark items done when the file was already touched but the plan state lagged behind.
However, in the failing A1 run:
- sync_from_touched_files_count=0
- repeated final suppressions continue
So the late-appended final item is not being structurally reconciled against already-observed file changes.
Whether the exact local cause is:
- the last item truly was never mutated, or
- the item was reintroduced after mutation and should have been retired/deduped/reconciled,
the runtime defect is the same: late-stage closure allows endless last-item churn without a hard closure rule.
Why this belongs to Anvil
localllm-test is not creating this loop.
The benchmark runner only:
- launches the suite
- reads telemetry/logs/artifacts
- classifies the run result
The loop is already visible inside the raw Anvil runtime behavior:
- repeated late-stage closure mode: remaining=1
- repeated ANVIL_FINAL suppression
- repeated plan updates
- no accepted final
So this issue belongs to Anvil runtime plan/closure logic, not harness-side wiring.
Concrete problem areas
Primary hotspots:
- src/app/execution_plan.rs
  - apply_plan_update_pipeline()
  - check_plan_final_gate_inner()
  - inject_plan_turn_guidance()
- src/contracts/mod.rs
  - deduplicate_new_items()
  - supersede_stale_items()
  - sync_from_touched_files()
- src/app/agentic.rs
  - follow-up ANVIL_PLAN / ANVIL_PLAN_UPDATE handling around replan / update application
  - pre-exit repair closure handling, for comparison with late-stage closure semantics
Fix direction
1. Add a structural closure guard for remaining==1
When late-stage closure mode is active, runtime should stop treating unchecked follow-up plan expansion as a normal replan path.
Candidate policy:
- if remaining==1 and the new unchecked items only restate / refine the current last target, reject them instead of appending
- or, activate the same unchecked-item rejection policy used by repair_closure_active
- or, auto-upgrade normal closure mode into a stricter closure state after the first final suppression at remaining==1
The important property is: late-stage closure must become structurally narrowing, not advisory-only.
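One way to picture the three candidate policies together is a small guard that runs before any follow-up plan block is applied. This is a sketch under assumptions, not the existing runtime API: `ClosureMode`, `ProposedItem`, `GuardDecision`, `guard_plan_update`, and `next_mode` are hypothetical names, and "restates the last target" is approximated by an exact target-path match.

```rust
// Hypothetical sketch; none of these names exist in Anvil as far as this issue shows.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ClosureMode {
    Off,
    /// remaining == 1: restatements of the last target are rejected.
    LateStage,
    /// Same policy as repair closure: every unchecked item is rejected.
    Strict,
}

#[derive(Clone, Debug, PartialEq)]
struct ProposedItem {
    target: String,
    checked: bool,
}

fn unchecked(target: &str) -> ProposedItem {
    ProposedItem { target: target.to_string(), checked: false }
}

#[derive(Debug, PartialEq)]
enum GuardDecision {
    Accept,
    RejectUnchecked { rejected: usize },
}

/// Decides whether a follow-up plan block may be applied in the current mode.
fn guard_plan_update(
    mode: ClosureMode,
    last_target: Option<&str>,
    proposed: &[ProposedItem],
) -> GuardDecision {
    let rejected = proposed
        .iter()
        .filter(|p| !p.checked)
        .filter(|p| match mode {
            ClosureMode::Off => false,
            ClosureMode::LateStage => last_target == Some(p.target.as_str()),
            ClosureMode::Strict => true,
        })
        .count();
    if rejected == 0 {
        GuardDecision::Accept
    } else {
        GuardDecision::RejectUnchecked { rejected }
    }
}

/// Auto-upgrade: the first final suppression at remaining == 1 switches to Strict.
fn next_mode(mode: ClosureMode, remaining: usize, final_suppressed: bool) -> ClosureMode {
    match (mode, remaining, final_suppressed) {
        (ClosureMode::Strict, _, _) => ClosureMode::Strict,
        (_, 1, true) => ClosureMode::Strict,
        (_, 1, false) => ClosureMode::LateStage,
        _ => ClosureMode::Off,
    }
}
```

The point of the sketch is that the decision is made structurally by the runtime, so a noisy follow-up block cannot re-create the last blocker regardless of how the model words it.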
2. Reconcile late-appended items against known file state before append/final-gate churn
If a newly proposed last item matches:
- an already touched file, or
- an already mutated target in the current plan lineage,
it should be retired or deduped instead of appended as a fresh actionable blocker.
Possible implementations:
- before append_items(), reconcile deduped items against working_memory.touched_files
- treat same-target replacements in closure mode as already satisfied when the target was already mutated
- add a closure-mode-specific dedup rule that does not ignore superseded ancestry for the final remaining target
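As an illustration of the first option, the reconciliation step could split deduped items into those that are still actionable and those whose targets were all touched already. This is a minimal sketch with assumed shapes: `NewItem`, `reconcile_before_append`, and the `HashSet<String>` representation of touched files are hypothetical, and the real `working_memory.touched_files` may look different.

```rust
use std::collections::HashSet;

// Hypothetical item shape; the real plan item type may carry more fields.
#[derive(Clone, Debug, PartialEq)]
struct NewItem {
    description: String,
    target_files: Vec<String>,
}

fn item(description: &str, files: &[&str]) -> NewItem {
    NewItem {
        description: description.to_string(),
        target_files: files.iter().map(|f| f.to_string()).collect(),
    }
}

fn touched(files: &[&str]) -> HashSet<String> {
    files.iter().map(|f| f.to_string()).collect()
}

/// Splits deduped items into (still actionable, already satisfied).
/// An item counts as satisfied only when every one of its targets was touched.
fn reconcile_before_append(
    deduped: Vec<NewItem>,
    touched_files: &HashSet<String>,
) -> (Vec<NewItem>, Vec<NewItem>) {
    deduped.into_iter().partition(|candidate| {
        // An item without targets cannot be proven satisfied, so it stays actionable.
        candidate.target_files.is_empty()
            || !candidate
                .target_files
                .iter()
                .all(|f| touched_files.contains(f))
    })
}
```

Only the first half of the returned pair would be appended; the second half would be retired instead of becoming a fresh blocker. Whether "touched" is a strong enough signal for "satisfied" is a policy question this sketch does not settle.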
3. Keep Superseded-skip dedup for general replans, but narrow it in closure mode
The current Superseded skip exists for a good reason, so removing it globally is risky.
A safer approach is to scope the special handling:
- preserve current semantics during ordinary replans
- tighten semantics only when remaining==1 or closure mode is active
That avoids regressing the earlier fixes while stopping the last-item churn.
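A rough sketch of what mode-scoped semantics might look like follows. It is an assumption-laden illustration, not the real `deduplicate_new_items()` or `supersede_stale_items()`: the types, `apply_update_scoped`, and `run` are hypothetical, and items are matched by target path only. One design choice is specific to this sketch: in closure mode the open item is not superseded at all, because superseding it and then dropping the replacement would leave zero open items without the work being done.

```rust
// Hypothetical sketch of mode-scoped supersede/dedup; not Anvil's real pipeline.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ItemStatus {
    Unchecked,
    Done,
    Superseded,
}

#[derive(Clone, Debug, PartialEq)]
struct Item {
    target: String,
    status: ItemStatus,
}

fn item(target: &str, status: ItemStatus) -> Item {
    Item { target: target.to_string(), status }
}

fn apply_update_scoped(plan: &mut Vec<Item>, incoming: Vec<Item>, closure_active: bool) {
    for new in incoming {
        let open_same_target = plan
            .iter()
            .position(|p| p.status == ItemStatus::Unchecked && p.target == new.target);
        match (open_same_target, closure_active) {
            // Closure mode: a restatement is a duplicate of the open item.
            // The original stays in place as the single remaining blocker.
            (Some(_), true) => {}
            // Ordinary replan: supersede the stale item so the correction survives.
            (Some(i), false) => {
                plan[i].status = ItemStatus::Superseded;
                plan.push(new);
            }
            // No open item for this target: append unless it is already done.
            (None, _) => {
                let already_done = plan
                    .iter()
                    .any(|p| p.status == ItemStatus::Done && p.target == new.target);
                if !already_done {
                    plan.push(new);
                }
            }
        }
    }
}

/// Replays `rounds` restatements of the last target.
/// Returns (open items, total plan length).
fn run(closure_active: bool, rounds: usize) -> (usize, usize) {
    let mut plan = vec![
        item("src/lib/detection/prompt-detector.ts", ItemStatus::Done),
        item("tests/unit/prompt-detector.test.ts", ItemStatus::Unchecked),
    ];
    for _ in 0..rounds {
        let restated = vec![item("tests/unit/prompt-detector.test.ts", ItemStatus::Unchecked)];
        apply_update_scoped(&mut plan, restated, closure_active);
    }
    let open = plan.iter().filter(|p| p.status == ItemStatus::Unchecked).count();
    (open, plan.len())
}
```

With `closure_active` set to false the plan grows on every restatement, matching the current general behavior; with it set to true the plan length stays fixed and the original last item keeps its identity.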
4. Add regression coverage for the exact A1 failure shape
Needed regression:
- read-heavy stagnation reaches worker escalation
- one target mutates successfully
- plan reaches remaining=1
- follow-up ANVIL_PLAN / ANVIL_PLAN_UPDATE restates the same final target
- runtime does not append indefinite fresh blockers
- session can either:
  - finish cleanly with accepted final, or
  - enter strict repair closure that rejects unchecked expansion and terminates deterministically
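The shape of such a regression can be sketched with a scripted "noisy model" that restates the same final target every turn and then attempts a final. Everything here is a toy assumption: `Outcome`, `Report`, `run_scripted`, the turn budget, and the rejection limit are hypothetical, and the clean accepted-final branch is not modelled. A real test would drive the Anvil runtime with a scripted provider instead.

```rust
// Hypothetical regression-shape harness; it models only the strict-closure branch.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// Strict closure rejected enough restatements and ended the session.
    StrictClosureTerminated,
    /// The loop ran until the turn budget was used up (the A1 failure shape).
    TurnBudgetExhausted,
}

#[derive(Debug, PartialEq)]
struct Report {
    outcome: Outcome,
    appended_blockers: usize,
}

fn run_scripted(
    strict_after_first_suppression: bool,
    turn_budget: usize,
    rejection_limit: usize,
) -> Report {
    let mut strict = false;
    let mut appended = 0;
    let mut rejected = 0;
    for _ in 0..turn_budget {
        // Scripted model turn: restate the same final target as an unchecked item.
        if strict {
            rejected += 1;
            if rejected >= rejection_limit {
                return Report {
                    outcome: Outcome::StrictClosureTerminated,
                    appended_blockers: appended,
                };
            }
        } else {
            appended += 1;
        }
        // The scripted model then emits a final; remaining is still 1, so the
        // gate suppresses it. With the fix, that suppression enables strict mode.
        if strict_after_first_suppression {
            strict = true;
        }
    }
    Report {
        outcome: Outcome::TurnBudgetExhausted,
        appended_blockers: appended,
    }
}
```

The assertion that matters is the bound: with the guard enabled, the number of appended blockers stays constant no matter how large the turn budget is, and the session ends with a deterministic outcome.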
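Candidate acceptance criteria
- the remaining==1 late-stage closure path cannot append the same effective final target indefinitely
- unchecked follow-up ANVIL_PLAN / ANVIL_PLAN_UPDATE items are rejected or structurally reconciled when closure mode is active
- sync_from_touched_files() or equivalent closure reconciliation advances the final item in the late-stage churn case
- the run no longer ends with completion_kind=partial after worker-mediated mutation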
Desk-check: if this issue is cleared, does it achieve the original goal?
Likely yes, with rerun confirmation still required.
Issue334 already moved the system past the original worker-path blocker:
- worker path is now live-observed
- mutation is now live-observed
The remaining blocker seen in the latest Phase2 run is closure stability.
Therefore, if this issue is fixed as described, the previously missing piece is removed:
- the session should be able to convert worker-mediated progress into a clean completion rather than looping at remaining=1
That means clearing this issue should make the original Phase2 objective practically reachable:
- live worker observation
- real mutation evidence
- valid completion instead of partial
Strictly speaking, final proof still requires retest evidence. But unlike the earlier issues, this now appears to be the last major runtime blocker on the critical path.
Issue quality check / brush-up
The main risk in this issue is blaming only prompt/model behavior. That would be too weak.
The stronger and more actionable framing is:
- the model may emit noisy follow-up replans,
- but runtime late-stage closure currently has no structural rule to stop that noise from re-creating the last blocker,
- therefore the bug is in runtime closure semantics, not merely in model quality.
This issue is intentionally framed around that runtime contract so the fix can be validated by regression tests and not by hoping for a luckier model sample.