Summary
Phase2 short smoke after Issue332 still fails the worker-required gate.
The latest run on Anvil 95727f76ac512ddcf4a7f973e7449f59c544b63b improved termination behavior to session_completed / complete_unverified, but it still did not produce live worker-path evidence:
- worker_observed=false
- fixslice_escalation_count=0
- fixslice_escalation_stagnation_count=0
- changed_files=0
- diff_stat=no changes
- repair_turn_observed=true
- runner result: exit_class=expectation_mismatch
This is a runtime behavior problem in Anvil, not a benchmark harness misclassification.
Reproduction / observed behavior
Latest suite:
- results dir: commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_162147/
- pack: autonomy-v2-p2-fix-slice-v1
- expectation: requires_worker_observation
- provider/model: Ollama qwen3.5:122b + sidecar qwen3.5:9b
A1 external result
A1_result.json:
valid_run=true
exit_class=expectation_mismatch
command_return_code=0
changed_files=0
diff_stat=no changes
worker_observed=false
repair_turn_observed=true
mutation_observed=false
expectation_mismatch_reason="pack requires worker_observed=true but worker path was not observed (repair_turn_observed=true)"
A1 telemetry / log highlights
telemetry/A1/..._telemetry.json:
completion_kind=complete_unverified
worker_observed=false
repair_turn_observed=true
fixslice_escalation_count=0
fixslice_escalation_stagnation_count=0
A1.log:
- repeated stagnation detected; forced_mode_active=true
- pre-exit repair turn injected
- pre-exit repair turn consumed; plan cleanly finished
- pack validation gate: satisfied
So the runtime is still taking the path:
- repeated read / audit drift
- stagnation warnings
- no worker escalation
- late repair-turn salvage
- complete_unverified
- zero persistent diff
Root cause
1. The broadened Issue332 escalation is still gated on edit failures
The stagnation-driven worker escalation in src/app/agentic.rs only fires when:
- worker_observed is still false
- should_escalate_for_stagnation(...) returns true
But should_escalate_for_stagnation(...) depends on edit_fail_tracker.total_failures(), not on stagnation alone.
Relevant code:
- src/app/agentic.rs around the Issue332 block
- src/app/edit_fail_tracker.rs::should_escalate_for_stagnation
- src/app/agentic.rs edit-fail tracker updates for failed file.edit / file.edit_anchor / file.rewrite
This means the current policy only broadens escalation from:
- same-path repeated failures
to:
- cross-path repeated edit failures
It does not cover the actual Phase2 drift pattern we just observed:
- repeated reads
- repeated replans / plan-aware suppression
- stagnation score high
- almost no failed edit attempts before pre-exit repair
In A1 there was only one visible file.edit, and it was the late repair-turn attempt. So total_failures never crossed the Issue332 threshold, and agent.fix_slice was never forced.
This is the primary blocking root cause.
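The gating described in this section can be pictured with a minimal sketch. Only the names edit_fail_tracker, total_failures, and should_escalate_for_stagnation come from this report; the struct layout, the function signature, and the threshold value are hypothetical stand-ins, not the actual code in src/app/edit_fail_tracker.rs.

```rust
// Hypothetical model of the Issue332 gate; not the real Anvil code.
use std::collections::HashMap;

/// Assumed threshold; the real Issue332 value is not stated in this report.
const ASSUMED_FAILURE_THRESHOLD: usize = 3;

#[derive(Default)]
struct EditFailTracker {
    // Failed edit attempts keyed by file path.
    failures_by_path: HashMap<String, usize>,
}

impl EditFailTracker {
    /// Records one failed edit attempt against `path`.
    fn record_failure(&mut self, path: &str) {
        *self.failures_by_path.entry(path.to_string()).or_insert(0) += 1;
    }

    /// Sums failures across all paths (the cross-path broadening).
    fn total_failures(&self) -> usize {
        self.failures_by_path.values().sum()
    }
}

/// Escalates only when stagnation coincides with enough edit failures.
fn should_escalate_for_stagnation(stagnating: bool, tracker: &EditFailTracker) -> bool {
    stagnating && tracker.total_failures() >= ASSUMED_FAILURE_THRESHOLD
}
```

Under this model, a session that stagnates on reads and records a single late edit failure never reaches the threshold, which matches the A1 behavior described above.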
2. Runtime-side requires_worker_observation semantics are still too broad
In src/contracts/mod.rs, runtime validation for RequiresWorkerObservation currently returns Satisfied when any of these is true:
- worker_observed
- repair_turn_observed
- mutation_observed
That means the runtime still treats repair-only or mutation-only salvage as satisfying a worker-required pack.
This is directly visible in the latest run:
- runner result: pack_validation_result=mismatch
- runtime log: pack validation gate: satisfied
So the runtime and benchmark still disagree on the definition of worker success.
This is a second runtime-side defect in Anvil.
Why this belongs to Anvil
The latest benchmark runner is doing the intended strict classification:
- it requires worker_observed=true for requires_worker_observation
- it explicitly marks repair-only salvage as mismatch
The benchmark did not hide the defect. It exposed two runtime gaps:
- worker escalation policy does not match the real drift pattern
- runtime pack validation semantics still accept non-worker salvage
Fix direction
1. Add a read-drift / no-mutation escalation path that does not require edit failures
Issue332 solved only the cross-path edit-failure case. Phase2 still needs a second trigger for the actual observed pattern:
- stagnation score high
- repeated read / search loops
- no mutation for N turns
- plan repair / final suppression already happened
Candidate implementations:
- Escalate to agent.fix_slice when stagnation score stays above threshold for M turns and no mutation has been observed.
- Escalate when repeated-read warnings or forced workset transitions cross a threshold under worker-on Phase2 conditions.
- Escalate at the point where pre_exit_repair would otherwise be injected, instead of allowing repair-only salvage to satisfy the loop.
The key requirement is that worker escalation must no longer depend exclusively on edit_fail_tracker.total_failures().
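The first candidate above (stagnation held for M turns with no mutation) could look roughly like the following. This is a sketch under stated assumptions: TurnSignals, ReadDriftTracker, and both threshold parameters are hypothetical names introduced for illustration and do not exist in Anvil under these names.

```rust
// Hypothetical second escalation trigger; names and thresholds are assumptions.

/// Per-turn signals the agentic loop is assumed to have available.
struct TurnSignals {
    stagnation_score: f32,
    mutation_observed: bool,
}

/// Counts consecutive turns that are stagnant and mutation-free.
#[derive(Default)]
struct ReadDriftTracker {
    stagnant_turns: u32,
}

impl ReadDriftTracker {
    /// Updates the streak: any mutation or a score below threshold resets it.
    fn observe(&mut self, turn: &TurnSignals, score_threshold: f32) {
        if turn.mutation_observed || turn.stagnation_score < score_threshold {
            self.stagnant_turns = 0;
        } else {
            self.stagnant_turns += 1;
        }
    }

    /// Escalates without consulting any edit-failure counter.
    fn should_escalate(&self, worker_observed: bool, max_stagnant_turns: u32) -> bool {
        !worker_observed && self.stagnant_turns >= max_stagnant_turns
    }
}
```

The point of the design is that the decision reads only stagnation and mutation signals, so a read-only loop can trigger it even when no edit has ever failed.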
2. Tighten runtime pack validation semantics
RequiresWorkerObservation should only be Satisfied when worker_observed=true.
Repair-only or mutation-only salvage may still be tracked for diagnostics, but they should not satisfy the worker-required gate.
That change should be made in the runtime telemetry validation path so:
- runtime log
- telemetry artifact
- benchmark result
all agree on the same outcome.
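The difference between the current and the proposed gate semantics can be sketched as two small functions. GateResult, Observations, validate_current, and validate_strict are hypothetical names used only for illustration; the real types in src/contracts/mod.rs may be shaped differently.

```rust
// Hypothetical model of the runtime gate; type and function names are assumptions.

#[derive(Debug, PartialEq)]
enum GateResult {
    Satisfied,
    Mismatch,
}

struct Observations {
    worker_observed: bool,
    repair_turn_observed: bool,
    mutation_observed: bool,
}

/// Behavior as described in this report: any one signal satisfies the gate.
fn validate_current(obs: &Observations) -> GateResult {
    if obs.worker_observed || obs.repair_turn_observed || obs.mutation_observed {
        GateResult::Satisfied
    } else {
        GateResult::Mismatch
    }
}

/// Proposed behavior: only a real worker observation satisfies the gate.
fn validate_strict(obs: &Observations) -> GateResult {
    if obs.worker_observed {
        GateResult::Satisfied
    } else {
        GateResult::Mismatch
    }
}
```

With the A1 flags (worker_observed=false, repair_turn_observed=true, mutation_observed=false), validate_current yields Satisfied while validate_strict yields Mismatch, which is the outcome the benchmark runner already reports.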
3. Keep repair-salvage telemetry, but treat it as a separate outcome
The existing counters are still useful:
- repair_turn_observed
- fixslice_escalation_repair_salvage_count
But they should stay diagnostic-only and must not stand in for worker success.
4. Add regression coverage for the actual read-heavy drift pattern
Issue332 tests cover stagnation + cumulative edit failures.
They do not yet prove behavior for:
- high stagnation
- repeated reads
- no persistent mutation
- no edit-failure accumulation
Add a regression that reproduces this specific pattern and asserts:
- worker escalation happens before repair-only closure
- worker_observed=true
- runtime pack validation is mismatch unless worker is actually observed
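The shape of such a regression might resemble the toy simulation below. It is not wired to Anvil: Action, Event, Outcome, and simulate are hypothetical, and the loop is a deliberately simplified stand-in for the real agentic loop, intended only to show which assertions the test should make.

```rust
// Hypothetical simulation; none of these types exist in Anvil under these names.

#[derive(Clone, Copy)]
enum Action {
    Read,
    FailedEdit,
    Edit,
}

#[derive(Debug, PartialEq)]
enum Event {
    WorkerEscalation,
    PreExitRepair,
}

struct Outcome {
    events: Vec<Event>,
    worker_observed: bool,
    edit_failures: u32,
}

/// Runs a simplified loop: escalate after `max_stagnant_turns` turns without
/// a successful edit; fall back to pre-exit repair only if no worker ran.
fn simulate(actions: &[Action], max_stagnant_turns: u32) -> Outcome {
    let mut events = Vec::new();
    let mut stagnant = 0u32;
    let mut edit_failures = 0u32;
    let mut worker_observed = false;
    for action in actions {
        match action {
            Action::Read => stagnant += 1,
            Action::FailedEdit => {
                stagnant += 1;
                edit_failures += 1;
            }
            Action::Edit => stagnant = 0,
        }
        if !worker_observed && stagnant >= max_stagnant_turns {
            events.push(Event::WorkerEscalation);
            worker_observed = true;
        }
    }
    if !worker_observed {
        events.push(Event::PreExitRepair);
    }
    Outcome { events, worker_observed, edit_failures }
}

#[test]
fn read_heavy_drift_escalates_before_repair() {
    // Six reads, no edits at all: the pattern seen in A1.
    let outcome = simulate(&[Action::Read; 6], 4);
    assert!(outcome.worker_observed);
    assert_eq!(outcome.edit_failures, 0);
    assert_eq!(outcome.events, vec![Event::WorkerEscalation]);
}
```

A real regression would drive the actual loop with a scripted provider, but the assertions would stay the same: escalation is recorded, no edit failure was needed, and no repair-only closure occurs.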
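Candidate acceptance criteria
- read-heavy stagnation without edit-failure accumulation escalates to agent.fix_slice
- fixslice_escalation_stagnation_count > 0 is observable in the relevant Phase2 drift path
- worker_observed=true is produced in at least one live Phase2 short-smoke run
- RequiresWorkerObservation in runtime validation is satisfied only by worker_observed=true
- repair-only salvage no longer produces pack validation gate: satisfied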
Desk-check: if this issue is cleared, does it achieve the original goal?
Conditionally yes.
If this issue is cleared exactly as described above, then the remaining blockers observed in the latest Phase2 run are removed:
- the runtime can escalate to worker in the real read-heavy drift pattern
- runtime gate semantics match the worker-required benchmark gate
Under that condition, a Phase2 retest can genuinely demonstrate the original objective:
- live worker-path observation
- non-zero worker-mediated mutation evidence
- no false pass from repair-only salvage
However, clearing this issue is still not a mathematical guarantee that the benchmark will pass on the first retest. It is the right root-cause fix, but the actual goal is only achieved once a rerun shows:
- worker_observed=true
- persistent diff / touched-files evidence
- no regression in completion quality
So the desk-check conclusion is:
- Yes as a necessary fix: this issue addresses the runtime defects actually blocking Phase2
- Conditionally yes as a sufficient practical fix: if the implementation matches the acceptance criteria, the original Phase2 objective should become reachable on retest
- Final proof still requires rerun evidence