Phase2: read-heavy stagnation still does not reach worker path after Issue332 #334

@Kewton

Summary

Phase2 short smoke after Issue332 still fails the worker-required gate.

The latest run on Anvil commit 95727f76ac512ddcf4a7f973e7449f59c544b63b improved termination behavior to session_completed / complete_unverified, but it still produced no live worker-path evidence:

  • worker_observed=false
  • fixslice_escalation_count=0
  • fixslice_escalation_stagnation_count=0
  • changed_files=0
  • diff_stat=no changes
  • repair_turn_observed=true
  • runner result: exit_class=expectation_mismatch

This is a runtime behavior problem in Anvil, not a benchmark harness misclassification.

Reproduction / observed behavior

Latest suite:

  • results dir: commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_162147/
  • pack: autonomy-v2-p2-fix-slice-v1
  • expectation: requires_worker_observation
  • provider/model: Ollama qwen3.5:122b + sidecar qwen3.5:9b

A1 external result

A1_result.json:

  • valid_run=true
  • exit_class=expectation_mismatch
  • command_return_code=0
  • changed_files=0
  • diff_stat=no changes
  • worker_observed=false
  • repair_turn_observed=true
  • mutation_observed=false
  • expectation_mismatch_reason="pack requires worker_observed=true but worker path was not observed (repair_turn_observed=true)"

A1 telemetry / log highlights

telemetry/A1/..._telemetry.json:

  • completion_kind=complete_unverified
  • worker_observed=false
  • repair_turn_observed=true
  • fixslice_escalation_count=0
  • fixslice_escalation_stagnation_count=0

A1.log:

  • repeated stagnation detected; forced_mode_active=true
  • pre-exit repair turn injected
  • pre-exit repair turn consumed; plan cleanly finished
  • pack validation gate: satisfied

So the runtime is still taking the path:

  1. repeated read / audit drift
  2. stagnation warnings
  3. no worker escalation
  4. late repair-turn salvage
  5. complete_unverified
  6. zero persistent diff

Root cause

1. The broadened Issue332 escalation is still gated on edit failures

The stagnation-driven worker escalation in src/app/agentic.rs only fires when:

  • worker_observed is still false
  • should_escalate_for_stagnation(...) returns true

But should_escalate_for_stagnation(...) depends on edit_fail_tracker.total_failures(), not on stagnation alone.

Relevant code:

  • src/app/agentic.rs around the Issue332 block
  • src/app/edit_fail_tracker.rs::should_escalate_for_stagnation
  • src/app/agentic.rs edit-fail tracker updates for failed file.edit / file.edit_anchor / file.rewrite

This means the current policy only broadens escalation from:

  • same-path repeated failures

to:

  • cross-path repeated edit failures

It does not cover the actual Phase2 drift pattern we just observed:

  • repeated reads
  • repeated replans / plan-aware suppression
  • stagnation score high
  • almost no failed edit attempts before pre-exit repair

In A1 there was only one visible file.edit, and it was the late repair-turn attempt. So total_failures never crossed the Issue332 threshold, and agent.fix_slice was never forced.

This is the primary blocking root cause.
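The gating described above can be sketched as follows, to make the failure mode concrete. This is a minimal illustration, not the code in src/app/edit_fail_tracker.rs: the struct layout, the function signature, and the threshold value are all assumptions, since the issue does not state them.

```rust
/// Hypothetical stand-in for the tracker in src/app/edit_fail_tracker.rs.
#[derive(Default)]
pub struct EditFailTracker {
    failures: u32,
}

impl EditFailTracker {
    /// Record one failed file.edit / file.edit_anchor / file.rewrite call.
    pub fn record_failure(&mut self) {
        self.failures += 1;
    }

    /// Cumulative failure count across all paths.
    pub fn total_failures(&self) -> u32 {
        self.failures
    }
}

/// Assumed value; the actual Issue332 threshold is not stated in this issue.
pub const EDIT_FAILURE_THRESHOLD: u32 = 3;

/// Sketch of the current policy: stagnation alone is not enough, the
/// cumulative edit-failure count must also reach the threshold.
pub fn should_escalate_for_stagnation(
    stagnation_detected: bool,
    tracker: &EditFailTracker,
) -> bool {
    stagnation_detected && tracker.total_failures() >= EDIT_FAILURE_THRESHOLD
}
```

Under this shape, a run like A1 (stagnation detected, at most one failed edit) can never return true, no matter how long the read loop continues.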

2. Runtime-side requires_worker_observation semantics are still too broad

In src/contracts/mod.rs, runtime validation for RequiresWorkerObservation currently returns Satisfied when any of these are true:

  • worker_observed
  • repair_turn_observed
  • mutation_observed

That means the runtime still treats repair-only or mutation-only salvage as satisfying a worker-required pack.

This is directly visible in the latest run:

  • runner result: pack_validation_result=mismatch
  • runtime log: pack validation gate: satisfied

So the runtime and benchmark still disagree on the definition of worker success.

This is a second runtime-side defect in Anvil.
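The disagreement can be reproduced with a small sketch that applies both definitions to the A1 flags. The type and function names here are hypothetical; only the any-of-three versus worker-only semantics come from this issue.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum GateResult {
    Satisfied,
    Mismatch,
}

/// Hypothetical snapshot of the three observation flags.
#[derive(Debug, Clone, Copy)]
pub struct Observations {
    pub worker_observed: bool,
    pub repair_turn_observed: bool,
    pub mutation_observed: bool,
}

/// Sketch of the current runtime semantics: any one flag satisfies the gate.
pub fn runtime_gate_current(o: Observations) -> GateResult {
    if o.worker_observed || o.repair_turn_observed || o.mutation_observed {
        GateResult::Satisfied
    } else {
        GateResult::Mismatch
    }
}

/// Sketch of the benchmark runner semantics: only worker_observed counts.
pub fn runner_gate(o: Observations) -> GateResult {
    if o.worker_observed {
        GateResult::Satisfied
    } else {
        GateResult::Mismatch
    }
}
```

With the A1 values (worker_observed=false, repair_turn_observed=true, mutation_observed=false), the first function yields Satisfied and the second yields Mismatch, which matches the runtime log and the runner result respectively.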

Why this belongs to Anvil

The latest benchmark runner is doing the intended strict classification:

  • it requires worker_observed=true for requires_worker_observation
  • it explicitly marks repair-only salvage as mismatch

The benchmark did not hide the defect. It exposed two runtime gaps:

  1. worker escalation policy does not match the real drift pattern
  2. runtime pack validation semantics still accept non-worker salvage

Fix direction

1. Add a read-drift / no-mutation escalation path that does not require edit failures

Issue332 solved only the cross-path edit-failure case. Phase2 still needs a second trigger for the actual observed pattern:

  • stagnation score high
  • repeated read / search loops
  • no mutation for N turns
  • plan repair / final suppression already happened

Candidate implementations:

  1. Escalate to agent.fix_slice when stagnation score stays above threshold for M turns and no mutation has been observed.
  2. Escalate when repeated-read warnings or forced workset transitions cross a threshold under worker-on Phase2 conditions.
  3. Escalate after pre_exit_repair would otherwise be injected, instead of allowing repair-only salvage to satisfy the loop.

The key requirement is that worker escalation must no longer depend exclusively on edit_fail_tracker.total_failures().
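Candidate 1 could look roughly like the sketch below. It counts consecutive turns in which the stagnation score is at or above a threshold and nothing was mutated, and it never consults edit failures. All names (`TurnSummary`, `DriftPolicy`, `ReadDriftTracker`) and the idea of a per-turn numeric score are assumptions for illustration, not existing Anvil types.

```rust
/// Hypothetical per-turn summary; field names are illustrative.
pub struct TurnSummary {
    pub stagnation_score: f64,
    pub mutated: bool,
}

/// Hypothetical policy knobs: score threshold and the turn count M.
pub struct DriftPolicy {
    pub stagnation_threshold: f64,
    pub max_stalled_turns: u32,
}

/// Counts consecutive high-stagnation turns without any mutation.
#[derive(Default)]
pub struct ReadDriftTracker {
    stalled_turns: u32,
}

impl ReadDriftTracker {
    /// Feed one turn. Returns true when escalation to agent.fix_slice
    /// should be forced. Edit failures play no part in the decision.
    pub fn observe(
        &mut self,
        turn: &TurnSummary,
        policy: &DriftPolicy,
        worker_observed: bool,
    ) -> bool {
        // A mutation or a drop in stagnation resets the streak.
        if turn.mutated || turn.stagnation_score < policy.stagnation_threshold {
            self.stalled_turns = 0;
            return false;
        }
        self.stalled_turns += 1;
        !worker_observed && self.stalled_turns >= policy.max_stalled_turns
    }
}
```

Resetting the streak on any mutation keeps the trigger from firing during productive runs; whether a single successful edit should fully reset it is a design choice this sketch does not settle.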

2. Tighten runtime pack validation semantics

RequiresWorkerObservation should only be Satisfied when worker_observed=true.

Repair-only or mutation-only salvage may still be tracked for diagnostics, but it should not satisfy the worker-required gate.

That change should be made in the runtime telemetry validation path so:

  • runtime log
  • telemetry artifact
  • benchmark result

all agree on the same outcome.

3. Keep repair-salvage telemetry, but treat it as a separate outcome

The existing counters are still useful:

  • repair_turn_observed
  • fixslice_escalation_repair_salvage_count

But they should stay diagnostic-only and must not stand in for worker success.
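Fix directions 2 and 3 can be sketched together: the gate result depends only on worker_observed, while salvage is reported alongside it as a separate diagnostic field. The `GateReport` and `Salvage` types are hypothetical and are not taken from src/contracts/mod.rs.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum GateResult {
    Satisfied,
    Mismatch,
}

/// Diagnostic-only classification of non-worker salvage.
#[derive(Debug, PartialEq, Eq)]
pub enum Salvage {
    None,
    RepairOnly,
    MutationOnly,
    RepairAndMutation,
}

/// Hypothetical snapshot of the three observation flags.
pub struct Observations {
    pub worker_observed: bool,
    pub repair_turn_observed: bool,
    pub mutation_observed: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub struct GateReport {
    pub result: GateResult,
    pub salvage: Salvage,
}

/// Tightened semantics: only worker_observed can satisfy the gate.
/// Salvage is still recorded, but it never changes `result`.
pub fn validate_requires_worker_observation(o: &Observations) -> GateReport {
    let result = if o.worker_observed {
        GateResult::Satisfied
    } else {
        GateResult::Mismatch
    };
    let salvage = match (o.repair_turn_observed, o.mutation_observed) {
        (false, false) => Salvage::None,
        (true, false) => Salvage::RepairOnly,
        (false, true) => Salvage::MutationOnly,
        (true, true) => Salvage::RepairAndMutation,
    };
    GateReport { result, salvage }
}
```

Applied to the A1 flags, this reports Mismatch with RepairOnly salvage, so the runtime log, telemetry artifact, and benchmark result would agree while the repair turn remains visible for diagnostics.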

4. Add regression coverage for the actual read-heavy drift pattern

Issue332 tests cover stagnation + cumulative edit failures.
They do not yet prove behavior for:

  • high stagnation
  • repeated reads
  • no persistent mutation
  • no edit-failure accumulation

Add a regression that reproduces this specific pattern and asserts:

  • worker escalation happens before repair-only closure
  • worker_observed=true
  • runtime pack validation reports mismatch unless the worker is actually observed
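The shape of such a regression can be sketched against a toy model of the loop. Everything here is hypothetical: the real test would drive the agentic loop in src/app/agentic.rs, whereas this model only encodes the ordering property to assert (escalation before repair-only closure, with zero edit failures).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    Read,
    WorkerEscalation,
    PreExitRepair,
}

/// Toy model of a read-heavy run with no edit failures.
/// `escalate_after = None` models the current behavior (no read-drift
/// trigger); `Some(m)` models escalation after m stalled read turns.
pub fn simulate_read_heavy_run(turn_budget: u32, escalate_after: Option<u32>) -> Vec<Event> {
    let mut events = Vec::new();
    for turn in 1..=turn_budget {
        events.push(Event::Read);
        if let Some(m) = escalate_after {
            if turn >= m {
                events.push(Event::WorkerEscalation);
                return events;
            }
        }
    }
    // Budget exhausted without escalation: fall back to repair salvage.
    events.push(Event::PreExitRepair);
    events
}

pub fn worker_observed(events: &[Event]) -> bool {
    events.contains(&Event::WorkerEscalation)
}

/// True when a worker escalation occurs and no pre-exit repair precedes it.
pub fn escalates_before_repair(events: &[Event]) -> bool {
    let escalation = events.iter().position(|e| *e == Event::WorkerEscalation);
    let repair = events.iter().position(|e| *e == Event::PreExitRepair);
    match (escalation, repair) {
        (Some(e), Some(r)) => e < r,
        (Some(_), None) => true,
        _ => false,
    }
}
```

A regression built this way should fail against the current behavior (the `None` case ends in PreExitRepair with no worker event) and pass once a read-drift trigger exists.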

Candidate acceptance criteria

  • A read-heavy stagnation pattern with no accumulating edit failures can still trigger agent.fix_slice
  • fixslice_escalation_stagnation_count > 0 is observable in the relevant Phase2 drift path
  • worker_observed=true is produced in at least one live Phase2 short-smoke run
  • RequiresWorkerObservation in runtime validation is satisfied only by worker_observed=true
  • repair-only salvage no longer produces runtime pack validation gate: satisfied
  • regression tests cover read-heavy stagnation, not only cross-path edit failures

Desk-check: if this issue is cleared, does it achieve the original goal?

Conditionally yes.

If this issue is cleared exactly as described above, then the remaining blockers observed in the latest Phase2 run are removed:

  1. the runtime can escalate to the worker path in the real read-heavy drift pattern
  2. runtime gate semantics match the worker-required benchmark gate

Under that condition, a Phase2 retest can genuinely demonstrate the original objective:

  • live worker-path observation
  • non-zero worker-mediated mutation evidence
  • no false pass from repair-only salvage

However, clearing this issue is still not a mathematical guarantee that the benchmark will pass on the first retest. It is the right root-cause fix, but the actual goal is only achieved once a rerun shows:

  • worker_observed=true
  • persistent diff / touched-files evidence
  • no regression in completion quality

So the desk-check conclusion is:

  • Yes as a necessary fix: this issue addresses the runtime defects actually blocking Phase2
  • Conditionally yes as a sufficient practical fix: if the implementation matches the acceptance criteria, the original Phase2 objective should become reachable on retest
  • Final proof still requires rerun evidence
