Phase2: read-heavy stagnation still does not reach worker path after Issue332

## Summary

The Phase2 short-smoke run after Issue332 still fails the worker-required gate.

The latest run on Anvil `95727f76ac512ddcf4a7f973e7449f59c544b63b` improved termination behavior to `session_completed` / `complete_unverified`, but it still did not produce live worker-path evidence:

- `worker_observed=false`
- `fixslice_escalation_count=0`
- `fixslice_escalation_stagnation_count=0`
- `changed_files=0`
- `diff_stat=no changes`
- `repair_turn_observed=true`
- runner result: `exit_class=expectation_mismatch`

This is a runtime behavior problem in `Anvil`, not a benchmark harness misclassification.

## Reproduction / observed behavior

Latest suite:

- results dir: `commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_162147/`
- pack: `autonomy-v2-p2-fix-slice-v1`
- expectation: `requires_worker_observation`
- provider/model: Ollama `qwen3.5:122b` + sidecar `qwen3.5:9b`

### A1 external result

`A1_result.json`:

- `valid_run=true`
- `exit_class=expectation_mismatch`
- `command_return_code=0`
- `changed_files=0`
- `diff_stat=no changes`
- `worker_observed=false`
- `repair_turn_observed=true`
- `mutation_observed=false`
- `expectation_mismatch_reason="pack requires worker_observed=true but worker path was not observed (repair_turn_observed=true)"`

### A1 telemetry / log highlights

`telemetry/A1/..._telemetry.json`:

- `completion_kind=complete_unverified`
- `worker_observed=false`
- `repair_turn_observed=true`
- `fixslice_escalation_count=0`
- `fixslice_escalation_stagnation_count=0`

`A1.log`:

- repeated `stagnation detected; forced_mode_active=true`
- `pre-exit repair turn injected`
- `pre-exit repair turn consumed; plan cleanly finished`
- `pack validation gate: satisfied`

So the runtime is still taking the path:

1. repeated read / audit drift
2. stagnation warnings
3. no worker escalation
4. late repair-turn salvage
5. `complete_unverified`
6. zero persistent diff

## Root cause

### 1. The broadened Issue332 escalation is still gated on edit failures

The stagnation-driven worker escalation in `src/app/agentic.rs` only fires when:

- `worker_observed` is still false
- `should_escalate_for_stagnation(...)` returns true

But `should_escalate_for_stagnation(...)` depends on `edit_fail_tracker.total_failures()`, not on stagnation alone.

Relevant code:

- `src/app/agentic.rs` around the Issue332 block
- `src/app/edit_fail_tracker.rs::should_escalate_for_stagnation`
- `src/app/agentic.rs` edit-fail tracker updates for failed `file.edit` / `file.edit_anchor` / `file.rewrite`

This means the current policy only broadens escalation from:

- same-path repeated failures

to:

- cross-path repeated edit failures

It does **not** cover the actual Phase2 drift pattern we just observed:

- repeated reads
- repeated replans / plan-aware suppression
- stagnation score high
- almost no failed edit attempts before pre-exit repair

In A1 there was only one visible `file.edit`, and it was the late repair-turn attempt. So `total_failures` never crossed the Issue332 threshold, and `agent.fix_slice` was never forced.

This is the primary blocking root cause.
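The gating problem can be illustrated with a small model. This is a sketch only: the tracker type, the function signature, and the threshold value below are assumptions made for illustration and are not copied from `src/app/edit_fail_tracker.rs`.

```rust
// Hypothetical, simplified model of the gate described above.
// Names, signature, and threshold are illustrative assumptions.

/// Assumed minimum number of cumulative edit failures before escalation.
const ASSUMED_FAILURE_THRESHOLD: u32 = 3;

#[derive(Default)]
struct EditFailTrackerModel {
    failures: u32,
}

impl EditFailTrackerModel {
    /// Called when a `file.edit`-style operation fails.
    fn record_failure(&mut self) {
        self.failures += 1;
    }

    fn total_failures(&self) -> u32 {
        self.failures
    }
}

/// Escalation needs stagnation AND enough failed edits.
fn should_escalate_model(stagnating: bool, tracker: &EditFailTrackerModel) -> bool {
    stagnating && tracker.total_failures() >= ASSUMED_FAILURE_THRESHOLD
}

/// A read-heavy run: every turn is stagnating, but no edit ever fails.
fn read_heavy_run_escalates(turns: u32) -> bool {
    let tracker = EditFailTrackerModel::default();
    (0..turns).any(|_| should_escalate_model(true, &tracker))
}
```

Under this model, `read_heavy_run_escalates` stays false no matter how many stagnating turns occur, which matches the A1 trace: stagnation is detected repeatedly, yet the failure counter never moves.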

### 2. Runtime-side `requires_worker_observation` semantics are still too broad

In `src/contracts/mod.rs`, runtime validation for `RequiresWorkerObservation` currently returns `Satisfied` when **any** of these are true:

- `worker_observed`
- `repair_turn_observed`
- `mutation_observed`

That means the runtime still treats repair-only or mutation-only salvage as satisfying a worker-required pack.

This is directly visible in the latest run:

- runner result: `pack_validation_result=mismatch`
- runtime log: `pack validation gate: satisfied`

So the runtime and benchmark still disagree on the definition of worker success.

This is a second runtime-side defect in `Anvil`.

## Why this belongs to Anvil

The latest benchmark runner is doing the intended strict classification:

- it requires `worker_observed=true` for `requires_worker_observation`
- it explicitly marks repair-only salvage as mismatch

The benchmark did not hide the defect. It exposed two runtime gaps:

1. worker escalation policy does not match the real drift pattern
2. runtime pack validation semantics still accept non-worker salvage

## Fix direction

### 1. Add a read-drift / no-mutation escalation path that does not require edit failures

Issue332 solved only the cross-path edit-failure case. Phase2 still needs a second trigger for the actual observed pattern:

- stagnation score high
- repeated read / search loops
- no mutation for N turns
- plan repair / final suppression already happened

Candidate implementations:

1. Escalate to `agent.fix_slice` when stagnation score stays above threshold for M turns and no mutation has been observed.
2. Escalate when repeated-read warnings or forced workset transitions cross a threshold under worker-on Phase2 conditions.
3. Escalate after `pre_exit_repair` would otherwise be injected, instead of allowing repair-only salvage to satisfy the loop.

The key requirement is that worker escalation must no longer depend exclusively on `edit_fail_tracker.total_failures()`.
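A minimal sketch of candidate 1, assuming the loop can expose a per-turn stagnation score and a mutation flag. `TurnSignals`, `ReadDriftTracker`, and the thresholds are hypothetical names introduced here, not existing Anvil types.

```rust
/// Hypothetical per-turn signals; field names are assumptions.
#[derive(Clone, Copy)]
struct TurnSignals {
    stagnation_score: f32,
    mutated: bool,
}

/// Hypothetical trigger that ignores edit failures entirely.
struct ReadDriftTracker {
    score_threshold: f32,
    required_turns: u32, // "M" in candidate 1
    stalled_turns: u32,
}

impl ReadDriftTracker {
    fn new(score_threshold: f32, required_turns: u32) -> Self {
        Self { score_threshold, required_turns, stalled_turns: 0 }
    }

    /// Feed one turn; returns true once escalation should fire.
    fn observe(&mut self, turn: TurnSignals) -> bool {
        if turn.mutated || turn.stagnation_score < self.score_threshold {
            // A mutation or a low score resets the streak.
            self.stalled_turns = 0;
        } else {
            // Score at or above threshold with no mutation: still drifting.
            self.stalled_turns += 1;
        }
        self.stalled_turns >= self.required_turns
    }
}
```

The streak resets on any observed mutation, so a run that is making real edits would not be escalated merely because its stagnation score is high.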

### 2. Tighten runtime pack validation semantics

`RequiresWorkerObservation` should only be `Satisfied` when `worker_observed=true`.

Repair-only or mutation-only salvage may still be tracked for diagnostics, but they should not satisfy the worker-required gate.

That change should be made in the runtime telemetry validation path so:

- runtime log
- telemetry artifact
- benchmark result

all agree on the same outcome.
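The difference between the reported semantics and the proposed ones can be sketched as follows. The struct, the enum, and both function names are hypothetical stand-ins; the real types in `src/contracts/mod.rs` may be shaped differently, and the name of the non-satisfied variant is assumed.

```rust
/// Hypothetical mirror of the observation flags named in this issue.
#[derive(Clone, Copy, Default)]
struct Observations {
    worker_observed: bool,
    repair_turn_observed: bool,
    mutation_observed: bool,
}

/// Assumed result type; only `Satisfied` is named in the issue text.
#[derive(Debug, PartialEq, Eq)]
enum GateResult {
    Satisfied,
    Mismatch,
}

/// Behaviour as reported: any one of the three flags satisfies the gate.
fn validate_permissive(o: Observations) -> GateResult {
    if o.worker_observed || o.repair_turn_observed || o.mutation_observed {
        GateResult::Satisfied
    } else {
        GateResult::Mismatch
    }
}

/// Proposed behaviour: only `worker_observed` satisfies the gate.
fn validate_strict(o: Observations) -> GateResult {
    if o.worker_observed {
        GateResult::Satisfied
    } else {
        GateResult::Mismatch
    }
}
```

With the A1 flags (`worker_observed=false`, `repair_turn_observed=true`, `mutation_observed=false`), the permissive version returns `Satisfied` while the strict version returns `Mismatch`, which is the classification the benchmark runner already produces.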

### 3. Keep repair-salvage telemetry, but treat it as a separate outcome

The existing counters are still useful:

- `repair_turn_observed`
- `fixslice_escalation_repair_salvage_count`

But they should stay diagnostic-only and must not stand in for worker success.

### 4. Add regression coverage for the actual read-heavy drift pattern

Issue332 tests cover stagnation + cumulative edit failures.
They do not yet prove behavior for:

- high stagnation
- repeated reads
- no persistent mutation
- no edit-failure accumulation

Add a regression that reproduces this specific pattern and asserts:

- worker escalation happens before repair-only closure
- `worker_observed=true`
- runtime pack validation reports a mismatch unless a worker is actually observed
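One possible shape for such a regression, sketched against a toy loop rather than the real agentic loop. `Event`, `simulate`, and the helper functions are hypothetical; an actual test would drive the code in `src/app/agentic.rs` and inspect its telemetry instead.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Event {
    Read,
    WorkerEscalated,
    RepairTurn,
}

/// Toy loop: every turn is a read. If `escalate_after` is set, the worker is
/// invoked once that many reads have happened; otherwise the loop exhausts
/// its turns and falls back to a repair turn.
fn simulate(read_turns: u32, escalate_after: Option<u32>) -> Vec<Event> {
    let mut trace = Vec::new();
    for turn in 1..=read_turns {
        trace.push(Event::Read);
        if matches!(escalate_after, Some(n) if turn >= n) {
            trace.push(Event::WorkerEscalated);
            return trace;
        }
    }
    trace.push(Event::RepairTurn);
    trace
}

fn worker_observed(trace: &[Event]) -> bool {
    trace.contains(&Event::WorkerEscalated)
}

/// True when the worker ran and no repair turn preceded it.
fn escalated_before_repair(trace: &[Event]) -> bool {
    let esc = trace.iter().position(|e| *e == Event::WorkerEscalated);
    let rep = trace.iter().position(|e| *e == Event::RepairTurn);
    match (esc, rep) {
        (Some(e), Some(r)) => e < r,
        (Some(_), None) => true,
        _ => false,
    }
}
```

Calling `simulate(10, None)` models the drift seen in A1 (reads followed by repair-only closure), while `simulate(10, Some(4))` models the desired behaviour, where the worker is reached before any repair turn is injected.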

## Candidate acceptance criteria

- [ ] A read-heavy stagnation pattern with no accumulating edit failures can still trigger `agent.fix_slice`
- [ ] `fixslice_escalation_stagnation_count > 0` is observable in the relevant Phase2 drift path
- [ ] `worker_observed=true` is produced in at least one live Phase2 short-smoke run
- [ ] `RequiresWorkerObservation` in runtime validation is satisfied only by `worker_observed=true`
- [ ] repair-only salvage no longer produces runtime `pack validation gate: satisfied`
- [ ] regression tests cover read-heavy stagnation, not only cross-path edit failures

## Desk-check: if this issue is cleared, does it achieve the original goal?

**Conditionally yes.**

If this issue is cleared exactly as described above, then the remaining blockers observed in the latest Phase2 run are removed:

1. the runtime can escalate to worker in the real read-heavy drift pattern
2. runtime gate semantics match the worker-required benchmark gate

Under that condition, a Phase2 retest can genuinely demonstrate the original objective:

- live worker-path observation
- non-zero worker-mediated mutation evidence
- no false pass from repair-only salvage

However, clearing this issue is still not a mathematical guarantee that the benchmark will pass on the first retest. It is the right root-cause fix, but the actual goal is only achieved once a rerun shows:

- `worker_observed=true`
- persistent diff / touched-files evidence
- no regression in completion quality

So the desk-check conclusion is:

- **Yes as a necessary fix**: this issue addresses the runtime defects actually blocking Phase2
- **Conditionally yes as a sufficient practical fix**: if the implementation matches the acceptance criteria, the original Phase2 objective should become reachable on retest
- **Final proof still requires rerun evidence**

