bug: current Phase2 short smoke pack can close complete_unverified with zero mutation, so worker-on gate no longer exercises worker path #329

## Summary

Phase2 latest retest on HEAD `4b66172` (`Merge pull request #328 from Kewton/feature/issue-327-repair-turn-closure`) no longer shows the previous B1 `partial` close.

In `autonomy_v2_p2_fix_slice_v1_20260410_114127`:

- A1: `session_completed`, internal `completion_kind=complete_unverified`
- B1: `session_completed`, internal `completion_kind=complete_unverified`

But both runs also end with:

- `changed_files=0`
- `diff_stat=no changes`
- `fixslice_escalation_count=0`
- `pre_exit_repair_injected_count=0`
- `pre_exit_repair_consumed_count=0`

So the pack has shifted from "red because of closure bug" to "green-ish no-op completion with zero mutation and zero worker evidence."

That is not a valid answer to the Phase2 question. The pack is supposed to validate microtask worker reachability / observability under worker-on conditions, but the current prompt/source pair now allows the model to finish by audit-only closure.

This issue tracks that **benchmark invalidation / pack-purpose mismatch**.

## Reproduction / observed behavior

- HEAD under test: `4b66172`
- result dir: `commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_114127/`
- pack: `autonomy-v2-p2-fix-slice-v1`
- prompt: `commandindextest/prompts/issue193_aligned.md`
- frozen source: `anvil_test/benchmark_sources/issue193-mpb_p3`
- provider/model: Ollama `qwen3.5:122b` + sidecar `qwen3.5:9b`
- context/max output: `65536`

### External result shape

`A1_result.json`:

- `exit_class=session_completed`
- `command_return_code=0`
- `changed_files=0`
- `diff_stat=no changes`

Source: `commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_114127/A1_result.json`

`B1_result.json`:

- `exit_class=session_completed`
- `command_return_code=0`
- `changed_files=0`
- `diff_stat=no changes`

Source: `commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_114127/B1_result.json`

### What the logs show

`A1.log`:

- `completion_kind=complete_unverified plan_items=4 plan_finished=4`
- `fixslice_escalation_count=0`
- `pre_exit_repair_injected_count=0`
- `pre_exit_repair_consumed_count=0`

Source: `commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_114127/A1.log:469-473`

`B1.log`:

- `completion_kind=complete_unverified plan_items=6 plan_finished=6`
- `fixslice_escalation_count=0`
- `pre_exit_repair_injected_count=0`
- `pre_exit_repair_consumed_count=0`

Source: `commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_114127/B1.log:615-619`

So the latest suite is no longer failing because of the old B1 repair path, but it is also not exercising:

- `agent.fix_slice`
- `file.rewrite`
- pre-exit repair turn
- any diff / touched file artifact
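A minimal sketch of how the "no worker evidence" reading above could be checked mechanically. It assumes the log exposes whitespace-separated `key=value` tokens as in the excerpts quoted above; `parse_telemetry` and `has_worker_evidence` are hypothetical helper names, not part of the existing harness.

```python
import re

# Assumption: telemetry appears as key=value tokens, as in the quoted log excerpts.
_KV = re.compile(r"(\w+)=(\S+)")


def parse_telemetry(log_text: str) -> dict[str, str]:
    """Collect key=value tokens from log text; a later occurrence overwrites an earlier one."""
    return {key: value for key, value in _KV.findall(log_text)}


def has_worker_evidence(fields: dict[str, str]) -> bool:
    """True if any worker or repair-turn counter is above zero."""
    counters = (
        "fixslice_escalation_count",
        "pre_exit_repair_injected_count",
        "pre_exit_repair_consumed_count",
    )
    return any(int(fields.get(name, "0")) > 0 for name in counters)


# Values copied from the A1.log excerpt in this issue.
a1_excerpt = """\
completion_kind=complete_unverified plan_items=4 plan_finished=4
fixslice_escalation_count=0
pre_exit_repair_injected_count=0
pre_exit_repair_consumed_count=0
"""

fields = parse_telemetry(a1_excerpt)
print(fields["completion_kind"], has_worker_evidence(fields))
```

With the A1 excerpt this reports `complete_unverified` together with no worker evidence, which is the shape this issue is about.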

## Why this is distinct from #321 and #327

### vs #327

#327 was about a real runtime correctness bug:

- repair turn runs
- response appends or refreshes pending work
- loop still terminates `partial`

That failure family is not what the latest suite shows anymore.

The current suite ends `complete_unverified` with no mutation, no repair-turn telemetry, and no worker telemetry.

### vs #321

#321 is about runtime reachability into `agent.fix_slice` when localized repair drift happens.

That issue may still be real, but the latest pack no longer gives us a trustworthy way to verify it, because the current prompt/source pair now permits the model to close the task as a pure audit/no-change run.

So this issue is about **the benchmark pack itself no longer testing the intended property**.

## Source-backed root cause

### 1. The prompt explicitly allows an audit-only / no-change completion

`commandindextest/prompts/issue193_aligned.md` says this is a:

- `follow-up audit / completion task`
- on the `current codebase`
- where already-implemented items must be treated as `already done / no changes needed`
- and the agent should fix only the `remaining real gap`

It also explicitly says not to re-implement:

- `DetectPromptOptions`
- `buildDetectPromptOptions()`
- `detectPromptWithOptions()`
- the prompt-response route path
- the status-detector path
- the current-output indirect path
- the fact that `src/lib/auto-yes-poller.ts` is the live polling implementation

Source: `commandindextest/prompts/issue193_aligned.md`

So if the frozen source already contains those pieces, then "no changes needed" is a prompt-compliant success path.

### 2. The frozen benchmark source already contains the named live-path pieces

In `anvil_test/benchmark_sources/issue193-mpb_p3`:

- `src/lib/detection/cli-patterns.ts` defines `buildDetectPromptOptions()`
- `src/lib/auto-yes-poller.ts` imports `buildDetectPromptOptions()` and calls it at line 325
- `src/lib/polling/response-poller.ts` contains `detectPromptWithOptions()` and uses `buildDetectPromptOptions()`
- `src/lib/detection/status-detector.ts` imports and uses `buildDetectPromptOptions()`
- `src/app/api/worktrees/[id]/prompt-response/route.ts` imports and uses `buildDetectPromptOptions()`
- `src/app/api/worktrees/[id]/current-output/route.ts` uses `detectSessionStatus()` rather than calling `detectPrompt()` directly

Source:

- `anvil_test/benchmark_sources/issue193-mpb_p3/src/lib/detection/cli-patterns.ts`
- `anvil_test/benchmark_sources/issue193-mpb_p3/src/lib/auto-yes-poller.ts`
- `anvil_test/benchmark_sources/issue193-mpb_p3/src/lib/polling/response-poller.ts`
- `anvil_test/benchmark_sources/issue193-mpb_p3/src/lib/detection/status-detector.ts`
- `anvil_test/benchmark_sources/issue193-mpb_p3/src/app/api/worktrees/[id]/prompt-response/route.ts`
- `anvil_test/benchmark_sources/issue193-mpb_p3/src/app/api/worktrees/[id]/current-output/route.ts`

In other words, the frozen source already satisfies every item the prompt marks as "already done / no duplicate implementation", so the prompt/source pair no longer guarantees that any gap is left to repair.

### 3. Phase2 pack intent and prompt/source semantics now contradict each other

The Phase2 pack is named and used as:

- `autonomy-v2-p2-fix-slice-v1`
- a worker-on short smoke gate

But the current prompt/source pair no longer demands a real localized repair.

So the pack currently asks two incompatible things at once:

1. prompt semantics: "close as already-done if the real gap is gone"
2. pack semantics: "produce worker-path / mutation evidence or fail Stop rule 3"

That makes the pack invalid as a worker-validation gate. A no-op completion is simultaneously:

- prompt-compliant
- but gate-invalid

### 4. The current harness does not distinguish `healthy no-op audit` from `worker-validation mismatch`

Today the result artifacts tell us:

- `session_completed`
- `complete_unverified`
- `changed_files=0`
- `diff_stat=no changes`
- `fixslice_escalation_count=0`

But they do not classify that as:

- `pack invalid for this purpose`
- `prompt/source no-op`
- `worker path unexercised because task required no mutation`

So the same run can be misread as either:

- "runtime improved"
- or "worker still unreachable"
- or "benchmark is stale"

without first reconstructing the prompt/source semantics manually.

## Impact

- Phase2 can appear to improve externally while still not giving any evidence about worker reachability
- The latest suite can no longer answer whether `agent.fix_slice` is reachable in realistic worker-on conditions
- Stop rule 3 is currently being tripped by a pack that no longer guarantees a mutation opportunity
- Runtime issues like #321 can neither be confirmed nor falsified cleanly by this pack

## Fix direction

### 1. Split `audit / no-op-allowed` packs from `worker-validation` packs

If a prompt is designed as a follow-up audit where "already done / no changes needed" is a valid success path, it should not be used as the primary Phase2 worker observability gate.

At minimum, manifests should distinguish:

- `audit_only`
- `requires_mutation`
- `requires_worker_observation`
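One possible shape for that manifest distinction, as a sketch only. The three expectation values come from the list above; the `PackManifest` class, its field names, and `usable_as_worker_gate` are hypothetical and do not describe the current manifest format.

```python
from dataclasses import dataclass
from enum import Enum


class PackExpectation(str, Enum):
    AUDIT_ONLY = "audit_only"
    REQUIRES_MUTATION = "requires_mutation"
    REQUIRES_WORKER_OBSERVATION = "requires_worker_observation"


@dataclass(frozen=True)
class PackManifest:
    # Hypothetical manifest fields; the real manifest schema may differ.
    name: str
    prompt: str
    frozen_source: str
    expectation: PackExpectation

    def usable_as_worker_gate(self) -> bool:
        """An audit-only pack must not be used as a worker-validation gate."""
        return self.expectation is not PackExpectation.AUDIT_ONLY


# The current pack, labeled the way its prompt/source pair behaves in practice.
current = PackManifest(
    name="autonomy-v2-p2-fix-slice-v1",
    prompt="commandindextest/prompts/issue193_aligned.md",
    frozen_source="anvil_test/benchmark_sources/issue193-mpb_p3",
    expectation=PackExpectation.AUDIT_ONLY,
)

print(current.name, current.usable_as_worker_gate())
```

Declaring the expectation on the pack, rather than inferring it from the prompt text, lets the harness reject an audit-only pack before it is run as a Phase2 gate.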

### 2. Refresh the Phase2 prompt/source pair so at least one real localized gap remains

For the worker-validation pack, freeze a source revision and prompt where:

- at least one bounded, real, localized gap is definitely missing
- the intended successful path must mutate code
- that mutation is small enough that `agent.fix_slice` remains a plausible bounded worker path

Without that, the pack cannot answer the Phase2 question.

### 3. Add harness-level expectation mismatch classification

When a pack marked `requires_mutation` or `requires_worker_observation` ends with:

- `complete_unverified`
- `changed_files=0`
- `diff_stat=no changes`
- `fixslice_escalation_count=0`

the harness should classify that explicitly as something like:

- `pack_expectation_mismatch`
- `benchmark_invalid`
- or `mutationless_complete_on_worker_gate`

instead of leaving it to manual interpretation.
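The rule above could be sketched roughly as follows. The function name `classify_mismatch` and the flat result dictionary are assumptions for illustration; the returned label is one of the candidate names listed above, not an existing harness value.

```python
from typing import Optional

# Expectations under which a mutationless completion is a mismatch.
_STRICT = {"requires_mutation", "requires_worker_observation"}


def classify_mismatch(expectation: str, result: dict) -> Optional[str]:
    """Return a mismatch label for a no-op completion on a strict pack, else None."""
    if expectation not in _STRICT:
        return None  # audit_only packs may legitimately finish with no changes
    no_op = (
        result.get("completion_kind") == "complete_unverified"
        and result.get("changed_files", 0) == 0
        and result.get("diff_stat") == "no changes"
        and result.get("fixslice_escalation_count", 0) == 0
    )
    return "mutationless_complete_on_worker_gate" if no_op else None


# Values taken from the B1 run reported in this issue.
b1 = {
    "completion_kind": "complete_unverified",
    "changed_files": 0,
    "diff_stat": "no changes",
    "fixslice_escalation_count": 0,
}

print(classify_mismatch("requires_mutation", b1))
print(classify_mismatch("audit_only", b1))
```

The same B1 result is flagged under `requires_mutation` and passes under `audit_only`, so the classification depends on the declared pack expectation rather than on the run alone.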

### 4. Add result fields for observability

Suggested fields:

- `worker_observed`
- `repair_turn_observed`
- `mutation_observed`
- `pack_expectation`
- `expectation_mismatch_reason`

That would make this failure shape first-class in artifacts and reports.
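As an illustration, the suggested fields could be derived from counters the artifacts already contain. This is a sketch under assumptions: `observability_fields` is a hypothetical helper, and mapping `worker_observed` to `fixslice_escalation_count > 0` is a guess at the intended meaning, not confirmed harness behavior.

```python
import json


def observability_fields(raw: dict, pack_expectation: str) -> dict:
    """Derive the suggested result fields from existing counters (assumed mapping)."""
    worker = raw.get("fixslice_escalation_count", 0) > 0
    repair = (
        raw.get("pre_exit_repair_injected_count", 0) > 0
        or raw.get("pre_exit_repair_consumed_count", 0) > 0
    )
    mutation = raw.get("changed_files", 0) > 0

    reason = None
    if pack_expectation == "requires_mutation" and not mutation:
        reason = "changed_files=0 on a pack that requires mutation"
    elif pack_expectation == "requires_worker_observation" and not worker:
        reason = "fixslice_escalation_count=0 on a pack that requires worker observation"

    return {
        "worker_observed": worker,
        "repair_turn_observed": repair,
        "mutation_observed": mutation,
        "pack_expectation": pack_expectation,
        "expectation_mismatch_reason": reason,
    }


# Counters as reported for the A1 run in this issue.
a1 = {
    "changed_files": 0,
    "fixslice_escalation_count": 0,
    "pre_exit_repair_injected_count": 0,
    "pre_exit_repair_consumed_count": 0,
}

print(json.dumps(observability_fields(a1, "requires_mutation"), indent=2))
```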

### 5. Keep runtime worker issues separate from benchmark invalidation

After the pack is refreshed, rerun the same Phase2 gate.

If the worker path is still absent on a pack that truly requires mutation, that is runtime territory again and should stay with issues like #321.

## Acceptance criteria

- [ ] A Phase2 worker-validation pack is defined where the frozen source definitely contains at least one real bounded missing gap
- [ ] The pack manifest or harness explicitly declares whether mutation / worker observation is required
- [ ] A pack that requires mutation can no longer silently close as `complete_unverified` with `changed_files=0` without being classified as expectation mismatch
- [ ] Reports and result JSON expose whether worker path, repair-turn path, and mutation path were actually observed
- [ ] Re-running the refreshed Phase2 worker-validation pack yields one of only two interpretable outcomes:
  - real worker/diff evidence is observed
  - expectation mismatch is raised explicitly

## Desk-check: if this issue is cleared, does it achieve the original goal?

**Short answer: no, not by itself. It restores the benchmark's ability to answer the question, but it does not itself prove Phase2 is complete.**

### What clearing this issue should achieve

If this issue is fixed, the Phase2 benchmark will again become meaningful as a worker-on gate:

- either it will exercise a real localized mutation path
- or it will explicitly say the pack is invalid for worker validation

That removes the current ambiguity where a no-op audit completion is being interpreted inside a worker-validation workflow.

### Why it is still not sufficient alone

Even with a corrected pack, runtime questions remain separate:

- does `agent.fix_slice` actually become reachable?
- does parent-applies-change produce diff/touched_files evidence?
- does worker-on improve or at least not regress valid completion?

Those are still Phase2 runtime questions, currently represented by issues like #321.

So the desk-check conclusion is:

- **Yes**: clearing this issue is necessary to make the Phase2 gate meaningful again
- **No**: clearing this issue alone does not achieve the original Phase2 objective, because runtime worker reachability / adoption must still be proven on the refreshed pack

