
bug: current Phase2 short smoke pack can close complete_unverified with zero mutation, so worker-on gate no longer exercises worker path #329

@Kewton


Summary

Phase2 latest retest on HEAD 4b66172 (Merge pull request #328 from Kewton/feature/issue-327-repair-turn-closure) no longer shows the previous B1 partial close.

In autonomy_v2_p2_fix_slice_v1_20260410_114127:

  • A1: session_completed, internal completion_kind=complete_unverified
  • B1: session_completed, internal completion_kind=complete_unverified

But both runs also end with:

  • changed_files=0
  • diff_stat=no changes
  • fixslice_escalation_count=0
  • pre_exit_repair_injected_count=0
  • pre_exit_repair_consumed_count=0

So the pack has shifted from "red because of closure bug" to "green-ish no-op completion with zero mutation and zero worker evidence."

That is not a valid answer to the Phase2 question. The pack is supposed to validate microtask worker reachability / observability under worker-on conditions, but the current prompt/source pair now allows the model to finish by audit-only closure.

This issue tracks that benchmark invalidation / pack-purpose mismatch.

Reproduction / observed behavior

  • HEAD under test: 4b66172
  • result dir: commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_114127/
  • pack: autonomy-v2-p2-fix-slice-v1
  • prompt: commandindextest/prompts/issue193_aligned.md
  • frozen source: anvil_test/benchmark_sources/issue193-mpb_p3
  • provider/model: Ollama qwen3.5:122b + sidecar qwen3.5:9b
  • context/max output: 65536

External result shape

A1_result.json:

  • exit_class=session_completed
  • command_return_code=0
  • changed_files=0
  • diff_stat=no changes

Source: commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_114127/A1_result.json

B1_result.json:

  • exit_class=session_completed
  • command_return_code=0
  • changed_files=0
  • diff_stat=no changes

Source: commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_114127/B1_result.json

What the logs show

A1.log:

  • completion_kind=complete_unverified plan_items=4 plan_finished=4
  • fixslice_escalation_count=0
  • pre_exit_repair_injected_count=0
  • pre_exit_repair_consumed_count=0

Source: commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_114127/A1.log:469-473

B1.log:

  • completion_kind=complete_unverified plan_items=6 plan_finished=6
  • fixslice_escalation_count=0
  • pre_exit_repair_injected_count=0
  • pre_exit_repair_consumed_count=0

Source: commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_114127/B1.log:615-619

So the latest suite is no longer failing because of the old B1 repair path, but it is also not exercising:

  • agent.fix_slice
  • file.rewrite
  • pre-exit repair turn
  • any diff / touched file artifact

Why this is distinct from #321 and #327

vs #327

#327 was about a real runtime correctness bug:

  • repair turn runs
  • response appends or refreshes pending work
  • loop still terminates as partial

That failure family is not what the latest suite shows anymore.

The current suite ends complete_unverified with no mutation, no repair-turn telemetry, and no worker telemetry.

vs #321

#321 is about runtime reachability into agent.fix_slice when localized repair drift happens.

That issue may still be real, but the latest pack no longer gives us a trustworthy way to verify it, because the current prompt/source pair now permits the model to close the task as a pure audit/no-change run.

So this issue is about the benchmark pack itself no longer testing the intended property.

Source-backed root cause

1. The prompt explicitly allows an audit-only / no-change completion

commandindextest/prompts/issue193_aligned.md says this is a:

  • follow-up audit / completion task
  • on the current codebase
  • where already-implemented items must be treated as already done / no changes needed
  • and the agent should fix only the remaining real gap

It also explicitly says not to re-implement:

  • DetectPromptOptions
  • buildDetectPromptOptions()
  • detectPromptWithOptions()
  • the prompt-response route path
  • the status-detector path
  • the current-output indirect path
  • the fact that src/lib/auto-yes-poller.ts is the live polling implementation

Source: commandindextest/prompts/issue193_aligned.md

So if the frozen source already contains those pieces, then "no changes needed" is a prompt-compliant success path.

2. The frozen benchmark source already contains the named live-path pieces

In anvil_test/benchmark_sources/issue193-mpb_p3:

  • src/lib/detection/cli-patterns.ts defines buildDetectPromptOptions()
  • src/lib/auto-yes-poller.ts imports buildDetectPromptOptions() and calls it at line 325
  • src/lib/polling/response-poller.ts contains detectPromptWithOptions() and uses buildDetectPromptOptions()
  • src/lib/detection/status-detector.ts imports and uses buildDetectPromptOptions()
  • src/app/api/worktrees/[id]/prompt-response/route.ts imports and uses buildDetectPromptOptions()
  • src/app/api/worktrees/[id]/current-output/route.ts uses detectSessionStatus() rather than calling detectPrompt() directly

Source:

  • anvil_test/benchmark_sources/issue193-mpb_p3/src/lib/detection/cli-patterns.ts
  • anvil_test/benchmark_sources/issue193-mpb_p3/src/lib/auto-yes-poller.ts
  • anvil_test/benchmark_sources/issue193-mpb_p3/src/lib/polling/response-poller.ts
  • anvil_test/benchmark_sources/issue193-mpb_p3/src/lib/detection/status-detector.ts
  • anvil_test/benchmark_sources/issue193-mpb_p3/src/app/api/worktrees/[id]/prompt-response/route.ts
  • anvil_test/benchmark_sources/issue193-mpb_p3/src/app/api/worktrees/[id]/current-output/route.ts

In other words, the frozen source already satisfies everything the prompt marks as "already done / no duplicate implementation", so a no-change run is now a plausible reading of the task.

3. Phase2 pack intent and prompt/source semantics now contradict each other

The Phase2 pack is named and used as:

  • autonomy-v2-p2-fix-slice-v1
  • a worker-on short smoke gate

But the current prompt/source pair no longer demands a real localized repair.

So the pack currently asks two incompatible things at once:

  1. prompt semantics: "close as already-done if the real gap is gone"
  2. pack semantics: "produce worker-path / mutation evidence or fail Stop rule 3"

That makes the pack invalid as a worker-validation gate. A no-op completion is simultaneously:

  • prompt-compliant
  • but gate-invalid

4. The current harness does not distinguish healthy no-op audit from worker-validation mismatch

Today the result artifacts tell us:

  • session_completed
  • complete_unverified
  • changed_files=0
  • diff_stat=no changes
  • fixslice_escalation_count=0

But they do not classify that as:

  • pack invalid for this purpose
  • prompt/source no-op
  • worker path unexercised because task required no mutation

So, unless the reader first reconstructs the prompt/source semantics by hand, the same run can be misread as any of:

  • "runtime improved"
  • "worker still unreachable"
  • "benchmark is stale"

Impact

Fix direction

1. Split audit / no-op-allowed packs from worker-validation packs

If a prompt is designed as a follow-up audit where "already done / no changes needed" is a valid success path, it should not be used as the primary Phase2 worker observability gate.

At minimum, manifests should distinguish:

  • audit_only
  • requires_mutation
  • requires_worker_observation

2. Refresh the Phase2 prompt/source pair so at least one real localized gap remains

For the worker-validation pack, freeze a source revision and prompt where:

  • at least one bounded, real, localized gap is definitely missing
  • the intended successful path must mutate code
  • that mutation is small enough that agent.fix_slice remains a plausible bounded worker path

Without that, the pack cannot answer the Phase2 question.

3. Add harness-level expectation mismatch classification

When a pack marked requires_mutation or requires_worker_observation ends with:

  • complete_unverified
  • changed_files=0
  • diff_stat=no changes
  • fixslice_escalation_count=0

the harness should classify that explicitly as something like:

  • pack_expectation_mismatch
  • benchmark_invalid
  • or mutationless_complete_on_worker_gate

instead of leaving it to manual interpretation.
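The rule above could be sketched as follows. This assumes the harness can merge the result JSON and the log counters into one flat dict; `classify_mismatch` and `STRICT_EXPECTATIONS` are hypothetical names, and the returned label is just one of the candidates listed above.

```python
from typing import Optional

# Expectations under which a zero-mutation completion is suspicious.
STRICT_EXPECTATIONS = {"requires_mutation", "requires_worker_observation"}


def classify_mismatch(expectation: str, result: dict) -> Optional[str]:
    """Return a mismatch label, or None if the run is consistent with the pack.

    `result` is assumed to be a flat dict combining result JSON fields
    (changed_files, diff_stat) and log counters (completion_kind,
    fixslice_escalation_count).
    """
    if expectation not in STRICT_EXPECTATIONS:
        return None
    completed = result.get("completion_kind") == "complete_unverified"
    no_mutation = (
        result.get("changed_files", 0) == 0
        and result.get("diff_stat") == "no changes"
    )
    no_worker = result.get("fixslice_escalation_count", 0) == 0
    if completed and no_mutation and no_worker:
        return "mutationless_complete_on_worker_gate"
    return None
```

Fed the values reported for A1 and B1 in this issue, the function returns the mismatch label for a strict pack and None for an audit-only pack.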

4. Add result fields for observability

Suggested fields:

  • worker_observed
  • repair_turn_observed
  • mutation_observed
  • pack_expectation
  • expectation_mismatch_reason

That would make this failure shape first-class in artifacts and reports.
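As a rough illustration, these fields could be derived from the counters that already appear in the logs. The mapping from counters to booleans below is an assumption made for this sketch, as are the `Observability` dataclass, the `derive_observability` function, and the reason strings; none of them exist in the harness today.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Observability:
    worker_observed: bool
    repair_turn_observed: bool
    mutation_observed: bool
    pack_expectation: str
    expectation_mismatch_reason: Optional[str]


def derive_observability(counters: dict, pack_expectation: str) -> Observability:
    """Map existing counters onto the suggested fields (assumed mapping)."""
    worker = counters.get("fixslice_escalation_count", 0) > 0
    repair = (
        counters.get("pre_exit_repair_injected_count", 0) > 0
        or counters.get("pre_exit_repair_consumed_count", 0) > 0
    )
    mutation = counters.get("changed_files", 0) > 0

    reason = None
    if pack_expectation == "requires_mutation" and not mutation:
        reason = "no mutation observed on a pack that requires mutation"
    elif pack_expectation == "requires_worker_observation" and not worker:
        reason = "no worker telemetry on a pack that requires worker observation"

    return Observability(worker, repair, mutation, pack_expectation, reason)
```

Calling `asdict(derive_observability(counters, "requires_mutation"))` yields a plain dict that could be merged into the result JSON.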

5. Keep runtime worker issues separate from benchmark invalidation

After the pack is refreshed, rerun the same Phase2 gate.

If the worker path is still absent on a pack that truly requires mutation, that is a runtime problem again and should stay with issues like #321.

Acceptance criteria

  • A Phase2 worker-validation pack is defined where the frozen source definitely has at least one real, bounded gap
  • The pack manifest or harness explicitly declares whether mutation / worker observation is required
  • A pack that requires mutation can no longer silently close as complete_unverified with changed_files=0 without being classified as an expectation mismatch
  • Reports and result JSON expose whether worker path, repair-turn path, and mutation path were actually observed
  • Re-running the refreshed Phase2 worker-validation pack yields one of only two interpretable outcomes:
    • real worker/diff evidence is observed
    • expectation mismatch is raised explicitly

Desk-check: if this issue is cleared, does it achieve the original goal?

Short answer: no, not by itself. It restores the benchmark's ability to answer the question, but it does not itself prove Phase2 is complete.

What clearing this issue should achieve

If this issue is fixed, the Phase2 benchmark will again become meaningful as a worker-on gate:

  • either it will exercise a real localized mutation path
  • or it will explicitly say the pack is invalid for worker validation

That removes the current ambiguity where a no-op audit completion is being interpreted inside a worker-validation workflow.

Why it is still not sufficient alone

Even with a corrected pack, runtime questions remain separate:

  • does agent.fix_slice actually become reachable?
  • does parent-applies-change produce diff/touched_files evidence?
  • does worker-on improve or at least not regress valid completion?

Those are still Phase2 runtime questions, currently represented by issues like #321.

So the desk-check conclusion is:

  • Yes: clearing this issue is necessary to make the Phase2 gate meaningful again
  • No: clearing this issue alone does not achieve the original Phase2 objective, because runtime worker reachability / adoption must still be proven on the refreshed pack
