Summary
The latest Phase2 retest on HEAD 4b66172 (Merge pull request #328 from Kewton/feature/issue-327-repair-turn-closure) no longer shows the previous B1 partial close.
In autonomy_v2_p2_fix_slice_v1_20260410_114127:
- A1: session_completed, internal completion_kind=complete_unverified
- B1: session_completed, internal completion_kind=complete_unverified
But both runs also end with:
- changed_files=0
- diff_stat=no changes
- fixslice_escalation_count=0
- pre_exit_repair_injected_count=0
- pre_exit_repair_consumed_count=0
So the pack has shifted from "red because of closure bug" to "green-ish no-op completion with zero mutation and zero worker evidence."
That is not a valid answer to the Phase2 question. The pack is supposed to validate microtask worker reachability / observability under worker-on conditions, but the current prompt/source pair now allows the model to finish by audit-only closure.
This issue tracks that benchmark invalidation / pack-purpose mismatch.
Reproduction / observed behavior
- HEAD under test: 4b66172
- result dir: commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_114127/
- pack: autonomy-v2-p2-fix-slice-v1
- prompt: commandindextest/prompts/issue193_aligned.md
- frozen source: anvil_test/benchmark_sources/issue193-mpb_p3
- provider/model: Ollama qwen3.5:122b + sidecar qwen3.5:9b
- context/max output: 65536
External result shape
A1_result.json:
- exit_class=session_completed
- command_return_code=0
- changed_files=0
- diff_stat=no changes
Source: commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_114127/A1_result.json
B1_result.json:
- exit_class=session_completed
- command_return_code=0
- changed_files=0
- diff_stat=no changes
Source: commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_114127/B1_result.json
What the logs show
A1.log:
- completion_kind=complete_unverified plan_items=4 plan_finished=4
- fixslice_escalation_count=0
- pre_exit_repair_injected_count=0
- pre_exit_repair_consumed_count=0
Source: commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_114127/A1.log:469-473
B1.log:
- completion_kind=complete_unverified plan_items=6 plan_finished=6
- fixslice_escalation_count=0
- pre_exit_repair_injected_count=0
- pre_exit_repair_consumed_count=0
Source: commandindextest/results/autonomy_v2_p2_fix_slice_v1_20260410_114127/B1.log:615-619
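The counters quoted above can be extracted mechanically rather than read by eye. The following is a minimal sketch, assuming the telemetry lines are whitespace-separated key=value tokens as in the excerpts; parse_telemetry and worker_evidence are hypothetical helper names and are not part of the existing harness.

```python
def parse_telemetry(lines):
    """Collect key=value tokens from log lines into a dict (later keys win)."""
    fields = {}
    for line in lines:
        for token in line.split():
            key, sep, value = token.partition("=")
            if sep and key:
                # Numeric values become int; everything else stays a string.
                fields[key] = int(value) if value.isdigit() else value
    return fields


def worker_evidence(fields):
    """True if any worker or repair-turn counter is present and non-zero."""
    counters = (
        "fixslice_escalation_count",
        "pre_exit_repair_injected_count",
        "pre_exit_repair_consumed_count",
    )
    return any(
        isinstance(fields.get(name), int) and fields[name] > 0
        for name in counters
    )


# Lines copied from the A1.log excerpt above.
a1_lines = [
    "completion_kind=complete_unverified plan_items=4 plan_finished=4",
    "fixslice_escalation_count=0",
    "pre_exit_repair_injected_count=0",
    "pre_exit_repair_consumed_count=0",
]
a1 = parse_telemetry(a1_lines)
print(a1["completion_kind"], worker_evidence(a1))
```

On the A1 excerpt this reports a completed plan with no worker or repair-turn evidence, which is the shape this issue is about.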
So the latest suite is no longer failing because of the old B1 repair path, but it is also not exercising:
- agent.fix_slice
- file.rewrite
- pre-exit repair turn
- any diff / touched file artifact
Why this is distinct from #321 and #327
vs #327
#327 was about a real runtime correctness bug:
- repair turn runs
- response appends or refreshes pending work
- loop still terminates as partial
That failure family is not what the latest suite shows anymore.
The current suite ends complete_unverified with no mutation, no repair-turn telemetry, and no worker telemetry.
vs #321
#321 is about runtime reachability into agent.fix_slice when localized repair drift happens.
That issue may still be real, but the latest pack no longer gives us a trustworthy way to verify it, because the current prompt/source pair now permits the model to close the task as a pure audit/no-change run.
So this issue is about the benchmark pack itself no longer testing the intended property.
Source-backed root cause
1. The prompt explicitly allows an audit-only / no-change completion
commandindextest/prompts/issue193_aligned.md says this is a:
- follow-up audit / completion task
- on the current codebase
- where already-implemented items must be treated as already done / no changes needed
- and the agent should fix only the remaining real gap
It also explicitly says not to re-implement:
- DetectPromptOptions
- buildDetectPromptOptions()
- detectPromptWithOptions()
- the prompt-response route path
- the status-detector path
- the current-output indirect path
- the fact that src/lib/auto-yes-poller.ts is the live polling implementation
Source: commandindextest/prompts/issue193_aligned.md
So if the frozen source already contains those pieces, then "no changes needed" is a prompt-compliant success path.
2. The frozen benchmark source already contains the named live-path pieces
In anvil_test/benchmark_sources/issue193-mpb_p3:
- src/lib/detection/cli-patterns.ts defines buildDetectPromptOptions()
- src/lib/auto-yes-poller.ts imports and uses buildDetectPromptOptions() and calls it at line 325
- src/lib/polling/response-poller.ts contains detectPromptWithOptions() and uses buildDetectPromptOptions()
- src/lib/detection/status-detector.ts imports and uses buildDetectPromptOptions()
- src/app/api/worktrees/[id]/prompt-response/route.ts imports and uses buildDetectPromptOptions()
- src/app/api/worktrees/[id]/current-output/route.ts uses detectSessionStatus() rather than calling detectPrompt() directly
Source:
- anvil_test/benchmark_sources/issue193-mpb_p3/src/lib/detection/cli-patterns.ts
- anvil_test/benchmark_sources/issue193-mpb_p3/src/lib/auto-yes-poller.ts
- anvil_test/benchmark_sources/issue193-mpb_p3/src/lib/polling/response-poller.ts
- anvil_test/benchmark_sources/issue193-mpb_p3/src/lib/detection/status-detector.ts
- anvil_test/benchmark_sources/issue193-mpb_p3/src/app/api/worktrees/[id]/prompt-response/route.ts
- anvil_test/benchmark_sources/issue193-mpb_p3/src/app/api/worktrees/[id]/current-output/route.ts
In other words, the prompt's "already done / no duplicate implementation" guidance now matches the frozen source too well.
3. Phase2 pack intent and prompt/source semantics now contradict each other
The Phase2 pack is named and used as:
- autonomy-v2-p2-fix-slice-v1
- a worker-on short smoke gate
But the current prompt/source pair no longer demands a real localized repair.
So the pack currently asks two incompatible things at once:
- prompt semantics: "close as already-done if the real gap is gone"
- pack semantics: "produce worker-path / mutation evidence or fail Stop rule 3"
That makes the pack invalid as a worker-validation gate. A no-op completion is simultaneously:
- prompt-compliant
- but gate-invalid
4. The current harness does not distinguish healthy no-op audit from worker-validation mismatch
Today the result artifacts tell us:
- session_completed
- complete_unverified
- changed_files=0
- diff_stat=no changes
- fixslice_escalation_count=0
But they do not classify that as:
- pack invalid for this purpose
- prompt/source no-op
- worker path unexercised because task required no mutation
So the same run can be misread as any of:
- "runtime improved"
- "worker still unreachable"
- "benchmark is stale"
without first reconstructing the prompt/source semantics manually.
Impact
- ... agent.fix_slice is reachable in realistic worker-on conditions
Fix direction
1. Split audit / no-op-allowed packs from worker-validation packs
If a prompt is designed as a follow-up audit where "already done / no changes needed" is a valid success path, it should not be used as the primary Phase2 worker observability gate.
At minimum, manifests should distinguish:
- audit_only
- requires_mutation
- requires_worker_observation
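One way to carry that distinction is an explicit expectation field on each pack manifest. This is a minimal sketch only: the three labels come from the list above, while PackExpectation, PackManifest, and usable_as_worker_gate are hypothetical names, and labeling the current pair audit_only is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class PackExpectation(Enum):
    # Labels taken from the list above; the enum itself is hypothetical.
    AUDIT_ONLY = "audit_only"
    REQUIRES_MUTATION = "requires_mutation"
    REQUIRES_WORKER_OBSERVATION = "requires_worker_observation"


@dataclass(frozen=True)
class PackManifest:
    name: str
    prompt: str
    frozen_source: str
    expectation: PackExpectation


def usable_as_worker_gate(manifest: PackManifest) -> bool:
    """A pack where 'no changes needed' is a valid success cannot gate worker validation."""
    return manifest.expectation is not PackExpectation.AUDIT_ONLY


current = PackManifest(
    name="autonomy-v2-p2-fix-slice-v1",
    prompt="commandindextest/prompts/issue193_aligned.md",
    frozen_source="anvil_test/benchmark_sources/issue193-mpb_p3",
    # Assumption: how the current prompt/source pair would be labeled.
    expectation=PackExpectation.AUDIT_ONLY,
)
print(usable_as_worker_gate(current))
```

With a field like this, selecting an audit-style pack as the Phase2 gate becomes a configuration error that can be caught before any run starts.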
2. Refresh the Phase2 prompt/source pair so at least one real localized gap remains
For the worker-validation pack, freeze a source revision and prompt where:
- at least one bounded, real, localized gap is definitely missing
- the intended successful path must mutate code
- that mutation is small enough that agent.fix_slice remains a plausible bounded worker path
Without that, the pack cannot answer the Phase2 question.
3. Add harness-level expectation mismatch classification
When a pack marked requires_mutation or requires_worker_observation ends with:
- complete_unverified
- changed_files=0
- diff_stat=no changes
- fixslice_escalation_count=0
the harness should classify that explicitly as something like:
- pack_expectation_mismatch
- benchmark_invalid
- or mutationless_complete_on_worker_gate
instead of leaving it to manual interpretation.
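The rule above is small enough to sketch directly. This assumes the harness can see the pack's expectation label alongside a merged dict of the result and log fields; classify_expectation_mismatch is a hypothetical function name, and the returned label is one of the candidates listed above rather than a settled choice.

```python
from typing import Optional

# Expectation labels under which a no-op completion is not acceptable.
WORKER_GATE_EXPECTATIONS = {"requires_mutation", "requires_worker_observation"}


def classify_expectation_mismatch(pack_expectation: str, result: dict) -> Optional[str]:
    """Return a mismatch label for a mutationless completion on a worker gate, else None."""
    if pack_expectation not in WORKER_GATE_EXPECTATIONS:
        # audit_only packs may legitimately finish with no changes.
        return None
    mutationless = (
        result.get("completion_kind") == "complete_unverified"
        and result.get("changed_files") == 0
        and result.get("diff_stat") == "no changes"
        and result.get("fixslice_escalation_count") == 0
    )
    return "mutationless_complete_on_worker_gate" if mutationless else None


# Field values as reported for B1 in this issue.
b1 = {
    "completion_kind": "complete_unverified",
    "changed_files": 0,
    "diff_stat": "no changes",
    "fixslice_escalation_count": 0,
}
print(classify_expectation_mismatch("requires_mutation", b1))
print(classify_expectation_mismatch("audit_only", b1))
```

The same B1 result is flagged under requires_mutation and passes under audit_only, which is the distinction the current artifacts cannot express.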
4. Add result fields for observability
Suggested fields:
- worker_observed
- repair_turn_observed
- mutation_observed
- pack_expectation
- expectation_mismatch_reason
That would make this failure shape first-class in artifacts and reports.
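A sketch of how those fields might be derived from counters the harness already records. The field names are the ones suggested above; the mapping from counters to booleans (for example, treating fixslice_escalation_count > 0 as worker_observed) and the reason strings are assumptions, and derive_fields is a hypothetical helper.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ObservabilityFields:
    worker_observed: bool
    repair_turn_observed: bool
    mutation_observed: bool
    pack_expectation: str
    expectation_mismatch_reason: Optional[str]


def derive_fields(pack_expectation: str, counters: dict) -> ObservabilityFields:
    # Assumed mapping: any fix_slice escalation counts as worker evidence.
    worker = counters.get("fixslice_escalation_count", 0) > 0
    repair = (
        counters.get("pre_exit_repair_injected_count", 0) > 0
        or counters.get("pre_exit_repair_consumed_count", 0) > 0
    )
    mutation = counters.get("changed_files", 0) > 0

    reason = None
    if pack_expectation == "requires_mutation" and not mutation:
        reason = "no mutation on a pack that requires mutation"
    elif pack_expectation == "requires_worker_observation" and not worker:
        reason = "no worker telemetry on a pack that requires worker observation"
    return ObservabilityFields(worker, repair, mutation, pack_expectation, reason)


# Counter values as reported for A1 in this issue.
a1_counters = {
    "changed_files": 0,
    "fixslice_escalation_count": 0,
    "pre_exit_repair_injected_count": 0,
    "pre_exit_repair_consumed_count": 0,
}
a1_fields = derive_fields("requires_worker_observation", a1_counters)
print(asdict(a1_fields))
```

Emitting the booleans next to the raw counters keeps reports readable without requiring the reader to know which counter corresponds to which path.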
5. Keep runtime worker issues separate from benchmark invalidation
After the pack is refreshed, rerun the same Phase2 gate.
If the worker path is still absent on a pack that truly requires mutation, then that is runtime territory again and should stay with issues like #321.
Acceptance criteria
- ... complete_unverified with changed_files=0 without being classified as expectation mismatch
Desk-check: if this issue is cleared, does it achieve the original goal?
Short answer: no, not by itself. It restores the benchmark's ability to answer the question, but it does not itself prove Phase2 is complete.
What clearing this issue should achieve
If this issue is fixed, the Phase2 benchmark will again become meaningful as a worker-on gate:
- either it will exercise a real localized mutation path
- or it will explicitly say the pack is invalid for worker validation
That removes the current ambiguity where a no-op audit completion is being interpreted inside a worker-validation workflow.
Why it is still not sufficient alone
Even with a corrected pack, runtime questions remain separate:
- does agent.fix_slice actually become reachable?
- does parent-applies-change produce diff/touched_files evidence?
- does worker-on improve or at least not regress valid completion?
Those are still Phase2 runtime questions, currently represented by issues like #321.
So the desk-check conclusion is:
- Yes: clearing this issue is necessary to make the Phase2 gate meaningful again
- No: clearing this issue alone does not achieve the original Phase2 objective, because runtime worker reachability / adoption must still be proven on the refreshed pack