[BUG] Managed round evaluator over-indexes on incremental fixes despite high spend and strong strategic critique #994

@ncrispino

Description

Summary

In managed round_evaluator runs, we can spend more on evaluator/criteria subagents than on the parent builder while still converging on mostly incremental changes. The evaluator often recognizes higher-order problems, but the system does not reliably turn that diagnosis into a committed new implementation thesis.

This weakens MassGen's self-improvement loop: we pay for multi-agent critique, but the resulting next_tasks.json often collapses back to safe local optimization.

Observed Example

Run analyzed:
/Users/ncrispin/GitHubProjects/MassGenMore/.massgen/massgen_logs/log_20260311_135039_180110

Known spend from this run:

  • Main parent run: $9.568984
  • Criteria-generation subagent: $0.808529
  • Round evaluator sub-run: $10.914117
  • Known total: $21.29163
  • Subagents share of known total: 55.1%
  • Round evaluator share of known total: 51.3%
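As a sanity check, the shares above follow directly from the three known per-run costs:

```python
# Reproduce the spend breakdown from the three known costs in this run.
main_run = 9.568984          # main parent run
criteria_subagent = 0.808529 # criteria-generation subagent
round_evaluator = 10.914117  # round evaluator sub-run

total = main_run + criteria_subagent + round_evaluator
subagent_share = (criteria_subagent + round_evaluator) / total
evaluator_share = round_evaluator / total

print(f"{total:.5f}")          # 21.29163
print(f"{subagent_share:.1%}") # 55.1%
print(f"{evaluator_share:.1%}") # 51.3%
```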

The second answer did improve materially, but not enough to justify the evaluator spend. The main pattern was:

  • evaluator packet identified deeper issues around ceiling / generic-template feel / lack of stronger thesis
  • injected task plan still mostly resolved to corrective tasks and locally-verifiable upgrades
  • builder completed those tasks in a relatively conservative way
  • run did not cleanly converge; parent submitted answer 2, then started another evaluator stage and stalled

Why This Matters

The current behavior encourages a bad equilibrium:

  • critique is sharper than execution
  • strategic diagnosis is more ambitious than the machine-readable handoff
  • builders can satisfy transformative tasks with softer local fixes and still mark them verified
  • multi-evaluator spend does not reliably compound into a stronger strategic choice

This is especially damaging in self-improvement scenarios because the system appears thoughtful but does not force itself to act on its deeper insights.

Suspected Blockers

1. Strategic critique is not being elevated into the authoritative handoff

The evaluator packet can say things like:

  • the current approach is ceiling_approaching
  • the page still feels like a generic template with themed content swapped in
  • a different design thesis would raise the ceiling

But the parent mostly acts on next_tasks.json, not the broader critique_packet.md. If the bigger strategic change is left in approach_assessment / unexplored_approaches instead of being elevated into the chosen primary_strategy and task list, the system reverts to patching.
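A minimal sketch of what "elevating" could mean here. Field names (`approach_assessment`, `ceiling_status`, `unexplored_approaches`, `primary_strategy`) mirror the vocabulary used in this issue, not MassGen's actual schema:

```python
def elevate_strategy(critique: dict, next_tasks: dict) -> dict:
    """Hypothetical policy: if the evaluator flags a ceiling, promote one of
    its unexplored approaches into the authoritative handoff so the parent
    cannot quietly revert to patching."""
    ceiling = critique.get("approach_assessment", {}).get("ceiling_status")
    if ceiling in ("ceiling_approaching", "ceiling_reached"):
        alternatives = critique.get("unexplored_approaches", [])
        if alternatives and next_tasks.get("primary_strategy") == "incremental_patch":
            # Elevate the strategic critique into the machine-readable plan.
            next_tasks["primary_strategy"] = alternatives[0]
            next_tasks["strategy_escalated"] = True
    return next_tasks
```

The point is only that the promotion happens in the handoff object itself, so downstream execution sees the escalated strategy without reading prose.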

2. Transformative tasks are too easy to satisfy superficially

In the analyzed run, some injected tasks were phrased ambitiously, but the completion path still allowed softer interpretations (e.g. improving existing AI imagery instead of actually shifting the visual thesis).

We need a stronger notion of: "did the implementation really adopt the requested new approach, or did it perform a milder version and self-certify success?"
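One hedged way to operationalize that question: gate transformative-task verification on a structural change footprint rather than the builder's self-certification. The `diff_stats` keys here are purely illustrative; a real check would derive them from the builder's actual diff:

```python
def adopted_new_thesis(task: dict, diff_stats: dict) -> bool:
    """Hypothetical verifier check for transformative tasks: require evidence
    of structural change (sections added/removed, files rewritten), not just
    line-level edits inside existing components."""
    if task.get("kind") != "transformative":
        return True  # ordinary tasks keep their existing verification path
    structural = (diff_stats.get("sections_added", 0)
                  + diff_stats.get("sections_removed", 0))
    return structural >= 1 and diff_stats.get("files_rewritten", 0) >= 1
```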

3. The evaluator loop may be expensive without yielding a strong synthesized decision

The round evaluator child run was multi-agent and expensive, but ended with no winner. Afterward, the parent still effectively proceeded from a single packet/result path. This suggests we may be paying for multiple evaluators without reliably turning them into a decisive synthesized strategy.

4. The loop is biased toward local optimization

The system currently favors tasks that are:

  • concrete
  • testable in isolation
  • low-risk
  • easy to mark verified

That makes sense operationally, but it systematically under-selects the kinds of changes that make outputs feel significantly more professional:

  • replacing a section thesis entirely
  • deleting/merging weak sections
  • changing the visual system rather than polishing individual components
  • choosing one stronger organizing concept and rebuilding around it

Evidence in Current Design / Behavior

Relevant mechanics:

  • parent treats next_tasks.json as authoritative implementation handoff when present
  • critique_packet.md is framed as rationale/reference
  • injected tasks are mandatory, but only the injected tasks actually drive execution
  • ceiling_status and paradigm_shift are present in evaluator design, but nothing appears to force a strategy shift when the evaluator says the current approach is approaching or hitting a ceiling

Desired Outcome

When the evaluator identifies a real implementation ceiling or thesis problem, the system should be able to escalate from:

  • "fix local defects"

to:

  • "adopt a different execution thesis for the next round"

That escalation should happen inside the authoritative machine-readable handoff, not just in prose commentary.

Acceptance Criteria

  • If evaluator approach_assessment.ceiling_status is ceiling_approaching or ceiling_reached, the chosen primary_strategy in next_tasks.json cannot default to a purely incremental patch strategy without explicit justification.
  • At least one top-level task must reflect a thesis-level shift when the evaluator identifies a paradigm/ceiling problem.
  • Transformative tasks must have verification guidance that distinguishes true thesis adoption from softer local tweaks.
  • Managed multi-evaluator runs should not silently degrade into acting on effectively one packet when the evaluator child run has no decisive winner/synthesis.
  • We should be able to point to at least one log-analyzed run where evaluator cost was high, but the next round clearly adopted a bolder strategy than "fix a few defects and polish."
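The first two criteria could be enforced by a small validator on the (critique, handoff) pair. All field names below are assumptions mirroring the vocabulary in this issue, not a real MassGen schema:

```python
def validate_handoff(critique: dict, next_tasks: dict) -> list[str]:
    """Hypothetical check of the ceiling-related acceptance criteria:
    returns a list of violations (empty list means the handoff passes)."""
    errors = []
    ceiling = critique.get("approach_assessment", {}).get("ceiling_status")
    if ceiling in ("ceiling_approaching", "ceiling_reached"):
        # Criterion 1: no default incremental strategy without justification.
        if (next_tasks.get("primary_strategy") == "incremental_patch"
                and not next_tasks.get("strategy_justification")):
            errors.append("incremental strategy chosen despite ceiling, no justification")
        # Criterion 2: at least one thesis-level task must be present.
        if not any(t.get("kind") == "transformative"
                   for t in next_tasks.get("tasks", [])):
            errors.append("no thesis-level task despite ceiling/paradigm diagnosis")
    return errors
```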

Potential Fix Directions

  • Add stricter policy around ceiling_status -> next_tasks.primary_strategy
  • Require explicit strategy selection when evaluator flags a ceiling / paradigm problem
  • Add validation that at least one transformative or thesis-level task is present in that case
  • Improve verifier semantics for transformative tasks so builders cannot satisfy them with a local minimum
  • Revisit how managed round-evaluator refine mode selects and synthesizes the canonical result when child evaluators disagree
  • Consider a separate strategy-arbiter stage that chooses one direction from multiple evaluator critiques rather than letting strategic recommendations stay distributed in prose
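The strategy-arbiter idea, sketched under stated assumptions (the packet fields and the scoring rule are placeholders, chosen only to illustrate picking one canonical direction):

```python
def arbitrate(critiques: list[dict]) -> dict:
    """Hypothetical strategy-arbiter stage: select one canonical critique
    packet from several evaluators rather than leaving strategic
    recommendations distributed across prose. Placeholder scoring prefers
    packets that both flag a ceiling and name concrete alternatives."""
    def score(c: dict) -> int:
        s = 0
        if c.get("approach_assessment", {}).get("ceiling_status") in (
                "ceiling_approaching", "ceiling_reached"):
            s += 2
        return s + len(c.get("unexplored_approaches", []))
    return max(critiques, key=score)
```

A real arbiter would likely be an LLM stage rather than a scoring function, but the contract is the same: many critiques in, exactly one committed direction out.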

Notes

This issue is about the self-improvement loop itself, not just the aesthetics of one Beatles website run. The analyzed run is useful because it exposes the structural failure mode clearly: high evaluator spend, real critique quality, meaningful but still underwhelming improvement, weak strategic escalation.
