[BUG] Managed round evaluator over-indexes on incremental fixes despite high spend and strong strategic critique #994

@ncrispino

Description

Summary

In managed round_evaluator runs, we can spend more on evaluator/criteria subagents than on the parent builder while still converging on mostly incremental changes. The evaluator often recognizes higher-order problems, but the system does not reliably turn that diagnosis into a committed new implementation thesis.

This weakens MassGen's self-improvement loop: we pay for multi-agent critique, but the resulting next_tasks.json often collapses back to safe local optimization.

Observed Example

Run analyzed:
/Users/ncrispin/GitHubProjects/MassGenMore/.massgen/massgen_logs/log_20260311_135039_180110

Known spend from this run:

  • Main parent run: $9.568984
  • Criteria-generation subagent: $0.808529
  • Round evaluator sub-run: $10.914117
  • Known total: $21.29163
  • Subagents share of known total: 55.1%
  • Round evaluator share of known total: 51.3%
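As a sanity check, the shares above follow directly from the three known per-run costs:

```python
# Reproduce the spend breakdown from the three known costs in this run.
main_run = 9.568984          # main parent run
criteria_subagent = 0.808529 # criteria-generation subagent
round_evaluator = 10.914117  # round evaluator sub-run

total = main_run + criteria_subagent + round_evaluator
subagent_share = (criteria_subagent + round_evaluator) / total
evaluator_share = round_evaluator / total

print(f"{total:.5f}")          # 21.29163
print(f"{subagent_share:.1%}") # 55.1%
print(f"{evaluator_share:.1%}") # 51.3%
```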

The second answer did improve materially, but not enough to justify the evaluator spend. The main pattern was:

  • evaluator packet identified deeper issues around ceiling / generic-template feel / lack of stronger thesis
  • injected task plan still mostly resolved to corrective tasks and locally-verifiable upgrades
  • builder completed those tasks in a relatively conservative way
  • run did not cleanly converge; parent submitted answer 2, then started another evaluator stage and stalled

Why This Matters

The current behavior encourages a bad equilibrium:

  • critique is sharper than execution
  • strategic diagnosis is more ambitious than the machine-readable handoff
  • builders can satisfy transformative tasks with softer local fixes and still mark them verified
  • multi-evaluator spend does not reliably compound into a stronger strategic choice

This is especially damaging in self-improvement scenarios because the system appears thoughtful but does not force itself to act on its deeper insights.

Suspected Blockers

1. Strategic critique is not being elevated into the authoritative handoff

The evaluator packet can say things like:

  • the current approach is ceiling_approaching
  • the page still feels like a generic template with themed content swapped in
  • a different design thesis would raise the ceiling

But the parent mostly acts on next_tasks.json, not the broader critique_packet.md. If the bigger strategic change is left in approach_assessment / unexplored_approaches instead of being elevated into the chosen primary_strategy and task list, the system reverts to patching.
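A minimal sketch of what "elevating" could mean here. Field names (`approach_assessment`, `ceiling_status`, `unexplored_approaches`, `primary_strategy`) mirror the vocabulary used in this issue, not MassGen's actual schema:

```python
def elevate_strategy(critique: dict, next_tasks: dict) -> dict:
    """Hypothetical policy: if the evaluator flags a ceiling, promote one of
    its unexplored approaches into the authoritative handoff so the parent
    cannot quietly revert to patching."""
    ceiling = critique.get("approach_assessment", {}).get("ceiling_status")
    if ceiling in ("ceiling_approaching", "ceiling_reached"):
        alternatives = critique.get("unexplored_approaches", [])
        if alternatives and next_tasks.get("primary_strategy") == "incremental_patch":
            # Elevate the strategic critique into the machine-readable plan.
            next_tasks["primary_strategy"] = alternatives[0]
            next_tasks["strategy_escalated"] = True
    return next_tasks
```

The point is only that the promotion happens in the handoff object itself, so downstream execution sees the escalated strategy without reading prose.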

2. Transformative tasks are too easy to satisfy superficially

In the analyzed run, some injected tasks were phrased ambitiously, but the completion path still allowed softer interpretations (e.g. improving existing AI imagery instead of actually shifting the visual thesis).

We need a stronger notion of: "did the implementation really adopt the requested new approach, or did it perform a milder version and self-certify success?"
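One hedged way to operationalize that question: gate transformative-task verification on a structural change footprint rather than the builder's self-certification. The `diff_stats` keys here are purely illustrative; a real check would derive them from the builder's actual diff:

```python
def adopted_new_thesis(task: dict, diff_stats: dict) -> bool:
    """Hypothetical verifier check for transformative tasks: require evidence
    of structural change (sections added/removed, files rewritten), not just
    line-level edits inside existing components."""
    if task.get("kind") != "transformative":
        return True  # ordinary tasks keep their existing verification path
    structural = (diff_stats.get("sections_added", 0)
                  + diff_stats.get("sections_removed", 0))
    return structural >= 1 and diff_stats.get("files_rewritten", 0) >= 1
```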

3. The evaluator loop may be expensive without yielding a strong synthesized decision

The round evaluator child run was multi-agent and expensive, but ended with no winner. Afterward, the parent still effectively proceeded from a single packet/result path. This suggests we may be paying for multiple evaluators without reliably turning them into a decisive synthesized strategy.

4. The loop is biased toward local optimization

The system currently favors tasks that are:

  • concrete
  • testable in isolation
  • low-risk
  • easy to mark verified

That makes sense operationally, but it systematically under-selects the kinds of changes that make outputs feel significantly more professional:

  • replacing a section thesis entirely
  • deleting/merging weak sections
  • changing the visual system rather than polishing individual components
  • choosing one stronger organizing concept and rebuilding around it

Evidence in Current Design / Behavior

Relevant mechanics:

  • parent treats next_tasks.json as authoritative implementation handoff when present
  • critique_packet.md is framed as rationale/reference
  • injected tasks are mandatory, but only the injected tasks actually drive execution
  • ceiling_status and paradigm_shift are present in evaluator design, but nothing appears to force a strategy shift when the evaluator says the current approach is approaching or hitting a ceiling

Desired Outcome

When the evaluator identifies a real implementation ceiling or thesis problem, the system should be able to escalate from:

  • "fix local defects"

to:

  • "adopt a different execution thesis for the next round"

That escalation should happen inside the authoritative machine-readable handoff, not just in prose commentary.

Acceptance Criteria

  • If evaluator approach_assessment.ceiling_status is ceiling_approaching or ceiling_reached, the chosen primary_strategy in next_tasks.json cannot default to a purely incremental patch strategy without explicit justification.
  • At least one top-level task must reflect a thesis-level shift when the evaluator identifies a paradigm/ceiling problem.
  • Transformative tasks must have verification guidance that distinguishes true thesis adoption from softer local tweaks.
  • Managed multi-evaluator runs should not silently degrade into acting on effectively one packet when the evaluator child run has no decisive winner/synthesis.
  • We should be able to point to at least one log-analyzed run where evaluator cost was high, but the next round clearly adopted a bolder strategy than "fix a few defects and polish."
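The first two criteria could be enforced by a small validator on the (critique, handoff) pair. All field names below are assumptions mirroring the vocabulary in this issue, not a real MassGen schema:

```python
def validate_handoff(critique: dict, next_tasks: dict) -> list[str]:
    """Hypothetical check of the ceiling-related acceptance criteria:
    returns a list of violations (empty list means the handoff passes)."""
    errors = []
    ceiling = critique.get("approach_assessment", {}).get("ceiling_status")
    if ceiling in ("ceiling_approaching", "ceiling_reached"):
        # Criterion 1: no default incremental strategy without justification.
        if (next_tasks.get("primary_strategy") == "incremental_patch"
                and not next_tasks.get("strategy_justification")):
            errors.append("incremental strategy chosen despite ceiling, no justification")
        # Criterion 2: at least one thesis-level task must be present.
        if not any(t.get("kind") == "transformative"
                   for t in next_tasks.get("tasks", [])):
            errors.append("no thesis-level task despite ceiling/paradigm diagnosis")
    return errors
```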

Potential Fix Directions

  • Add stricter policy around ceiling_status -> next_tasks.primary_strategy
  • Require explicit strategy selection when evaluator flags a ceiling / paradigm problem
  • Add validation that at least one transformative or thesis-level task is present in that case
  • Improve verifier semantics for transformative tasks so builders cannot satisfy them with a local minimum
  • Revisit how managed round-evaluator refine mode selects and synthesizes the canonical result when child evaluators disagree
  • Consider a separate strategy-arbiter stage that chooses one direction from multiple evaluator critiques rather than letting strategic recommendations stay distributed in prose
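The strategy-arbiter idea, sketched under stated assumptions (the packet fields and the scoring rule are placeholders, chosen only to illustrate picking one canonical direction):

```python
def arbitrate(critiques: list[dict]) -> dict:
    """Hypothetical strategy-arbiter stage: select one canonical critique
    packet from several evaluators rather than leaving strategic
    recommendations distributed across prose. Placeholder scoring prefers
    packets that both flag a ceiling and name concrete alternatives."""
    def score(c: dict) -> int:
        s = 0
        if c.get("approach_assessment", {}).get("ceiling_status") in (
                "ceiling_approaching", "ceiling_reached"):
            s += 2
        return s + len(c.get("unexplored_approaches", []))
    return max(critiques, key=score)
```

A real arbiter would likely be an LLM stage rather than a scoring function, but the contract is the same: many critiques in, exactly one committed direction out.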

Notes

This issue is about the self-improvement loop itself, not just the aesthetics of one Beatles website run. The analyzed run is useful because it exposes the structural failure mode clearly: high evaluator spend, real critique quality, meaningful but still underwhelming improvement, weak strategic escalation.
