
feat: broaden remove to include irrelevant levers, shorten justification#375

Merged
neoneye merged 14 commits into main from batch-categorical-dedup-v2
Mar 21, 2026
Conversation

@neoneye
Member

@neoneye neoneye commented Mar 20, 2026

Summary

Incremental refinements on PR #374 (batch categorical dedup):

  • Remove now covers irrelevant levers: Upstream may generate levers that don't apply to the plan, so remove now covers redundant, overlapping, subset, OR irrelevant levers
  • Better overlap handling: "keep the one that better captures the strategic decision" instead of always preferring the more general one
  • Shorter justifications: ~20-30 words (down from ~40-80). Less output = llama3.1 more likely to finish within timeout
  • Fallback to secondary: Unclassified levers default to secondary, not primary — kept safe without inflating primary count
  • Triaging framing: System prompt says "triaging" instead of "deduplicating" — better describes the full scope (classify + remove irrelevant + deduplicate)
  • Renamed: BatchDeduplicationResult → DeduplicationResult

Supersedes PR #374.

Test plan

  • Run deduplicate_levers step via self-improve runner against snapshot input
  • Compare removal rates against iter 51 (PR #374: feat: batch categorical dedup — single call + primary/secondary/remove) — expect similar or slightly higher with irrelevant lever removal
  • Check if llama3.1 completes more plans within timeout (shorter justifications)
  • Verify justifications are shorter on average
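The last test-plan item can be checked mechanically. A minimal sketch (the function name and input shape are assumptions, not part of the actual runner):

```python
def avg_justification_words(justifications):
    """Average word count across a list of justification strings.

    Used to verify the ~20-30 word target; iter 52 measured haiku
    dropping from 43 to 18 words on average.
    """
    counts = [len(j.split()) for j in justifications]
    return sum(counts) / len(counts) if counts else 0.0
```

Run it over the justifications extracted from the step's output files and compare against the iter 51 baseline.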

🤖 Generated with Claude Code

neoneye and others added 14 commits March 20, 2026 21:40
Replace 18 sequential per-lever LLM calls with a single batch call.
Each lever is scored on a 5-point Likert scale:
  2 = primary (essential strategic decision)
  1 = secondary (useful but supporting)
  0 = borderline
 -1 = overlapping (absorbed by another lever)
 -2 = irrelevant (fully redundant)

Levers scoring >= 1 are kept; levers scoring <= 0 are removed.

Benefits:
- No position bias (model sees all levers simultaneously)
- Global consistency (can compare all levers before scoring)
- 18x fewer LLM calls (faster, cheaper)
- Numeric scores are more granular than categorical labels

Also:
- B1 fix: user_prompt stores project_context, not levers JSON
- enrich_potential_levers accepts optional classification field
- runner.py reports calls_succeeded=1 for single batch call
- OPTIMIZE_INSTRUCTIONS documents 6 known failure modes
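The Likert scheme and keep/remove threshold described above can be sketched as follows (the dict shape and function name are illustrative assumptions, not the actual implementation):

```python
# Likert scale from the commit message; labels are documentation only.
LIKERT = {
    2: "primary (essential strategic decision)",
    1: "secondary (useful but supporting)",
    0: "borderline",
    -1: "overlapping (absorbed by another lever)",
    -2: "irrelevant (fully redundant)",
}

def partition_by_score(scores):
    """Levers scoring >= 1 are kept; levers scoring <= 0 are removed."""
    kept = [lever_id for lever_id, s in scores.items() if s >= 1]
    removed = [lever_id for lever_id, s in scores.items() if s <= 0]
    return kept, removed
```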

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Match the format from identify_potential_levers.py: goal statement,
6-step pipeline context with "you are here" marker, and structured
known-problems section.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Score already encodes the classification (2=primary, 1=secondary, <=0=remove).
OutputLever.classification is now derived via _score_to_classification()
at output time instead of being stored redundantly in LeverDecision.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Score descriptions now ask "how relevant is this lever to this specific
plan?" instead of mapping to primary/secondary/absorb/remove categories.
This shifts the cognitive task from taxonomy classification to relevance
assessment — duplicates naturally score lower because redundant levers
are less relevant to the plan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No longer needed — score is an integer, not a copyable text label.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Combines the proven batch architecture (1 call instead of 18) with the
categorical taxonomy that produced meaningful deduplication in iter 49.

Key changes from PR #373 (Likert scoring):
- Replace Likert score with categorical primary/secondary/remove
- Reframe prompt as deduplication task, not relevance assessment
- Fixes iter 49 B2: unambiguous fallback (primary > secondary > remove)
- Fixes iter 49 B3: conditional question test for primary classification
- Fixes iter 49 S1: calibration as percentage (25-50%) not fixed count
- Add duplicate lever_id guard (keeps first, warns on duplicates)
- Add minimum lever count warning for degenerate outputs
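The duplicate lever_id guard could look roughly like this (a sketch; the actual decision shape and logging may differ):

```python
import logging

logger = logging.getLogger(__name__)

def dedupe_decisions(decisions):
    """Keep the first decision per lever_id; warn on duplicates.

    `decisions` is assumed to be a list of dicts with a "lever_id" key.
    """
    seen = {}
    for decision in decisions:
        lever_id = decision["lever_id"]
        if lever_id in seen:
            logger.warning("duplicate lever_id %s; keeping first", lever_id)
            continue
        seen[lever_id] = decision
    return list(seen.values())
```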

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ation

- Remove classification now covers: redundant, overlapping, OR irrelevant
  to the plan (upstream may generate levers that don't apply)
- Justification target shortened from ~40-80 words to ~40 words

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… general one

When two levers overlap or one is a subset, keep whichever better
captures the strategic decision rather than always preferring the
more general one.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Levers the LLM failed to classify should not be elevated to primary.
Secondary keeps them safe from removal without inflating the primary count.
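A minimal sketch of that fallback, assuming a helper of this shape exists (the name is hypothetical):

```python
from typing import Optional

VALID_CLASSIFICATIONS = {"primary", "secondary", "remove"}

def normalize_classification(raw: Optional[str]) -> str:
    """Unclassified or unrecognized levers default to 'secondary':
    kept safe from removal without inflating the primary count."""
    if raw is None:
        return "secondary"
    cleaned = raw.strip().lower()
    return cleaned if cleaned in VALID_CLASSIFICATIONS else "secondary"
```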

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The step now does more than deduplication — it also removes irrelevant
levers and classifies survivors as primary/secondary. "Triaging" better
describes the full scope.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Less output per lever should help llama3.1 complete within timeout
for all 18 levers in a single batch call.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@neoneye
Member Author

neoneye commented Mar 20, 2026

Iteration 52 Results

Verdict: YES (KEEP)

Main win: llama3.1 timeouts eliminated

Performance

  • API model justifications ~55% shorter (haiku: 43 → 18 words)
  • Avg silo duration ~25% faster across API models
  • qwen3 silo: 69.4s → 33.8s (−51%)

Dedup quality

  • Haiku silo removals improved (1 → 2)
  • "Keep the strategically better one" overlap preference observed (haiku hong_kong_game)
  • llama3.1 silo still produces 0 removes despite completing — model limitation, not code bug

Documentation debt

  • OPTIMIZE_INSTRUCTIONS still says "keep the more general lever" but prompt now says "keep the strategically better one" — needs updating

Full analysis: analysis/52_deduplicate_levers/

@neoneye
Member Author

neoneye commented Mar 20, 2026

Comparison: Iter 52 (PR #375) vs Iter 48 (main baseline)

Full comparison across all 70 output files and 14 events files.

Speed: 3.0x overall

| Model | Iter 48 | Iter 52 | Speedup |
| --- | --- | --- | --- |
| qwen3-30b-a3b | 256.8s | 41.3s | 6.2x |
| gpt-oss-20b | 131.5s | 39.6s | 3.3x |
| gemini-2.0-flash | 72.2s | 25.5s | 2.8x |
| haiku-4.5 | 86.3s | 31.6s | 2.7x |
| gpt-5-nano | 69.3s | 27.3s | 2.5x |
| gpt-4o-mini | 65.2s | 28.6s | 2.3x |
| llama3.1 (local) | 162.5s | 88.5s | 1.8x |

Lever retention: +12%

  • Iter 48 avg: 13.9/18 kept → Iter 52 avg: 15.6/18 kept
  • gpt-5-nano biggest gain: 9.2 → 14.8 (+5.6)
  • gpt-oss-20b slight decrease: 13.4 → 13.0 (most aggressive remover)

New signal: primary/secondary triage

  • Iter 48: only keep (no prioritization)
  • Iter 52: 54% primary, 31% secondary, 15% remove — downstream steps get prioritization signal
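The 54/31/15 split above is straightforward to recompute from the decisions; a sketch (function name is hypothetical):

```python
from collections import Counter

def triage_distribution(classifications):
    """Fraction of levers in each triage bucket across all output files."""
    counts = Counter(classifications)
    total = sum(counts.values())
    return {k: counts[k] / total for k in ("primary", "secondary", "remove")}
```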

Architecture

  • 18 sequential LLM calls → 1 batch call
  • No position bias, global consistency, cheaper

Full data: comparison_iter48_vs_iter52.md

@neoneye neoneye merged commit b9c9577 into main Mar 21, 2026
3 checks passed
@neoneye neoneye deleted the batch-categorical-dedup-v2 branch March 21, 2026 02:03