
feat: single-call Likert scoring for deduplicate_levers #373

Merged: 6 commits merged into main from single-call-dedup-scoring, Mar 21, 2026

feat: single-call Likert scoring for deduplicate_levers#373
6 commits merged intomainfrom
single-call-dedup-scoring

Conversation

neoneye (Member) commented Mar 20, 2026

Summary

  • Replaces 18 sequential per-lever LLM calls with a single batch call
  • Each lever scored on a 5-point Likert relevance scale: how relevant is this lever to this specific plan?
    • 2 = highly relevant, 1 = somewhat relevant, 0 = borderline, -1 = low relevance, -2 = irrelevant
  • Levers scoring >= 1 kept; levers scoring <= 0 removed
  • Supersedes PR #372 (feat: simplify lever classification to primary/secondary/remove)
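The keep/remove rule above is a plain threshold over the batch scores. A minimal sketch (the lever IDs and the `score` field name here are illustrative, not the PR's actual schema):

```python
# Hypothetical scored levers as returned by the batch call (illustrative data).
scored = [
    {"lever_id": "L01", "score": 2},   # highly relevant  -> keep
    {"lever_id": "L02", "score": 0},   # borderline       -> remove
    {"lever_id": "L03", "score": -2},  # irrelevant       -> remove
    {"lever_id": "L04", "score": 1},   # somewhat relevant -> keep
]

def filter_levers(levers):
    """Keep levers scoring >= 1; drop levers scoring <= 0."""
    return [lv for lv in levers if lv["score"] >= 1]

kept = filter_levers(scored)
# kept contains L01 and L04
```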

Why single-call

The per-lever sequential approach (18 calls) has structural problems:

  • Position bias: lever 1 classified with no prior context; lever 18 with 17 prior decisions
  • No global view: model can't compare all levers simultaneously
  • 18x cost: growing context across 18 calls is expensive and slow

Why relevance scoring instead of category classification

Previous iterations (45, 48, 49) shuffled taxonomy labels (keep/absorb/remove → primary/secondary/remove) without significant improvement. The labels mapped 1:1 to each other — just different names for the same categories.

This PR reframes the task entirely: instead of "classify this lever into a bucket", the model answers "how relevant is this lever to this specific plan?" This is a question the model can answer by reading the project context. Duplicates naturally score lower because a redundant lever is less relevant when a better one already exists — no explicit "absorb" concept needed.

The integer score also eliminates template-lock risk (no text definition to copy verbatim).

Changes

  • deduplicate_levers.py: complete rewrite — BatchDeduplicationResult schema, single llm.as_structured_llm() call, Likert relevance prompt, OPTIMIZE_INSTRUCTIONS with pipeline context
  • enrich_potential_levers.py: accepts optional classification field (backward-compatible)
  • runner.py: calls_succeeded=1 for single batch call
  • B1 fix: user_prompt stores project_context, not serialized levers
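A rough sketch of how the batch schema and single call might fit together. The field names on `LeverDecision`/`BatchDeduplicationResult` are guesses from the commit messages, and the real code uses a structured `llm.as_structured_llm()` call rather than the stub below:

```python
from dataclasses import dataclass

# Field names are illustrative guesses at the PR's schema, not the real code.
@dataclass
class LeverDecision:
    lever_id: str
    score: int           # Likert relevance score in -2..2
    justification: str

@dataclass
class BatchDeduplicationResult:
    decisions: list      # one LeverDecision per input lever

def deduplicate(levers, call_llm):
    """One batch call scores every lever at once (no per-lever loop),
    then the integer threshold decides keep vs. remove."""
    result = call_llm(levers)   # stand-in for the single structured LLM call
    keep = {d.lever_id for d in result.decisions if d.score >= 1}
    return [lv for lv in levers if lv["lever_id"] in keep]

# Stubbed "LLM" for illustration: marks the second lever as redundant.
def fake_llm(levers):
    return BatchDeduplicationResult(decisions=[
        LeverDecision("A", 2, "core strategic choice"),
        LeverDecision("B", -1, "overlaps lever A"),
    ])

levers = [{"lever_id": "A"}, {"lever_id": "B"}]
# deduplicate(levers, fake_llm) keeps only lever A
```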

Test plan

🤖 Generated with Claude Code

neoneye and others added 6 commits March 20, 2026 21:40
Replace 18 sequential per-lever LLM calls with a single batch call.
Each lever is scored on a 5-point Likert scale:
  2 = primary (essential strategic decision)
  1 = secondary (useful but supporting)
  0 = borderline
 -1 = overlapping (absorbed by another lever)
 -2 = irrelevant (fully redundant)

Levers scoring >= 1 are kept; levers scoring <= 0 are removed.

Benefits:
- No position bias (model sees all levers simultaneously)
- Global consistency (can compare all levers before scoring)
- 18x fewer LLM calls (faster, cheaper)
- Numeric scores are more granular than categorical labels

Also:
- B1 fix: user_prompt stores project_context, not levers JSON
- enrich_potential_levers accepts optional classification field
- runner.py reports calls_succeeded=1 for single batch call
- OPTIMIZE_INSTRUCTIONS documents 6 known failure modes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Match the format from identify_potential_levers.py: goal statement,
6-step pipeline context with "you are here" marker, and structured
known-problems section.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Score already encodes the classification (2=primary, 1=secondary, <=0=remove).
OutputLever.classification is now derived via _score_to_classification()
at output time instead of being stored redundantly in LeverDecision.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Score descriptions now ask "how relevant is this lever to this specific
plan?" instead of mapping to primary/secondary/absorb/remove categories.
This shifts the cognitive task from taxonomy classification to relevance
assessment — duplicates naturally score lower because redundant levers
are less relevant to the plan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No longer needed — score is an integer, not a copyable text label.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
neoneye (Member, Author) commented Mar 20, 2026

Iteration 50 Results

Verdict: REVERT (NO)

What works

  • Batch architecture proven: 35/35 runs succeeded, all 7 models parse BatchDeduplicationResult in one call
  • Speed: 1.5–6.3× faster (silo gpt-oss-20b: 196s → 51s, parasomnia haiku: ~120s → 24s)

What's broken

  • Relevance ≠ deduplication: The prompt asks "how relevant is this lever?" but a lever can be highly relevant AND redundant with another. Capable models (gpt-oss-20b, qwen3, haiku, gemini, 4o-mini) score overlapping levers at 1–2 and keep all 18. gpt-oss-20b silo: 2 removals before → 0 after.
  • llama3.1 Likert scale inversion: On silo and gta_game, llama3.1 scores 17/18 levers as -2 while justifications say "highly relevant." Only 1 lever survives. Categorical labels can't be inverted; integers can.

Takeaway

The single-call architecture is the real win — keep it. The Likert scoring is the problem — revert to categorical primary/secondary/remove taxonomy (PR #372's schema) inside the batch call.

Full analysis: analysis/50_deduplicate_levers/

neoneye added a commit that referenced this pull request Mar 20, 2026
Combines the proven batch architecture (1 call instead of 18) with the
categorical taxonomy that produced meaningful deduplication in iter 49.

Key changes from PR #373 (Likert scoring):
- Replace Likert score with categorical primary/secondary/remove
- Reframe prompt as deduplication task, not relevance assessment
- Fixes iter 49 B2: unambiguous fallback (primary > secondary > remove)
- Fixes iter 49 B3: conditional question test for primary classification
- Fixes iter 49 S1: calibration as percentage (25-50%) not fixed count
- Add duplicate lever_id guard (keeps first, warns on duplicates)
- Add minimum lever count warning for degenerate outputs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
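The duplicate lever_id guard mentioned above (keep first, warn on repeats) could be sketched like this; the function name and decision shape are hypothetical:

```python
import warnings

def dedupe_decisions(decisions):
    """Keep the first decision per lever_id; warn when the model emits the
    same lever_id more than once (sketch of the guard, not the real code)."""
    seen = {}
    for d in decisions:
        if d["lever_id"] in seen:
            warnings.warn(f"duplicate lever_id {d['lever_id']!r}; keeping first")
            continue
        seen[d["lever_id"]] = d
    return list(seen.values())
```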
@neoneye neoneye closed this pull request by merging all changes into main in b9c9577 Mar 21, 2026
@neoneye neoneye deleted the single-call-dedup-scoring branch March 21, 2026 02:05