feat: batch categorical dedup — single call + primary/secondary/remove #374

Merged
7 commits merged into main from batch-categorical-dedup
Mar 21, 2026

Conversation


neoneye (Member) commented Mar 20, 2026

Summary

Combines the best of PR #372 and PR #373; supersedes both.

What failed in PR #373 (Likert scoring)

  • Relevance ≠ deduplication: models scored overlapping levers as relevant and kept them all
  • llama3.1 inverted the integer scale catastrophically (1/18 levers survived)
  • Categorical labels can't be inverted; integers can

What this PR does differently

  • Asks "Is this lever redundant?" (deduplication) not "Is this lever relevant?" (scoring)
  • Uses categorical labels (primary/secondary/remove) that can't be polarity-inverted
  • Keeps the single-batch-call architecture proven in PR #373 ("feat: single-call Likert scoring for deduplicate_levers", 35/35 success)
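Sketched in Python, the categorical decision model might look like the following. The `LeverDecision` and `lever_id` names appear in the commits below; the exact fields and the `kept_lever_ids` helper are assumptions for illustration:

```python
from dataclasses import dataclass
from enum import Enum


class Classification(str, Enum):
    """Categorical labels. Unlike a signed integer scale, there is no
    'opposite' label a model can accidentally flip the polarity of."""
    PRIMARY = "primary"
    SECONDARY = "secondary"
    REMOVE = "remove"


@dataclass
class LeverDecision:
    lever_id: str
    classification: Classification


def kept_lever_ids(decisions: list[LeverDecision]) -> list[str]:
    """Primary and secondary levers survive; only 'remove' is dropped."""
    return [
        d.lever_id
        for d in decisions
        if d.classification is not Classification.REMOVE
    ]
```

Parsing into a closed enum also means an out-of-vocabulary label from the model fails loudly at parse time instead of silently landing in the wrong bucket.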

Prompt fixes incorporated

  • B2: Unambiguous fallback — "prefer primary over secondary; prefer secondary over remove"
  • B3: Conditional question test for primary — "If this lever were handled badly, would the project fail?" (no copyable definition text)
  • S1: Calibration as percentage — "Expect to remove 25-50%" (not anchored to a fixed lever count)

Robustness guards

  • Duplicate lever_id: keeps first entry, logs warning (was silently overwriting)
  • Minimum lever count: warns if fewer than max(3, input/4) levers survive
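Both guards above are small and mechanical; a minimal sketch, assuming the response arrives as a list of dicts keyed by `lever_id` (function names are hypothetical):

```python
import logging

logger = logging.getLogger(__name__)


def dedupe_decisions(raw: list[dict]) -> dict[str, dict]:
    """Keep the first entry per lever_id and warn on duplicates,
    instead of silently overwriting earlier entries."""
    by_id: dict[str, dict] = {}
    for entry in raw:
        lever_id = entry["lever_id"]
        if lever_id in by_id:
            logger.warning("Duplicate lever_id %s: keeping first entry", lever_id)
            continue
        by_id[lever_id] = entry
    return by_id


def check_minimum_survivors(input_count: int, surviving_count: int) -> None:
    """Warn on degenerate outputs that remove almost everything."""
    minimum = max(3, input_count // 4)
    if surviving_count < minimum:
        logger.warning(
            "Only %d of %d levers survived (minimum %d)",
            surviving_count, input_count, minimum,
        )
```

For an 18-lever input the floor works out to max(3, 4) = 4 surviving levers before the warning fires.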

Test plan

🤖 Generated with Claude Code

neoneye and others added 7 commits March 20, 2026 21:40
Replace 18 sequential per-lever LLM calls with a single batch call.
Each lever is scored on a 5-point Likert scale:
  2 = primary (essential strategic decision)
  1 = secondary (useful but supporting)
  0 = borderline
 -1 = overlapping (absorbed by another lever)
 -2 = irrelevant (fully redundant)

Levers scoring >= 1 are kept; levers scoring <= 0 are removed.
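The keep/remove rule in this commit is a one-line threshold over the Likert score; a sketch (the function name is an assumption):

```python
def is_kept(score: int) -> bool:
    """Levers scoring >= 1 (primary or secondary) survive;
    scores <= 0 (borderline, overlapping, irrelevant) are removed."""
    return score >= 1
```

Note this threshold is exactly what the scale inversion later broke: a model that flips the sign of the scale flips which side of the threshold every lever lands on.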

Benefits:
- No position bias (model sees all levers simultaneously)
- Global consistency (can compare all levers before scoring)
- 18x fewer LLM calls (faster, cheaper)
- Numeric scores are more granular than categorical labels

Also:
- B1 fix: user_prompt stores project_context, not levers JSON
- enrich_potential_levers accepts optional classification field
- runner.py reports calls_succeeded=1 for single batch call
- OPTIMIZE_INSTRUCTIONS documents 6 known failure modes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Match the format from identify_potential_levers.py: goal statement,
6-step pipeline context with "you are here" marker, and structured
known-problems section.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Score already encodes the classification (2=primary, 1=secondary, <=0=remove).
OutputLever.classification is now derived via _score_to_classification()
at output time instead of being stored redundantly in LeverDecision.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Score descriptions now ask "how relevant is this lever to this specific
plan?" instead of mapping to primary/secondary/absorb/remove categories.
This shifts the cognitive task from taxonomy classification to relevance
assessment — duplicates naturally score lower because redundant levers
are less relevant to the plan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No longer needed — score is an integer, not a copyable text label.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Combines the proven batch architecture (1 call instead of 18) with the
categorical taxonomy that produced meaningful deduplication in iter 49.

Key changes from PR #373 (Likert scoring):
- Replace Likert score with categorical primary/secondary/remove
- Reframe prompt as deduplication task, not relevance assessment
- Fixes iter 49 B2: unambiguous fallback (primary > secondary > remove)
- Fixes iter 49 B3: conditional question test for primary classification
- Fixes iter 49 S1: calibration as percentage (25-50%) not fixed count
- Add duplicate lever_id guard (keeps first, warns on duplicates)
- Add minimum lever count warning for degenerate outputs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

neoneye commented Mar 20, 2026

Iteration 51 Results

Verdict: YES (KEEP)

Deduplication restored

llama3.1 scale inversion eliminated

  • Categorical labels can't be polarity-inverted (was 1/18 kept under Likert)
  • llama3.1 now times out on 2/5 plans (silo, parasomnia) → safe fallback (18/18 primary)
  • On 3/5 plans, llama3.1 produces valid output (e.g., gta_game: 17/18 kept, 1 proper remove)

Speed preserved

Remaining issues (not blockers)

  • llama3.1 120s timeout too short for 2/5 plans (config fix, not code)
  • Silent failure masking: timeout records status=ok (observability improvement for follow-up)
  • Removal rates below 25-50% calibration target for most models (prompt tuning opportunity)

Full analysis: analysis/51_deduplicate_levers/

@neoneye neoneye closed this pull request by merging all changes into main in b9c9577 Mar 21, 2026
@neoneye neoneye deleted the batch-categorical-dedup branch March 21, 2026 02:06
