feat: broaden remove to include irrelevant levers, shorten justification #375
Merged
Conversation
Replace 18 sequential per-lever LLM calls with a single batch call. Each lever is scored on a 5-point Likert scale:

- 2 = primary (essential strategic decision)
- 1 = secondary (useful but supporting)
- 0 = borderline
- -1 = overlapping (absorbed by another lever)
- -2 = irrelevant (fully redundant)

Levers scoring >= 1 are kept; levers scoring <= 0 are removed.

Benefits:
- No position bias (the model sees all levers simultaneously)
- Global consistency (all levers can be compared before scoring)
- 18x fewer LLM calls (faster, cheaper)
- Numeric scores are more granular than categorical labels

Also:
- B1 fix: user_prompt stores project_context, not levers JSON
- enrich_potential_levers accepts an optional classification field
- runner.py reports calls_succeeded=1 for the single batch call
- OPTIMIZE_INSTRUCTIONS documents 6 known failure modes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
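The scoring-and-threshold rule above can be sketched in a few lines. This is an illustrative reconstruction, not the PR's actual code: the `Lever` dataclass and `partition_levers` helper are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Lever:
    lever_id: str
    score: int  # Likert score in [-2, 2] from the single batch LLM call

def partition_levers(levers):
    """Keep levers scoring >= 1; remove levers scoring <= 0."""
    kept = [l for l in levers if l.score >= 1]
    removed = [l for l in levers if l.score <= 0]
    return kept, removed

levers = [Lever("a", 2), Lever("b", 0), Lever("c", -2), Lever("d", 1)]
kept, removed = partition_levers(levers)
print([l.lever_id for l in kept])     # ['a', 'd']
print([l.lever_id for l in removed])  # ['b', 'c']
```

Because all 18 levers arrive in one response, the partition is a single pass over the batch rather than 18 independent keep/remove decisions.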
Match the format from identify_potential_levers.py: goal statement, 6-step pipeline context with "you are here" marker, and structured known-problems section. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Score already encodes the classification (2=primary, 1=secondary, <=0=remove). OutputLever.classification is now derived via _score_to_classification() at output time instead of being stored redundantly in LeverDecision. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Score descriptions now ask "how relevant is this lever to this specific plan?" instead of mapping to primary/secondary/absorb/remove categories. This shifts the cognitive task from taxonomy classification to relevance assessment — duplicates naturally score lower because redundant levers are less relevant to the plan. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No longer needed — score is an integer, not a copyable text label. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Combines the proven batch architecture (1 call instead of 18) with the categorical taxonomy that produced meaningful deduplication in iter 49.

Key changes from PR #373 (Likert scoring):
- Replace Likert score with categorical primary/secondary/remove
- Reframe prompt as a deduplication task, not relevance assessment
- Fixes iter 49 B2: unambiguous fallback (primary > secondary > remove)
- Fixes iter 49 B3: conditional-question test for primary classification
- Fixes iter 49 S1: calibration as a percentage (25-50%), not a fixed count
- Add duplicate lever_id guard (keeps first, warns on duplicates)
- Add minimum lever count warning for degenerate outputs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
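The duplicate lever_id guard mentioned above could look like the following. This is a hedged sketch under assumed names (`dedupe_decisions`, decisions as dicts keyed by `lever_id`), not the PR's implementation.

```python
import warnings

def dedupe_decisions(decisions):
    """Keep the first decision per lever_id; warn on later duplicates."""
    seen = {}
    for d in decisions:
        lid = d["lever_id"]
        if lid in seen:
            warnings.warn(f"duplicate lever_id {lid!r}; keeping first")
            continue
        seen[lid] = d
    # dicts preserve insertion order, so first-occurrence order is kept
    return list(seen.values())

decisions = [{"lever_id": "a"}, {"lever_id": "b"}, {"lever_id": "a"}]
print([d["lever_id"] for d in dedupe_decisions(decisions)])  # ['a', 'b']
```

Keeping the first occurrence and warning (rather than raising) matches the defensive posture of the other iter 49 fixes: a malformed batch response degrades gracefully instead of failing the whole step.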
…ation

- Remove classification now covers: redundant, overlapping, OR irrelevant to the plan (upstream may generate levers that don't apply)
- Justification target shortened from ~40-80 words to ~40 words

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… general one

When two levers overlap or one is a subset, keep whichever better captures the strategic decision rather than always preferring the more general one.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Levers the LLM failed to classify should not be elevated to primary. Secondary keeps them safe from removal without inflating the primary count. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
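The fallback described here can be sketched as a small normalizer. Assumed names throughout (`classify_with_fallback`, the `VALID` set); the point is only that an unparseable or missing label becomes "secondary", never "primary".

```python
from typing import Optional

VALID = {"primary", "secondary", "remove"}

def classify_with_fallback(raw: Optional[str]) -> str:
    """Map an LLM-provided label to a valid classification.

    Unrecognized or missing labels fall back to 'secondary': the lever
    is kept (safe from removal) without inflating the primary count.
    """
    cls = (raw or "").strip().lower()
    return cls if cls in VALID else "secondary"

print(classify_with_fallback("Primary"))  # primary
print(classify_with_fallback(None))       # secondary
print(classify_with_fallback("???"))      # secondary
```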
The step now does more than deduplication — it also removes irrelevant levers and classifies survivors as primary/secondary. "Triaging" better describes the full scope. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Less output per lever should help llama3.1 complete within timeout for all 18 levers in a single batch call. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Iteration 52 Results

Verdict: YES (KEEP). Main win: llama3.1 timeouts eliminated.
Performance
Dedup quality
Documentation debt
Full analysis: analysis/52_deduplicate_levers/
Comparison: Iter 52 (PR #375) vs Iter 48 (main baseline)

Full comparison across all 70 output files and 14 events files.

Speed: 3.0x overall
Lever retention: +12%
New signal: primary/secondary triage
Architecture
Full data: comparison_iter48_vs_iter52.md
This was referenced Mar 21, 2026
Summary
Incremental refinements on PR #374 (batch categorical dedup):
- remove now covers redundant, overlapping, subset, OR irrelevant
- BatchDeduplicationResult → DeduplicationResult

Supersedes PR #374.
Test plan
- Ran the deduplicate_levers step via the self-improve runner against snapshot input

🤖 Generated with Claude Code