
feat: single-call Likert scoring for deduplicate_levers #373

Merged: 6 commits merged into main from single-call-dedup-scoring, Mar 21, 2026

feat: single-call Likert scoring for deduplicate_levers#373
6 commits merged intomainfrom
single-call-dedup-scoring

Conversation

neoneye (Member) commented Mar 20, 2026

Summary

  • Replaces 18 sequential per-lever LLM calls with a single batch call
  • Each lever scored on a 5-point Likert relevance scale: how relevant is this lever to this specific plan?
    • 2 = highly relevant, 1 = somewhat relevant, 0 = borderline, -1 = low relevance, -2 = irrelevant
  • Levers scoring >= 1 kept; levers scoring <= 0 removed
  • Supersedes PR #372 (feat: simplify lever classification to primary/secondary/remove)
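The keep/remove rule above is a plain threshold over the batch scores. A minimal sketch (the lever IDs and the `score` field name here are illustrative, not the PR's actual schema):

```python
# Hypothetical scored levers as returned by the batch call (illustrative data).
scored = [
    {"lever_id": "L01", "score": 2},   # highly relevant  -> keep
    {"lever_id": "L02", "score": 0},   # borderline       -> remove
    {"lever_id": "L03", "score": -2},  # irrelevant       -> remove
    {"lever_id": "L04", "score": 1},   # somewhat relevant -> keep
]

def filter_levers(levers):
    """Keep levers scoring >= 1; drop levers scoring <= 0."""
    return [lv for lv in levers if lv["score"] >= 1]

kept = filter_levers(scored)
# kept contains L01 and L04
```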

Why single-call

The per-lever sequential approach (18 calls) has structural problems:

  • Position bias: lever 1 classified with no prior context; lever 18 with 17 prior decisions
  • No global view: model can't compare all levers simultaneously
  • 18x cost: growing context across 18 calls is expensive and slow

Why relevance scoring instead of category classification

Previous iterations (45, 48, 49) shuffled taxonomy labels (keep/absorb/remove → primary/secondary/remove) without significant improvement. The labels mapped 1:1 to each other — just different names for the same categories.

This PR reframes the task entirely: instead of "classify this lever into a bucket", the model answers "how relevant is this lever to this specific plan?" This is a question the model can answer by reading the project context. Duplicates naturally score lower because a redundant lever is less relevant when a better one already exists — no explicit "absorb" concept needed.

The integer score also eliminates template-lock risk (no text definition to copy verbatim).

Changes

  • deduplicate_levers.py: complete rewrite — BatchDeduplicationResult schema, single llm.as_structured_llm() call, Likert relevance prompt, OPTIMIZE_INSTRUCTIONS with pipeline context
  • enrich_potential_levers.py: accepts optional classification field (backward-compatible)
  • runner.py: calls_succeeded=1 for single batch call
  • B1 fix: user_prompt stores project_context, not serialized levers
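A rough sketch of how the batch schema and single call might fit together. The field names on `LeverDecision`/`BatchDeduplicationResult` are guesses from the commit messages, and the real code uses a structured `llm.as_structured_llm()` call rather than the stub below:

```python
from dataclasses import dataclass

# Field names are illustrative guesses at the PR's schema, not the real code.
@dataclass
class LeverDecision:
    lever_id: str
    score: int           # Likert relevance score in -2..2
    justification: str

@dataclass
class BatchDeduplicationResult:
    decisions: list      # one LeverDecision per input lever

def deduplicate(levers, call_llm):
    """One batch call scores every lever at once (no per-lever loop),
    then the integer threshold decides keep vs. remove."""
    result = call_llm(levers)   # stand-in for the single structured LLM call
    keep = {d.lever_id for d in result.decisions if d.score >= 1}
    return [lv for lv in levers if lv["lever_id"] in keep]

# Stubbed "LLM" for illustration: marks the second lever as redundant.
def fake_llm(levers):
    return BatchDeduplicationResult(decisions=[
        LeverDecision("A", 2, "core strategic choice"),
        LeverDecision("B", -1, "overlaps lever A"),
    ])

levers = [{"lever_id": "A"}, {"lever_id": "B"}]
# deduplicate(levers, fake_llm) keeps only lever A
```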

Test plan

🤖 Generated with Claude Code

neoneye and others added 6 commits March 20, 2026 21:40
Replace 18 sequential per-lever LLM calls with a single batch call.
Each lever is scored on a 5-point Likert scale:
  2 = primary (essential strategic decision)
  1 = secondary (useful but supporting)
  0 = borderline
 -1 = overlapping (absorbed by another lever)
 -2 = irrelevant (fully redundant)

Levers scoring >= 1 are kept; levers scoring <= 0 are removed.

Benefits:
- No position bias (model sees all levers simultaneously)
- Global consistency (can compare all levers before scoring)
- 18x fewer LLM calls (faster, cheaper)
- Numeric scores are more granular than categorical labels

Also:
- B1 fix: user_prompt stores project_context, not levers JSON
- enrich_potential_levers accepts optional classification field
- runner.py reports calls_succeeded=1 for single batch call
- OPTIMIZE_INSTRUCTIONS documents 6 known failure modes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Match the format from identify_potential_levers.py: goal statement,
6-step pipeline context with "you are here" marker, and structured
known-problems section.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Score already encodes the classification (2=primary, 1=secondary, <=0=remove).
OutputLever.classification is now derived via _score_to_classification()
at output time instead of being stored redundantly in LeverDecision.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Score descriptions now ask "how relevant is this lever to this specific
plan?" instead of mapping to primary/secondary/absorb/remove categories.
This shifts the cognitive task from taxonomy classification to relevance
assessment — duplicates naturally score lower because redundant levers
are less relevant to the plan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No longer needed — score is an integer, not a copyable text label.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
neoneye (Member, Author) commented Mar 20, 2026

Iteration 50 Results

Verdict: REVERT (NO)

What works

  • Batch architecture proven: 35/35 runs succeeded, all 7 models parse BatchDeduplicationResult in one call
  • Speed: 1.5–6.3× faster (silo gpt-oss-20b: 196s → 51s, parasomnia haiku: ~120s → 24s)

What's broken

  • Relevance ≠ deduplication: The prompt asks "how relevant is this lever?" but a lever can be highly relevant AND redundant with another. Capable models (gpt-oss-20b, qwen3, haiku, gemini, 4o-mini) score overlapping levers at 1–2 and keep all 18. gpt-oss-20b silo: 2 removals before → 0 after.
  • llama3.1 Likert scale inversion: On silo and gta_game, llama3.1 scores 17/18 levers as -2 while justifications say "highly relevant." Only 1 lever survives. Categorical labels can't be inverted; integers can.

Takeaway

The single-call architecture is the real win — keep it. The Likert scoring is the problem — revert to categorical primary/secondary/remove taxonomy (PR #372's schema) inside the batch call.

Full analysis: analysis/50_deduplicate_levers/

neoneye added a commit that referenced this pull request Mar 20, 2026
Combines the proven batch architecture (1 call instead of 18) with the
categorical taxonomy that produced meaningful deduplication in iter 49.

Key changes from PR #373 (Likert scoring):
- Replace Likert score with categorical primary/secondary/remove
- Reframe prompt as deduplication task, not relevance assessment
- Fixes iter 49 B2: unambiguous fallback (primary > secondary > remove)
- Fixes iter 49 B3: conditional question test for primary classification
- Fixes iter 49 S1: calibration as percentage (25-50%) not fixed count
- Add duplicate lever_id guard (keeps first, warns on duplicates)
- Add minimum lever count warning for degenerate outputs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
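The duplicate lever_id guard mentioned above (keep first, warn on repeats) could be sketched like this; the function name and decision shape are hypothetical:

```python
import warnings

def dedupe_decisions(decisions):
    """Keep the first decision per lever_id; warn when the model emits the
    same lever_id more than once (sketch of the guard, not the real code)."""
    seen = {}
    for d in decisions:
        if d["lever_id"] in seen:
            warnings.warn(f"duplicate lever_id {d['lever_id']!r}; keeping first")
            continue
        seen[d["lever_id"]] = d
    return list(seen.values())
```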
@neoneye neoneye closed this pull request by merging all changes into main in b9c9577 Mar 21, 2026
@neoneye neoneye deleted the single-call-dedup-scoring branch March 21, 2026 02:05