# Proposal 122: Deduplicate Levers — Architecture and Improvements

## Status

**Current state**: PR #375 merged (2026-03-21). Single batch call with
primary/secondary/remove taxonomy. 3x faster than baseline, all models
complete successfully. Best iteration: 52.

**This proposal**: documents the iteration journey, known issues, and
future improvement directions for the deduplicate_levers step.

---

## Pipeline Context

DeduplicateLevers is step 2 in a 6-step solution-space exploration pipeline:

1. **IdentifyPotentialLevers** — brainstorms 15-20 raw levers
2. **DeduplicateLevers** ← this step
3. **EnrichLevers** — adds description, synergy, and conflict text
4. **FocusOnVitalFewLevers** — filters down to 4-6 high-impact levers
5. **ScenarioGeneration** — builds 3 scenarios (aggressive, medium, safe)
6. **ScenarioSelection** — picks the best-fitting scenario

Step 1 intentionally over-generates. This step removes near-duplicates,
filters irrelevant levers, and tags survivors as primary (strategic) or
secondary (operational). Step 4 handles further filtering.

---

## Iteration History (iter 44-52)

Nine iterations across five PRs to reach the current state.

| Iter | PR | Architecture | Taxonomy | Verdict | Key insight |
|------|-----|-------------|----------|---------|-------------|
| 48 | — | 18 sequential calls | keep/absorb/remove | BASELINE | llama3.1 collapsed 7 levers into "Risk Framing" |
| 45 | #365 | 18 sequential calls | primary/secondary/absorb/remove (4-way) | YES (5/7) | Primary/secondary triage is the real quality gain. `remove` dead when `absorb` exists |
| 49 | #372 | 18 sequential calls | primary/secondary/remove (3-way) | YES | All 3 categories exercised. Template-lock identified |
| 50 | #373 | 1 batch call | Likert scoring (-2 to +2) | REVERT | Relevance != deduplication. llama3.1 inverted the scale |
| 51 | #374 | 1 batch call | primary/secondary/remove | YES | Batch + categorical works. llama3.1 timed out 2/5 plans |
| 52 | #375 | 1 batch call | primary/secondary/remove | YES (merged) | Shorter justifications fixed llama3.1 timeout |

### What moved the needle

1. **Single batch call** (iter 50-52): 18 calls → 1. 3x faster, no
position bias, global consistency, simpler code (190 vs 330 lines).

2. **Primary/secondary triage** (iter 45+): New downstream signal. Main
branch only had `keep` — no prioritization information.

3. **Shorter justifications** (iter 52): ~20-30 words instead of ~40-80.
Fixed the llama3.1 timeout; API model outputs became 55% shorter and the
calls 25% faster.

### What didn't move the needle

- **Taxonomy label changes**: Renaming keep→primary, absorb→remove produced
nearly identical results. Labels are interchangeable.
- **Anti-template-lock instructions**: Not needed with short categorical labels.
- **Calibration hints**: Models that remove aggressively do so regardless.
Conservative models ignore calibration guidance.

### Current metrics (iter 52 vs baseline)

| Metric | Baseline (iter 48) | Current (iter 52) |
|--------|-------------------|-------------------|
| Architecture | 18 sequential calls | 1 batch call |
| Taxonomy | keep/absorb/remove | primary/secondary/remove |
| Triage signal | None | primary 54% / secondary 31% |
| Avg kept | 13.9 / 18 | 15.6 / 18 |
| Avg removed | 4.1 / 18 (23%) | 2.4 / 18 (15%) |
| Avg duration | 120.5s | 40.3s |
| llama3.1 failures | Collapse into "Risk Framing" | None |

---

## Known Issues

### Structural issues

#### 1. The step conflates deduplication with prioritization

The current schema asks the LLM to make two decisions at once:
- Whether a lever survives deduplication (keep vs remove)
- Whether a surviving lever is strategically important (primary vs secondary)

A lever can be clearly distinct but low priority, or highly important but
partly redundant with another broader lever. By fusing these decisions,
the step creates a bias toward keeping anything that seems important, even
if it overlaps heavily with another lever.

The step drifts toward "strategic triage" rather than real overlap reduction.

#### 2. No explicit absorption structure

When a lever is removed, only a freeform justification is stored. There is
no structured `absorbed_into` field. The pipeline loses information about
which surviving lever subsumed the removed one.

Without absorption links (see the sketch after this list):
- You cannot audit whether removal was correct
- You cannot detect hierarchy reversals (narrow lever kept, general removed)
- You cannot detect chain absorptions (A→B→C where B is also removed)
- Later stages cannot recover wording or evidence from removed items
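
As a sketch of what such links would buy, suppose removals carried the
proposed `absorbed_into` pointer, collected into a mapping from removed
lever ID to absorbing lever ID (function and argument names here are
illustrative):

```python
def audit_absorptions(absorbed_into: dict[str, str], kept: set[str]) -> list[str]:
    """Flag absorption links that point at removed or unknown levers."""
    findings = []
    for removed, target in absorbed_into.items():
        if target in absorbed_into:
            # Chain absorption: A -> B where B was itself absorbed away.
            findings.append(f"chain: {removed} -> {target} -> {absorbed_into[target]}")
        elif target not in kept:
            # Dangling link: the absorbing lever is not among the survivors.
            findings.append(f"dangling: {removed} -> {target}")
    return findings
```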

#### 3. The retention bias is strong

The uncertainty rules produce predictable skew:
- Uncertain between primary and secondary → primary (promotes)
- Uncertain between keep and remove → secondary (keeps)
- Missing decisions → secondary (keeps)

The model is rewarded for being vague. The step becomes a low-risk
classifier instead of a real deduplicator.

#### 4. No survivor-overlap validation

No post-check asks whether surviving items still overlap heavily. Two
survivors might be different phrasings of the same lever, or policy,
procurement, and standards variants of one underlying mechanism.
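
A minimal sketch of such a post-check, using Jaccard overlap on lowercased
name tokens as a crude stand-in for whatever similarity measure the step
would actually use (all names here are illustrative):

```python
from itertools import combinations

def flag_similar_survivors(
    survivors: dict[str, str], threshold: float = 0.5
) -> list[tuple[str, str]]:
    """Flag survivor pairs whose name tokens overlap heavily.

    `survivors` maps lever ID -> lever name. Token overlap on names is a
    crude proxy; a real check might also compare consequences text.
    """
    tokens = {lid: set(name.lower().split()) for lid, name in survivors.items()}
    flagged = []
    for a, b in combinations(tokens, 2):
        union = tokens[a] | tokens[b]
        if union and len(tokens[a] & tokens[b]) / len(union) >= threshold:
            flagged.append((a, b))
    return flagged
```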

#### 5. No category balance awareness

The flat comparison setup encourages the model to favor narratively vivid
levers (creative, character, thematic) over less glamorous but equally
critical ones (financing, legal, operations, distribution).

#### 6. Single-batch reasoning is useful but brittle

One batch call gives global context, but one bad completion corrupts the
whole output. There is no repair pass for missing IDs, suspicious
distributions, or ambiguous overlap clusters.

### Implementation issues

#### 7. Silent failure masking

When the LLM call times out, `batch_result` stays `None`, all levers
default to secondary, and `outputs.jsonl` records `status=ok` with
`calls_succeeded=1`. Monitoring pipelines cannot detect these failures.
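
A minimal sketch of an honest record; the helper name and the `degraded`
status value are assumptions, while `batch_result`, `calls_succeeded`, and
`llm_call_succeeded` are the names used in this document:

```python
def summarize_llm_call(batch_result: object | None) -> dict:
    """Build the outputs.jsonl record so a fired fallback stays visible."""
    succeeded = batch_result is not None
    return {
        "status": "ok" if succeeded else "degraded",  # no unconditional "ok"
        "calls_succeeded": 1 if succeeded else 0,     # no hardcoded 1
        "llm_call_succeeded": succeeded,
    }
```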

#### 8. `user_prompt` field stores wrong value

`user_prompt=project_context` at line 272 stores the plan description, not
the full assembled prompt including the levers JSON. The saved artifact
cannot reconstruct the exact LLM input.

#### 9. `calls_succeeded` hardcoded

`runner.py` returns `calls_succeeded=1` regardless of whether the LLM call
succeeded or the fallback fired.

#### 10. Minimum count threshold is too low

`max(3, len(input_levers) // 4)` = 4 for 18 levers. A model removing 14/18
still clears the warning. Consider `max(5, len(input_levers) // 3)`.
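
Concretely, for 18 input levers:

```python
n = 18  # typical lever count from step 1

assert max(3, n // 4) == 4  # current: keeping just 4/18 clears the warning
assert max(5, n // 3) == 6  # proposed: removing 14/18 would now trigger it
```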

---

## Improvement Proposals

### Option A: Incremental improvements (low risk)

Keep the current single-call architecture and taxonomy. Fix implementation
issues and add validation.

**Changes:**
1. Add `absorbed_into: str | None` field to `LeverClassificationDecision`.
Required when classification is `remove` and the lever overlaps another.
Enables merge-graph validation (see the schema sketch after this list).

2. Add `remove_reason: Literal["duplicate", "subset", "irrelevant", "too_narrow"]`
to make removals auditable and categorized.

3. Add a survivor-overlap validation pass. After the main classification,
compute similarity between surviving levers (by name/consequences overlap).
If two survivors are suspiciously similar, log a warning or trigger a
focused comparison.

4. Replace `missing → secondary` with `missing → unresolved`. Send unresolved
items through a focused repair call rather than silently keeping them.

5. Fix observability: expose `llm_call_succeeded` from `DeduplicateLevers`,
emit `classification_fallback` events in `events.jsonl`, fix `user_prompt`
field, make `calls_succeeded` reflect reality.

6. Add calibration checks: warn if >70% survive (likely under-dedup) or
<35% survive (likely over-removal). Optionally trigger a repair pass.
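
A sketch of the extended decision model combining items 1, 2, and 4.
`LeverClassificationDecision`, `absorbed_into`, `remove_reason`, and the
`unresolved` value are named above; the pydantic base and the remaining
fields are assumptions about the existing schema:

```python
from typing import Literal
from pydantic import BaseModel

class LeverClassificationDecision(BaseModel):
    lever_id: str       # assumed existing field
    justification: str  # assumed existing field
    # Item 4: "unresolved" lets missing decisions route to a repair call
    # instead of silently defaulting to "secondary".
    classification: Literal["primary", "secondary", "remove", "unresolved"]
    # Item 1: required when classification == "remove" and the lever
    # overlaps a survivor; enables merge-graph validation.
    absorbed_into: str | None = None
    # Item 2: makes removals auditable and categorized.
    remove_reason: Literal["duplicate", "subset", "irrelevant", "too_narrow"] | None = None
```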

**Effort**: Medium. Each change is independent and can be shipped separately.

### Option B: Two-pass architecture (medium risk)

Separate deduplication from prioritization into two distinct passes; a
schema sketch follows the pass descriptions below.

**Pass 1 — Overlap resolution:**
- Decisions: `keep` / `absorb` / `remove`
- Only answers: does this lever survive as a distinct concept?
- Requires `absorbed_into` for absorb decisions

**Pass 2 — Prioritization of survivors:**
- Decisions: `primary` / `secondary`
- Only answers: among surviving levers, which are top-level strategic?
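
A sketch of keeping the two passes' decision types disjoint (model names
are illustrative; pydantic-style as above):

```python
from typing import Literal
from pydantic import BaseModel

class OverlapDecision(BaseModel):
    """Pass 1: does this lever survive as a distinct concept?"""
    lever_id: str
    decision: Literal["keep", "absorb", "remove"]
    absorbed_into: str | None = None  # required when decision == "absorb"

class PriorityDecision(BaseModel):
    """Pass 2: among survivors only, is this lever top-level strategic?"""
    lever_id: str
    decision: Literal["primary", "secondary"]
```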

**Benefits:**
- Cleaner separation of concerns
- Deduplication quality not contaminated by importance assessment
- Each pass is simpler for the LLM

**Risks:**
- Two LLM calls instead of one (cost, latency)
- More complex orchestration
- Pass 2 may disagree with pass 1's keep decisions

**Effort**: High. New schemas, two-call orchestration, compatibility with
existing runner and analysis pipeline.

### Option C: Cluster-based deduplication (higher risk)

Instead of item-level labels, first cluster semantically similar levers,
then pick representatives within each cluster.

**Step 1**: Group all levers into semantic clusters in one call.
**Step 2**: For each cluster with >1 lever, pick the canonical representative.
Tag as primary/secondary. Mark others as absorbed.

**Output shape:**
```json
{
  "cluster_id": "procurement-conditions",
  "canonical": "Procurement Conditionality",
  "absorbed": [
    {
      "lever_id": "...",
      "reason": "near_duplicate",
      "absorbed_into": "Procurement Conditionality"
    }
  ]
}
```
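
One check this shape makes cheap, sketched below: every absorbed lever in a
cluster must point at that cluster's own canonical (dict keys mirror the
JSON above):

```python
def check_cluster(cluster: dict) -> list[str]:
    """Validate one cluster record of the shape shown above."""
    problems = []
    for item in cluster["absorbed"]:
        if item["absorbed_into"] != cluster["canonical"]:
            # Absorption must stay inside the cluster boundary.
            problems.append(f"{item['lever_id']} absorbed outside its cluster")
    return problems
```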

**Benefits:**
- Most interpretable output
- Natural absorption structure
- Explicit nearest-neighbor reasoning within clusters

**Risks:**
- Most implementation work
- Clustering quality depends on model capability
- Output schema change affects downstream consumers

**Effort**: High. New clustering schema, new orchestration, downstream
consumer updates.

### Option D: Mechanism-based deduplication (research)

Force the model to decompose each lever into structured fields before
deduplicating:

- Target actor
- Intervention mechanism
- Expected effect
- Time horizon
- Implementation domain

Then deduplicate primarily on mechanism + actor + effect, not just
semantic similarity. This would reduce both false merges (same topic,
different mechanism) and false splits (different wording, same mechanism).
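
A sketch of the decomposition and the comparison key it implies; class and
field names here are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LeverDecomposition:
    target_actor: str
    mechanism: str
    expected_effect: str
    time_horizon: str
    implementation_domain: str

    def dedup_key(self) -> tuple[str, str, str]:
        # Deduplicate primarily on mechanism + actor + effect; horizon
        # and domain stay available for tie-breaking, not for the key.
        return (self.mechanism, self.target_actor, self.expected_effect)
```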

**Benefits:**
- Most principled approach to deduplication
- Catches "sounds similar but different mechanism" false merges

**Risks:**
- Significantly more output per lever (structured decomposition)
- May exceed token budgets for weak models
- Research-grade — untested in this pipeline

**Effort**: Very high. New decomposition schema, new comparison logic,
likely two calls minimum.

---

## Recommendation

### Near-term (next 1-2 iterations)

Implement Option A items 1-2 and 5:
- Add `absorbed_into` field for auditable removal
- Add `remove_reason` categorization
- Fix observability (silent failure masking, `user_prompt`, `calls_succeeded`)

These are low-risk, independent changes that make the step more inspectable
without changing its behavior.

### Medium-term (next 3-5 iterations)

Implement Option A items 3-4 and 6:
- Survivor-overlap validation pass
- Replace `missing → secondary` with `missing → unresolved` + repair
- Calibration checks with optional second pass

### Long-term (future consideration)

Evaluate Option B (two-pass) or Option C (cluster-based) based on whether
the incremental improvements are sufficient. The current single-call
architecture may hit a quality ceiling where one LLM call cannot reliably
do both overlap detection and prioritization well. If that happens, Option B
is the natural next step.

Option D (mechanism-based) is interesting but too expensive for the current
model roster. Worth revisiting when cheaper structured-output models become
available.

---

## Lessons Learned

### From 9 iterations of optimization

1. **Architecture matters more than taxonomy.** Changing labels (keep vs
primary, absorb vs remove) across 5 iterations produced nearly identical
results. Changing from 18 sequential calls to 1 batch call produced a
3x speedup and eliminated position bias.

2. **Relevance and deduplication are different questions.** A lever can be
highly relevant to the plan AND fully redundant with another lever.
Asking "how relevant?" (iter 50, Likert scoring) produced 0% removal
for capable models. Asking "is this redundant?" (iter 51-52, categorical)
restored deduplication.

3. **Integer scales can be inverted; categorical labels cannot.** llama3.1
scored 17/18 levers as -2 while writing "highly relevant" in
justifications. This failure mode is structurally impossible with
categorical labels.

4. **Output length directly affects model completion.** Shortening
justifications from ~40-80 words to ~20-30 words let llama3.1 finish
within timeout on all plans. Advisory length constraints work for API
models but are fragile for local models.

5. **The step currently mixes two goals.** Deduplication (overlap reduction)
and prioritization (primary vs secondary) are fused into one decision.
This creates a retention bias — anything that seems important survives,
even if it overlaps. Separating these concerns is the most promising
architectural improvement.

6. **Conservative retention is a valid design choice** for an intermediate
pipeline stage, but it has consequences: the output is noisier, later
stages inherit ambiguity, and the step no longer provides a clear
compression boundary. The current design accepts this tradeoff because
step 4 (FocusOnVitalFewLevers) handles further filtering.

7. **Single-batch reasoning is powerful but brittle.** The model sees all
levers at once (good for global consistency) but one bad completion
corrupts everything (no repair mechanism). Adding a validation pass
over survivors would catch the most common failure modes without
requiring a full architectural change.