# Proposal 122: Deduplicate Levers — Architecture and Improvements

## Status

**Current state**: PR #375 merged (2026-03-21). Single batch call with
primary/secondary/remove taxonomy. 3x faster than baseline, all models
complete successfully. Best iteration: 52.

**This proposal**: documents the iteration journey, known issues, and
future improvement directions for the deduplicate_levers step.

---

## Pipeline Context

DeduplicateLevers is step 2 in a 6-step solution-space exploration pipeline:

1. **IdentifyPotentialLevers** — brainstorms 15-20 raw levers
2. **DeduplicateLevers** ← this step
3. **EnrichLevers** — adds description, synergy, and conflict text
4. **FocusOnVitalFewLevers** — filters down to 4-6 high-impact levers
5. **ScenarioGeneration** — builds 3 scenarios (aggressive, medium, safe)
6. **ScenarioSelection** — picks the best-fitting scenario

Step 1 intentionally over-generates. This step removes near-duplicates,
filters irrelevant levers, and tags survivors as primary (strategic) or
secondary (operational). Step 4 handles further filtering.

---

## Iteration History (iter 44-52)

Nine iterations across five PRs to reach the current state.

| Iter | PR | Architecture | Taxonomy | Verdict | Key insight |
|------|-----|-------------|----------|---------|-------------|
| 48 | — | 18 sequential calls | keep/absorb/remove | BASELINE | llama3.1 collapsed 7 levers into "Risk Framing" |
| 45 | #365 | 18 sequential calls | primary/secondary/absorb/remove (4-way) | YES (5/7) | Primary/secondary triage is the real quality gain. `remove` dead when `absorb` exists |
| 49 | #372 | 18 sequential calls | primary/secondary/remove (3-way) | YES | All 3 categories exercised. Template-lock identified |
| 50 | #373 | 1 batch call | Likert scoring (-2 to +2) | REVERT | Relevance != deduplication. llama3.1 inverted the scale |
| 51 | #374 | 1 batch call | primary/secondary/remove | YES | Batch + categorical works. llama3.1 timed out 2/5 plans |
| 52 | #375 | 1 batch call | primary/secondary/remove | YES (merged) | Shorter justifications fixed llama3.1 timeout |

### What moved the needle

1. **Single batch call** (iter 50-52): 18 calls → 1. 3x faster, no
position bias, global consistency, simpler code (190 vs 330 lines).

2. **Primary/secondary triage** (iter 45+): New downstream signal. Main
branch only had `keep` — no prioritization information.

3. **Shorter justifications** (iter 52): ~20-30 words instead of ~40-80.
Fixed the llama3.1 timeout; API model outputs became 55% shorter and the
calls 25% faster.

### What didn't move the needle

- **Taxonomy label changes**: Renaming keep→primary, absorb→remove produced
nearly identical results. Labels are interchangeable.
- **Anti-template-lock instructions**: Not needed with short categorical labels.
- **Calibration hints**: Models that remove aggressively do so regardless.
Conservative models ignore calibration guidance.

### Current metrics (iter 52 vs baseline)

| Metric | Baseline (iter 48) | Current (iter 52) |
|--------|-------------------|-------------------|
| Architecture | 18 sequential calls | 1 batch call |
| Taxonomy | keep/absorb/remove | primary/secondary/remove |
| Triage signal | None | primary 54% / secondary 31% |
| Avg kept | 13.9 / 18 | 15.6 / 18 |
| Avg removed | 4.1 / 18 (23%) | 2.4 / 18 (15%) |
| Avg duration | 120.5s | 40.3s |
| llama3.1 failures | Collapse into "Risk Framing" | None |

---

## Known Issues

### Structural issues

#### 1. The step conflates deduplication with prioritization

The current schema asks the LLM to make two decisions at once:
- Whether a lever survives deduplication (keep vs remove)
- Whether a surviving lever is strategically important (primary vs secondary)

A lever can be clearly distinct but low priority, or highly important but
partly redundant with another broader lever. By fusing these decisions,
the step creates a bias toward keeping anything that seems important, even
if it overlaps heavily with another lever.

The step drifts toward "strategic triage" rather than real overlap reduction.

#### 2. No explicit absorption structure

When a lever is removed, only a freeform justification is stored. There is
no structured `absorbed_into` field. The pipeline loses information about
which surviving lever subsumed the removed one.

Without absorption links (see the sketch after this list):
- You cannot audit whether removal was correct
- You cannot detect hierarchy reversals (narrow lever kept, general removed)
- You cannot detect chain absorptions (A→B→C where B is also removed)
- Later stages cannot recover wording or evidence from removed items
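
As a sketch of what such links would buy, suppose removals carried the
proposed `absorbed_into` pointer, collected into a mapping from removed
lever ID to absorbing lever ID (function and argument names here are
illustrative):

```python
def audit_absorptions(absorbed_into: dict[str, str], kept: set[str]) -> list[str]:
    """Flag absorption links that point at removed or unknown levers."""
    findings = []
    for removed, target in absorbed_into.items():
        if target in absorbed_into:
            # Chain absorption: A -> B where B was itself absorbed away.
            findings.append(f"chain: {removed} -> {target} -> {absorbed_into[target]}")
        elif target not in kept:
            # Dangling link: the absorbing lever is not among the survivors.
            findings.append(f"dangling: {removed} -> {target}")
    return findings
```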

#### 3. The retention bias is strong

The uncertainty rules produce predictable skew:
- Uncertain between primary and secondary → primary (promotes)
- Uncertain between keep and remove → secondary (keeps)
- Missing decisions → secondary (keeps)

The model is rewarded for being vague. The step becomes a low-risk
classifier instead of a real deduplicator.

#### 4. No survivor-overlap validation

No post-check asks whether surviving items still overlap heavily. Two
survivors might be different phrasings of the same lever, or policy,
procurement, and standards variants of one underlying mechanism.
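
A minimal sketch of such a post-check, using Jaccard overlap on lowercased
name tokens as a crude stand-in for whatever similarity measure the step
would actually use (all names here are illustrative):

```python
from itertools import combinations

def flag_similar_survivors(
    survivors: dict[str, str], threshold: float = 0.5
) -> list[tuple[str, str]]:
    """Flag survivor pairs whose name tokens overlap heavily.

    `survivors` maps lever ID -> lever name. Token overlap on names is a
    crude proxy; a real check might also compare consequences text.
    """
    tokens = {lid: set(name.lower().split()) for lid, name in survivors.items()}
    flagged = []
    for a, b in combinations(tokens, 2):
        union = tokens[a] | tokens[b]
        if union and len(tokens[a] & tokens[b]) / len(union) >= threshold:
            flagged.append((a, b))
    return flagged
```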

#### 5. No category balance awareness

The flat comparison setup encourages the model to favor narratively vivid
levers (creative, character, thematic) over less glamorous but equally
critical ones (financing, legal, operations, distribution).

#### 6. Single-batch reasoning is useful but brittle

One batch call gives global context, but one bad completion corrupts the
whole output. There is no repair pass for missing IDs, suspicious
distributions, or ambiguous overlap clusters.

### Implementation issues

#### 7. Silent failure masking

When the LLM call times out, `batch_result` stays `None`, all levers
default to secondary, and `outputs.jsonl` records `status=ok` with
`calls_succeeded=1`. Monitoring pipelines cannot detect these failures.
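
A minimal sketch of an honest record; the helper name and the `degraded`
status value are assumptions, while `batch_result`, `calls_succeeded`, and
`llm_call_succeeded` are the names used in this document:

```python
def summarize_llm_call(batch_result: object | None) -> dict:
    """Build the outputs.jsonl record so a fired fallback stays visible."""
    succeeded = batch_result is not None
    return {
        "status": "ok" if succeeded else "degraded",  # no unconditional "ok"
        "calls_succeeded": 1 if succeeded else 0,     # no hardcoded 1
        "llm_call_succeeded": succeeded,
    }
```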

#### 8. `user_prompt` field stores wrong value

`user_prompt=project_context` at line 272 stores the plan description, not
the full assembled prompt including the levers JSON. The saved artifact
cannot reconstruct the exact LLM input.

#### 9. `calls_succeeded` hardcoded

`runner.py` returns `calls_succeeded=1` regardless of whether the LLM call
succeeded or the fallback fired.

#### 10. Minimum count threshold is too low

`max(3, len(input_levers) // 4)` = 4 for 18 levers. A model removing 14/18
still clears the warning. Consider `max(5, len(input_levers) // 3)`.
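
Concretely, for 18 input levers:

```python
n = 18  # typical lever count from step 1

assert max(3, n // 4) == 4  # current: keeping just 4/18 clears the warning
assert max(5, n // 3) == 6  # proposed: removing 14/18 would now trigger it
```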

---

## Improvement Proposals

### Option A: Incremental improvements (low risk)

Keep the current single-call architecture and taxonomy. Fix implementation
issues and add validation.

**Changes:**
1. Add `absorbed_into: str | None` field to `LeverClassificationDecision`.
Required when classification is `remove` and the lever overlaps another.
Enables merge-graph validation (see the schema sketch after this list).

2. Add `remove_reason: Literal["duplicate", "subset", "irrelevant", "too_narrow"]`
to make removals auditable and categorized.

3. Add a survivor-overlap validation pass. After the main classification,
compute similarity between surviving levers (by name/consequences overlap).
If two survivors are suspiciously similar, log a warning or trigger a
focused comparison.

4. Replace `missing → secondary` with `missing → unresolved`. Send unresolved
items through a focused repair call rather than silently keeping them.

5. Fix observability: expose `llm_call_succeeded` from `DeduplicateLevers`,
emit `classification_fallback` events in `events.jsonl`, fix `user_prompt`
field, make `calls_succeeded` reflect reality.

6. Add calibration checks: warn if >70% survive (likely under-dedup) or
<35% survive (likely over-removal). Optionally trigger a repair pass.
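
A sketch of the extended decision model combining items 1, 2, and 4.
`LeverClassificationDecision`, `absorbed_into`, `remove_reason`, and the
`unresolved` value are named above; the pydantic base and the remaining
fields are assumptions about the existing schema:

```python
from typing import Literal
from pydantic import BaseModel

class LeverClassificationDecision(BaseModel):
    lever_id: str       # assumed existing field
    justification: str  # assumed existing field
    # Item 4: "unresolved" lets missing decisions route to a repair call
    # instead of silently defaulting to "secondary".
    classification: Literal["primary", "secondary", "remove", "unresolved"]
    # Item 1: required when classification == "remove" and the lever
    # overlaps a survivor; enables merge-graph validation.
    absorbed_into: str | None = None
    # Item 2: makes removals auditable and categorized.
    remove_reason: Literal["duplicate", "subset", "irrelevant", "too_narrow"] | None = None
```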

**Effort**: Medium. Each change is independent and can be shipped separately.

### Option B: Two-pass architecture (medium risk)

Separate deduplication from prioritization into two distinct passes; a
schema sketch follows the pass descriptions below.

**Pass 1 — Overlap resolution:**
- Decisions: `keep` / `absorb` / `remove`
- Only answers: does this lever survive as a distinct concept?
- Requires `absorbed_into` for absorb decisions

**Pass 2 — Prioritization of survivors:**
- Decisions: `primary` / `secondary`
- Only answers: among surviving levers, which are top-level strategic?
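
A sketch of keeping the two passes' decision types disjoint (model names
are illustrative; pydantic-style as above):

```python
from typing import Literal
from pydantic import BaseModel

class OverlapDecision(BaseModel):
    """Pass 1: does this lever survive as a distinct concept?"""
    lever_id: str
    decision: Literal["keep", "absorb", "remove"]
    absorbed_into: str | None = None  # required when decision == "absorb"

class PriorityDecision(BaseModel):
    """Pass 2: among survivors only, is this lever top-level strategic?"""
    lever_id: str
    decision: Literal["primary", "secondary"]
```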

**Benefits:**
- Cleaner separation of concerns
- Deduplication quality not contaminated by importance assessment
- Each pass is simpler for the LLM

**Risks:**
- Two LLM calls instead of one (cost, latency)
- More complex orchestration
- Pass 2 may disagree with pass 1's keep decisions

**Effort**: High. New schemas, two-call orchestration, compatibility with
existing runner and analysis pipeline.

### Option C: Cluster-based deduplication (higher risk)

Instead of item-level labels, first cluster semantically similar levers,
then pick representatives within each cluster.

**Step 1**: Group all levers into semantic clusters in one call.
**Step 2**: For each cluster with >1 lever, pick the canonical representative.
Tag as primary/secondary. Mark others as absorbed.

**Output shape:**
```json
{
  "cluster_id": "procurement-conditions",
  "canonical": "Procurement Conditionality",
  "absorbed": [
    {
      "lever_id": "...",
      "reason": "near_duplicate",
      "absorbed_into": "Procurement Conditionality"
    }
  ]
}
```
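
One check this shape makes cheap, sketched below: every absorbed lever in a
cluster must point at that cluster's own canonical (dict keys mirror the
JSON above):

```python
def check_cluster(cluster: dict) -> list[str]:
    """Validate one cluster record of the shape shown above."""
    problems = []
    for item in cluster["absorbed"]:
        if item["absorbed_into"] != cluster["canonical"]:
            # Absorption must stay inside the cluster boundary.
            problems.append(f"{item['lever_id']} absorbed outside its cluster")
    return problems
```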

**Benefits:**
- Most interpretable output
- Natural absorption structure
- Explicit nearest-neighbor reasoning within clusters

**Risks:**
- Most implementation work
- Clustering quality depends on model capability
- Output schema change affects downstream consumers

**Effort**: High. New clustering schema, new orchestration, downstream
consumer updates.

### Option D: Mechanism-based deduplication (research)

Force the model to decompose each lever into structured fields before
deduplicating:

- Target actor
- Intervention mechanism
- Expected effect
- Time horizon
- Implementation domain

Then deduplicate primarily on mechanism + actor + effect, not just
semantic similarity. This would reduce both false merges (same topic,
different mechanism) and false splits (different wording, same mechanism).
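
A sketch of the decomposition and the comparison key it implies; class and
field names here are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LeverDecomposition:
    target_actor: str
    mechanism: str
    expected_effect: str
    time_horizon: str
    implementation_domain: str

    def dedup_key(self) -> tuple[str, str, str]:
        # Deduplicate primarily on mechanism + actor + effect; horizon
        # and domain stay available for tie-breaking, not for the key.
        return (self.mechanism, self.target_actor, self.expected_effect)
```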

**Benefits:**
- Most principled approach to deduplication
- Catches "sounds similar but different mechanism" false merges

**Risks:**
- Significantly more output per lever (structured decomposition)
- May exceed token budgets for weak models
- Research-grade — untested in this pipeline

**Effort**: Very high. New decomposition schema, new comparison logic,
likely two calls minimum.

---

## Recommendation

### Near-term (next 1-2 iterations)

Implement Option A items 1-2 and 5:
- Add `absorbed_into` field for auditable removal
- Add `remove_reason` categorization
- Fix observability (silent failure masking, `user_prompt`, `calls_succeeded`)

These are low-risk, independent changes that make the step more inspectable
without changing its behavior.

### Medium-term (next 3-5 iterations)

Implement Option A items 3-4 and 6:
- Survivor-overlap validation pass
- Replace `missing → secondary` with `missing → unresolved` + repair
- Calibration checks with optional second pass

### Long-term (future consideration)

Evaluate Option B (two-pass) or Option C (cluster-based) based on whether
the incremental improvements are sufficient. The current single-call
architecture may hit a quality ceiling where one LLM call cannot reliably
do both overlap detection and prioritization well. If that happens, Option B
is the natural next step.

Option D (mechanism-based) is interesting but too expensive for the current
model roster. Worth revisiting when cheaper structured-output models become
available.

---

## Lessons Learned

### From 9 iterations of optimization

1. **Architecture matters more than taxonomy.** Changing labels (keep vs
primary, absorb vs remove) across 5 iterations produced nearly identical
results. Changing from 18 sequential calls to 1 batch call produced a
3x speedup and eliminated position bias.

2. **Relevance and deduplication are different questions.** A lever can be
highly relevant to the plan AND fully redundant with another lever.
Asking "how relevant?" (iter 50, Likert scoring) produced 0% removal
for capable models. Asking "is this redundant?" (iter 51-52, categorical)
restored deduplication.

3. **Integer scales can be inverted; categorical labels cannot.** llama3.1
scored 17/18 levers as -2 while writing "highly relevant" in
justifications. This failure mode is structurally impossible with
categorical labels.

4. **Output length directly affects model completion.** Shortening
justifications from ~40-80 words to ~20-30 words let llama3.1 finish
within timeout on all plans. Advisory length constraints work for API
models but are fragile for local models.

5. **The step currently mixes two goals.** Deduplication (overlap reduction)
and prioritization (primary vs secondary) are fused into one decision.
This creates a retention bias — anything that seems important survives,
even if it overlaps. Separating these concerns is the most promising
architectural improvement.

6. **Conservative retention is a valid design choice** for an intermediate
pipeline stage, but it has consequences: the output is noisier, later
stages inherit ambiguity, and the step no longer provides a clear
compression boundary. The current design accepts this tradeoff because
step 4 (FocusOnVitalFewLevers) handles further filtering.

7. **Single-batch reasoning is powerful but brittle.** The model sees all
levers at once (good for global consistency) but one bad completion
corrupts everything (no repair mechanism). Adding a validation pass
over survivors would catch the most common failure modes without
requiring a full architectural change.