🪟 Working on Context Window Optimization for 3.1! #731
---
That's significant. My PAI 3.0 plus my own additions loads about 32% of my Opus budget. It's performing exceptionally well, but I always need to plan around context rot so I avoid approaching 80% or more of the context budget.
---
I was just looking at this, Daniel. Thanks for your work on it! Question: can we lazy load via skill?
---
Great to see this happening. We've been working on the same problem from a different angle and have some analysis that might be useful input for Algorithm 2.0.

**Profiling v1.8.0**

We profiled every section of the current Algorithm and classified each by whether it requires LLM reasoning or is purely procedural. Only 12% genuinely requires LLM reasoning. The rest is rules a hook could enforce, data a file could store, or formatting a template could handle.

**Key finding: positional reinforcement**

We initially flagged the Quality Gate block appearing 3x as waste. On closer analysis, the inline copies in OBSERVE and PLAN serve as positional reinforcement: LLMs attend more strongly to instructions physically near where they need to be applied. The standalone reference copy is removable, but the inline copies are intentional design. Worth preserving this principle in 2.0.

**Designer Intent check (PR #742)**

This finding led us to propose a small addition to THINK's pressure test.

**Connection to lazy loading**

@timgrote's suggestion about lazy loading via skill aligns with our ContextRouter approach (issue #690). We built a working tiered loader that cuts greeting context by 71% and standard tasks by 28% by injecting only the Algorithm tiers needed per-prompt; a sketch of the hook shape is below. Happy to share implementation details if useful for 3.1.

**Our optimization strategies (on hold pending 2.0)**

We had four strategies planned: dedup (~920 tokens), externalize reference data (~3,400 tokens), hook-based enforcement (~4,000 tokens), and multi-model routing for mechanical operations. Putting implementation on hold since 2.0 will likely address the same structural issues differently. Happy to contribute analysis or testing.
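Here's a minimal sketch of the hook shape, assuming Claude Code's UserPromptSubmit contract (prompt JSON on stdin, additionalContext returned as JSON). The tier files and the keyword heuristic are illustrative stand-ins, not our actual rules:

```ts
#!/usr/bin/env bun
// Tiered context loader sketch (illustrative, not ContextRouter's real rules).
// UserPromptSubmit hook: classify the prompt, inject only the tier it needs.
import { readFileSync } from "fs";

const TIERS: Record<string, string[]> = {
  greeting: [],                                             // no Algorithm injected
  standard: ["algorithm/core.md"],                          // basic phases only
  extended: ["algorithm/core.md", "algorithm/advanced.md"], // + constraint extraction, PRD decomposition
};

function classify(prompt: string): string {
  if (/^(hi|hey|hello|thanks|thank you)\b/i.test(prompt.trim())) return "greeting";
  if (/\b(architect|decompose|prd|multi-step|refactor)\b/i.test(prompt)) return "extended";
  return "standard";
}

const input = JSON.parse(readFileSync(0, "utf8")); // hook payload on stdin
const files = TIERS[classify(input.prompt ?? "")];
const context = files.map((f) => readFileSync(f, "utf8")).join("\n\n");

// Empty additionalContext for greetings means nothing extra is injected.
console.log(JSON.stringify({
  hookSpecificOutput: {
    hookEventName: "UserPromptSubmit",
    additionalContext: context,
  },
}));
```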
---
One more thought on optimization. Beyond reducing Algorithm size, there's a complementary angle: which model runs which operation. Not everything the Algorithm does requires frontier-model reasoning. During our profiling, we classified operations by reasoning demand.

This doesn't reduce the context window; it reduces cost per Algorithm run by ~25-35% by routing mechanical operations to cheaper models via hooks and subagents. It might be worth considering as part of the 2.0 architecture: if the Algorithm is already getting modular at 274 lines, the infrastructure to route different phases to different models could be a natural extension (a sketch of the routing shape is below). Happy to share our implementation analysis if useful. Holding off on building it ourselves until 2.0 lands.
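To make the routing idea concrete, a minimal sketch of the table we have in mind. Operation names, tier assignments, and model IDs are all placeholders; the real split would come from the profiling described above:

```ts
// Model routing sketch: map each Algorithm operation class to the cheapest
// model tier that can handle it. All names below are placeholders.
type Tier = "frontier" | "mid" | "small";

const ROUTES: Record<string, Tier> = {
  "pressure-test-plan": "frontier",  // genuine multi-step reasoning
  "summarize-for-voice": "mid",      // summarization, not reasoning
  "format-tab-title": "small",       // mechanical templating
  "check-quality-gate": "small",     // rule checking a hook could also do
};

const MODELS: Record<Tier, string> = {
  frontier: "opus-model-id",         // substitute real model IDs here
  mid: "sonnet-model-id",
  small: "haiku-model-id",
};

// Unknown operations fall back to the safest (most capable) tier.
export function modelFor(operation: string): string {
  return MODELS[ROUTES[operation] ?? "frontier"];
}
```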
---
Yeah, this is great analysis and I appreciate it, but determining what goes into the voice analysis and tab titles and stuff like that still involves summarization of content, which is still inference. I do hear your point on looking for places where it can be code and where we can use lower models; it's a great point. That's why I have three levels of inference in the Inference tool. You should be using the Inference tool whenever you can, because it uses the built-in subscription.
---
Yeah, unfortunately I reverted back to 2.24 the next day because my context window bloated so badly on the 3.0 version. There was one feature that I eventually grabbed from the 3.0 upgrade and put back into my 2.24. I don't remember what it's called; it was part of the stage seven verify, where the system generates specific verification steps. This was very useful and only 0.2% of the context window by itself.
---
Context size fix on the way!
---
**Update: ContextRouter measured results (4 days, 94 prompts)**

Promised to share real numbers once we had them. Our tiered context loader (issue #690) has been running since Feb 18.

Blended result: ~36% effective reduction across all prompts, with an estimated ~700K tokens saved over the 4-day period (a quick sanity check of how the blend works is sketched below). The standard tier does the heavy lifting: most prompts don't need the advanced Algorithm phases (constraint extraction, PRD decomposition, etc.), so we skip injecting them. Extended kicks in when the CapabilityRecommender classifies a task as complex.

This is a stopgap until Algorithm 2.0 lands; Daniel's 274-line rewrite will likely make the tiered loader unnecessary. But for anyone running 1.8.0 and hitting context pressure, the approach works. Implementation details in issue #690 if anyone wants to try it.
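For anyone reproducing the blended figure: it's a share-weighted average of the per-tier reductions. The tier shares below are invented for illustration; only the ~71% and ~28% per-tier reductions come from our measurements:

```ts
// Blended reduction = sum over tiers of (share of prompts) x (reduction).
// Shares here are illustrative; only the per-tier reductions are measured.
const tiers = [
  { name: "greeting", share: 0.15, reduction: 0.71 },
  { name: "standard", share: 0.70, reduction: 0.28 },
  { name: "extended", share: 0.15, reduction: 0.00 }, // full context injected
];

const blended = tiers.reduce((sum, t) => sum + t.share * t.reduction, 0);
console.log(`blended reduction: ${(blended * 100).toFixed(1)}%`);
// 0.15*0.71 + 0.70*0.28 + 0.15*0 = 0.3025 -> ~30% with these example shares
```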
---
Relevant data point for the v3.1 optimization work: the context size problem is compounded by upstream Claude Code bugs. Claude Code's compaction process treats hook-injected context as ordinary conversation history, so critical instructions can be summarized away during compaction. Separately, Opus 4.6 has a documented instruction-following regression (#24991 — 92/100 → 38/100 on the same model ID). Even when instructions survive compaction, the model may deprioritize them.

I compiled the full upstream evidence here: #792

For v3.1, this suggests context size reduction alone isn't sufficient. The architecture also needs a way to re-inject critical context after compaction and/or enforce key behaviors programmatically rather than via text instructions. A sketch of one possible re-injection hook follows.
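One possible mitigation shape, assuming Claude Code's SessionStart hook fires with source === "compact" after a compaction and accepts additionalContext in its JSON output. The file path is illustrative:

```ts
#!/usr/bin/env bun
// Post-compaction re-injection sketch. Assumes the SessionStart hook
// contract described above; not a confirmed PAI mechanism.
import { readFileSync } from "fs";

const input = JSON.parse(readFileSync(0, "utf8")); // hook payload on stdin

if (input.source === "compact") {
  // Re-inject the instructions most likely to be summarized away.
  const critical = readFileSync("PAI/critical-invariants.md", "utf8");
  console.log(JSON.stringify({
    hookSpecificOutput: {
      hookEventName: "SessionStart",
      additionalContext: critical,
    },
  }));
}
```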
---
Closing — the context window optimization goal described here was achieved in PAI v4.0.0. This was the whole point of v4.0: lean and mean. Results are in the full release notes.
---
**Hidden context cost: Algorithm file re-reads + settings.json reads per phase**

Found a significant context waste pattern in v4.0's lazy-loading architecture that's worth sharing.

**The problem**

v4.0 moved the Algorithm from embedded-in-SKILL.md to lazy-loaded from a separate file. CLAUDE.md says "MANDATORY FIRST ACTION: Read PAI/Algorithm/v3.5.0.md." This fires on every ALGORITHM-classified turn. The issue: Read tool outputs accumulate in conversation history. Each Read result is a message that persists for all future API calls.

With the Algorithm-default ModeClassifier (everything except greetings/ratings/thanks → ALGORITHM), ~80% of turns trigger a re-read. The file doesn't change between turns — pure waste.

Compounding factor: the Algorithm also says "Read settings.json" at every phase transition to extract a voice notification text. That's 7 reads × 27KB ≈ 189KB per Algorithm run for ~50 bytes of useful data (~50,400 tokens wasted).

Total waste in a 10-turn session: ~105,000 tokens from redundant reads, consuming ~61% of available context budget after fixed costs. A back-of-envelope model of why re-reads grow this fast is sketched below.
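To make "quadratically worse" concrete: a Read result produced on turn i is re-sent on every later API call, so an every-turn re-read costs about n(n+1)/2 times the file size over an n-turn session, versus n times for a system-prompt constant. A toy model, where the per-read token count is an assumed round number rather than a measurement:

```ts
// Cumulative cost of re-reading a static file every turn: the read from
// turn i stays in history for turns i..n, so total = size * n(n+1)/2.
function cumulativeReadTokens(tokensPerRead: number, turns: number): number {
  return tokensPerRead * (turns * (turns + 1)) / 2;
}

// Constant-cost alternative: the same content in the system prompt is
// re-sent once per turn, so total = size * n.
function systemPromptTokens(tokens: number, turns: number): number {
  return tokens * turns;
}

const size = 5_000; // assumed tokens per Algorithm file read (illustrative)
console.log(cumulativeReadTokens(size, 10)); // 275,000 tokens over 10 turns
console.log(systemPromptTokens(size, 10));   // 50,000 tokens over 10 turns
```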
**The irony**

Lazy-loading was designed to save context vs the old embedded approach. But lazy-loading is only cheaper when the file is rarely needed. With Algorithm-default classification, it's needed on nearly every turn — making it more expensive than embedding after turn 2.

**Our fix (preserves lazy-loading, eliminates redundant reads)**
Implementation details are in this gist (just updated with the fix); a minimal sketch of the once-per-session idea is below.

**Takeaway for context optimization**

The general principle: static instructions belong in the system prompt (constant cost), not in conversation history (cumulative cost). If content doesn't change between turns, reading it via the Read tool on every turn is quadratically worse than having it in CLAUDE.md. Lazy-loading is great when triggered rarely — but "rarely" and "80% of turns" are different things.
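And the once-per-session sketch, assuming the UserPromptSubmit hook payload carries a session_id. The marker-file mechanism and paths are our illustration, not necessarily what the gist implements:

```ts
#!/usr/bin/env bun
// Inject the Algorithm file once per session, then stay silent so the
// Read never repeats. Marker-file approach and paths are illustrative.
import { existsSync, readFileSync, writeFileSync } from "fs";
import { tmpdir } from "os";
import { join } from "path";

const input = JSON.parse(readFileSync(0, "utf8"));
const marker = join(tmpdir(), `pai-algorithm-${input.session_id}`);

if (!existsSync(marker)) {
  writeFileSync(marker, "1"); // remember we already injected this session
  console.log(JSON.stringify({
    hookSpecificOutput: {
      hookEventName: "UserPromptSubmit",
      additionalContext: readFileSync("PAI/Algorithm/v3.5.0.md", "utf8"),
    },
  }));
}
// Later turns produce no output, so nothing is re-injected.
```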
---
Currently experimenting with Algorithm 2.0. And guess what?
It's 274 lines instead of almost 1300!
Doing a bunch of context window optimization for 3.1!!!