🪟 Working on Context Window Optimization for 3.1! #731
---
That's significant. My PAI 3.0 plus my own additions loads about 32% of my Opus budget. It's performing exceptionally well, but I always need to plan around context rot so I avoid approaching 80% or more of the context budget.
---
I was just looking at this, Daniel. Thanks for your work on it! Question: can we lazy load via skill?
---
Great to see this happening. We've been working on the same problem from a different angle and have some analysis that might be useful input for Algorithm 2.0.

**Profiling v1.8.0**

We profiled every section of the current Algorithm and classified each by whether it requires LLM reasoning or is purely procedural. Only 12% genuinely requires LLM reasoning. The rest is rules a hook could enforce, data a file could store, or formatting a template could handle.

**Key finding: positional reinforcement**

We initially flagged the Quality Gate block appearing 3x as waste. On closer analysis, the inline copies in OBSERVE and PLAN serve as positional reinforcement: LLMs attend more strongly to instructions physically near where they need to be applied. The standalone reference copy is removable, but the inline copies are intentional design. Worth preserving this principle in 2.0.

**Designer Intent check (PR #742)**

This finding led us to propose a small addition to THINK's pressure test.

**Connection to lazy loading**

@timgrote's suggestion about lazy loading via skill aligns with our ContextRouter approach (issue #690). We built a working tiered loader that cuts greeting context by 71% and standard tasks by 28% by injecting only the Algorithm tiers needed per-prompt; a sketch of the hook shape is below. Happy to share implementation details if useful for 3.1.

**Our optimization strategies (on hold pending 2.0)**

We had four strategies planned: dedup (~920 tokens), externalize reference data (~3,400 tokens), hook-based enforcement (~4,000 tokens), and multi-model routing for mechanical operations. Putting implementation on hold since 2.0 will likely address the same structural issues differently. Happy to contribute analysis or testing.
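Here's a minimal sketch of the hook shape, assuming Claude Code's UserPromptSubmit contract (prompt JSON on stdin, additionalContext returned as JSON). The tier files and the keyword heuristic are illustrative stand-ins, not our actual rules:

```ts
#!/usr/bin/env bun
// Tiered context loader sketch (illustrative, not ContextRouter's real rules).
// UserPromptSubmit hook: classify the prompt, inject only the tier it needs.
import { readFileSync } from "fs";

const TIERS: Record<string, string[]> = {
  greeting: [],                                             // no Algorithm injected
  standard: ["algorithm/core.md"],                          // basic phases only
  extended: ["algorithm/core.md", "algorithm/advanced.md"], // + constraint extraction, PRD decomposition
};

function classify(prompt: string): string {
  if (/^(hi|hey|hello|thanks|thank you)\b/i.test(prompt.trim())) return "greeting";
  if (/\b(architect|decompose|prd|multi-step|refactor)\b/i.test(prompt)) return "extended";
  return "standard";
}

const input = JSON.parse(readFileSync(0, "utf8")); // hook payload on stdin
const files = TIERS[classify(input.prompt ?? "")];
const context = files.map((f) => readFileSync(f, "utf8")).join("\n\n");

// Empty additionalContext for greetings means nothing extra is injected.
console.log(JSON.stringify({
  hookSpecificOutput: {
    hookEventName: "UserPromptSubmit",
    additionalContext: context,
  },
}));
```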
---
One more thought on optimization. Beyond reducing Algorithm size, there's a complementary angle: which model runs which operation. Not everything the Algorithm does requires frontier-model reasoning. During our profiling, we classified operations by reasoning demand.

This doesn't reduce the context window; it reduces cost per Algorithm run by ~25-35% by routing mechanical operations to cheaper models via hooks and subagents. It might be worth considering as part of the 2.0 architecture: if the Algorithm is already getting modular at 274 lines, the infrastructure to route different phases to different models could be a natural extension (a sketch of the routing shape is below). Happy to share our implementation analysis if useful. Holding off on building it ourselves until 2.0 lands.
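To make the routing idea concrete, a minimal sketch of the table we have in mind. Operation names, tier assignments, and model IDs are all placeholders; the real split would come from the profiling described above:

```ts
// Model routing sketch: map each Algorithm operation class to the cheapest
// model tier that can handle it. All names below are placeholders.
type Tier = "frontier" | "mid" | "small";

const ROUTES: Record<string, Tier> = {
  "pressure-test-plan": "frontier",  // genuine multi-step reasoning
  "summarize-for-voice": "mid",      // summarization, not reasoning
  "format-tab-title": "small",       // mechanical templating
  "check-quality-gate": "small",     // rule checking a hook could also do
};

const MODELS: Record<Tier, string> = {
  frontier: "opus-model-id",         // substitute real model IDs here
  mid: "sonnet-model-id",
  small: "haiku-model-id",
};

// Unknown operations fall back to the safest (most capable) tier.
export function modelFor(operation: string): string {
  return MODELS[ROUTES[operation] ?? "frontier"];
}
```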
---
Yeah, this is great analysis and I appreciate it, but determining what goes into the voice analysis and tab titles and stuff like that still involves summarization of content, which is still inference. I do hear your point on looking for places where it can be code and where we can use lower models; it's a great point. That's why I have three levels of inference in the Inference tool. You should be using the Inference tool whenever you can, because it uses the built-in subscription.
---
Yeah, unfortunately I reverted back to 2.24 the next day because my context window bloated so badly on the 3.0 version. There was one feature that I eventually grabbed from the 3.0 upgrade and put back into my 2.24. I don't remember what it's called; it was part of the stage seven verify, where the system generates specific verification steps. This was very useful and only 0.2% of the context window by itself.
---
Context size fix on the way!
---
**Update: ContextRouter measured results (4 days, 94 prompts)**

Promised to share real numbers once we had them. Our tiered context loader (issue #690) has been running since Feb 18.

Blended result: ~36% effective reduction across all prompts, with an estimated ~700K tokens saved over the 4-day period (a quick sanity check of how the blend works is sketched below). The standard tier does the heavy lifting: most prompts don't need the advanced Algorithm phases (constraint extraction, PRD decomposition, etc.), so we skip injecting them. Extended kicks in when the CapabilityRecommender classifies a task as complex.

This is a stopgap until Algorithm 2.0 lands; Daniel's 274-line rewrite will likely make the tiered loader unnecessary. But for anyone running 1.8.0 and hitting context pressure, the approach works. Implementation details in issue #690 if anyone wants to try it.
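For anyone reproducing the blended figure: it's a share-weighted average of the per-tier reductions. The tier shares below are invented for illustration; only the ~71% and ~28% per-tier reductions come from our measurements:

```ts
// Blended reduction = sum over tiers of (share of prompts) x (reduction).
// Shares here are illustrative; only the per-tier reductions are measured.
const tiers = [
  { name: "greeting", share: 0.15, reduction: 0.71 },
  { name: "standard", share: 0.70, reduction: 0.28 },
  { name: "extended", share: 0.15, reduction: 0.00 }, // full context injected
];

const blended = tiers.reduce((sum, t) => sum + t.share * t.reduction, 0);
console.log(`blended reduction: ${(blended * 100).toFixed(1)}%`);
// 0.15*0.71 + 0.70*0.28 + 0.15*0 = 0.3025 -> ~30% with these example shares
```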
---
Relevant data point for the v3.1 optimization work: the context size problem is compounded by upstream Claude Code bugs. Claude Code's compaction process treats hook-injected context as ordinary conversation history, so critical instructions can be summarized away during compaction. Separately, Opus 4.6 has a documented instruction-following regression (#24991 — 92/100 → 38/100 on the same model ID). Even when instructions survive compaction, the model may deprioritize them.

I compiled the full upstream evidence here: #792

For v3.1, this suggests context size reduction alone isn't sufficient. The architecture also needs a way to re-inject critical context after compaction and/or enforce key behaviors programmatically rather than via text instructions. A sketch of one possible re-injection hook follows.
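One possible mitigation shape, assuming Claude Code's SessionStart hook fires with source === "compact" after a compaction and accepts additionalContext in its JSON output. The file path is illustrative:

```ts
#!/usr/bin/env bun
// Post-compaction re-injection sketch. Assumes the SessionStart hook
// contract described above; not a confirmed PAI mechanism.
import { readFileSync } from "fs";

const input = JSON.parse(readFileSync(0, "utf8")); // hook payload on stdin

if (input.source === "compact") {
  // Re-inject the instructions most likely to be summarized away.
  const critical = readFileSync("PAI/critical-invariants.md", "utf8");
  console.log(JSON.stringify({
    hookSpecificOutput: {
      hookEventName: "SessionStart",
      additionalContext: critical,
    },
  }));
}
```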
---
Closing — the context window optimization goal described here was achieved in PAI v4.0.0. This was the whole point of v4.0: lean and mean. Results are in the full release notes.
---
**Hidden context cost: Algorithm file re-reads + settings.json reads per phase**

Found a significant context waste pattern in v4.0's lazy-loading architecture that's worth sharing.

**The problem**

v4.0 moved the Algorithm from embedded-in-SKILL.md to lazy-loaded from a separate file. CLAUDE.md says "MANDATORY FIRST ACTION: Read PAI/Algorithm/v3.5.0.md." This fires on every ALGORITHM-classified turn. The issue: Read tool outputs accumulate in conversation history. Each Read result is a message that persists for all future API calls.

With the Algorithm-default ModeClassifier (everything except greetings/ratings/thanks → ALGORITHM), ~80% of turns trigger a re-read. The file doesn't change between turns — pure waste.

Compounding factor: the Algorithm also says "Read settings.json" at every phase transition to extract a voice notification text. That's 7 reads × 27KB ≈ 189KB per Algorithm run for ~50 bytes of useful data (~50,400 tokens wasted).

Total waste in a 10-turn session: ~105,000 tokens from redundant reads, consuming ~61% of available context budget after fixed costs. A back-of-envelope model of why re-reads grow this fast is sketched below.
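To make "quadratically worse" concrete: a Read result produced on turn i is re-sent on every later API call, so an every-turn re-read costs about n(n+1)/2 times the file size over an n-turn session, versus n times for a system-prompt constant. A toy model, where the per-read token count is an assumed round number rather than a measurement:

```ts
// Cumulative cost of re-reading a static file every turn: the read from
// turn i stays in history for turns i..n, so total = size * n(n+1)/2.
function cumulativeReadTokens(tokensPerRead: number, turns: number): number {
  return tokensPerRead * (turns * (turns + 1)) / 2;
}

// Constant-cost alternative: the same content in the system prompt is
// re-sent once per turn, so total = size * n.
function systemPromptTokens(tokens: number, turns: number): number {
  return tokens * turns;
}

const size = 5_000; // assumed tokens per Algorithm file read (illustrative)
console.log(cumulativeReadTokens(size, 10)); // 275,000 tokens over 10 turns
console.log(systemPromptTokens(size, 10));   // 50,000 tokens over 10 turns
```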
**The irony**

Lazy-loading was designed to save context vs the old embedded approach. But lazy-loading is only cheaper when the file is rarely needed. With Algorithm-default classification, it's needed on nearly every turn — making it more expensive than embedding after turn 2.

**Our fix (preserves lazy-loading, eliminates redundant reads)**
Implementation details are in this gist (just updated with the fix); a minimal sketch of the once-per-session idea is below.

**Takeaway for context optimization**

The general principle: static instructions belong in the system prompt (constant cost), not in conversation history (cumulative cost). If content doesn't change between turns, reading it via the Read tool on every turn is quadratically worse than having it in CLAUDE.md. Lazy-loading is great when triggered rarely — but "rarely" and "80% of turns" are different things.
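And the once-per-session sketch, assuming the UserPromptSubmit hook payload carries a session_id. The marker-file mechanism and paths are our illustration, not necessarily what the gist implements:

```ts
#!/usr/bin/env bun
// Inject the Algorithm file once per session, then stay silent so the
// Read never repeats. Marker-file approach and paths are illustrative.
import { existsSync, readFileSync, writeFileSync } from "fs";
import { tmpdir } from "os";
import { join } from "path";

const input = JSON.parse(readFileSync(0, "utf8"));
const marker = join(tmpdir(), `pai-algorithm-${input.session_id}`);

if (!existsSync(marker)) {
  writeFileSync(marker, "1"); // remember we already injected this session
  console.log(JSON.stringify({
    hookSpecificOutput: {
      hookEventName: "UserPromptSubmit",
      additionalContext: readFileSync("PAI/Algorithm/v3.5.0.md", "utf8"),
    },
  }));
}
// Later turns produce no output, so nothing is re-injected.
```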
---
Currently experimenting with Algorithm 2.0. And guess what?
It's 274 lines instead of almost 1300!
Doing a bunch of context window optimization for 3.1!!!