Status: Draft, gated. Validated 2026-05-05 against deepseek-v4-flash on 6 real coding tasks. Whether to actually build is a separate call — see "Decision" at the end.
Motivation
Every Reasonix session today re-discovers the same repo conventions: test runner, import suffix, file layout, tool-registration pattern, where common types live. Each rediscovery burns tool calls and tokens.
OpenClaw / Hermes use procedural memory (skill capture) for general-purpose agents. The interesting application for a coding agent is narrower: capture environmental facts about this specific repo so subsequent sessions skip rediscovery.
The original framing — "gets cheaper as you use it" (across the board) — turns out to be too broad. The honest framing is narrower but still real, and the validation data backs it.
Validation summary
6 paired tasks, 3 runs each, cold (no lessons in prompt) vs warm (lessons in prompt). Same model, same temperature, same tool set.
| Task | Reduction | Win rate | Notes |
|---|---|---|---|
| Write a unit test for one method | -64% | 3/3 | Convention-heavy, simple scope |
| Create a new tool file matching pattern | -69% | 3/3 | Pattern-matching against existing tools |
| Targeted bug fix in known function | -32% | 2/3 | Clear scope |
| Add method to complex class | +30% | 0/3 | Task ceiling above flash; lessons add overhead |
| Cross-file investigation | ~0% | — | Exploration-shaped, both modes cap-bound |
| Tool-addition planning | ~0% | — | Exploration-shaped, both modes cap-bound |
3/6 tasks reproducibly save 32%–69%, ~55% mean — all on tasks where the agent generates or modifies code that must match repo conventions, and the model can converge inside the turn budget.
The 3 negatives have clear, separate reasons:
- Add-method: the task is harder than v4-flash can solve in 30 turns regardless of lessons. Lessons only add prompt overhead.
- Cross-file / planning: exploration-shaped. Lesson hints don't help direction; the agent gets lost in tool calls regardless.
Honest pitch
❌ "Reasonix gets cheaper as you use it (across the board)"
✅ "Reasonix learns your repo. From the second session, tasks that write/modify code following your conventions cost 30–70% fewer tokens."
This is the bulk of real coding-agent work (write test, add function, modify implementation, fix bug, add tool). The pitch is narrower than the original framing, but it's reproducible and demoable side-by-side.
Design
Naming
`lessons` — distinct from the existing `skills` concept (= playbooks like `/explore`, `/review`).
| Concept | Source | Trigger | Example |
|---|---|---|---|
| skill (existing) | Pre-authored by user/builtin | User invokes via `/skill` or `run_skill()` | `/review` runs the review playbook |
| lesson (new) | Auto-extracted from session history | Auto-injected on classified code-gen tasks | "tests live in `tests/**/*.test.ts`" |
Storage
```
.reasonix/lessons/       # project scope (committed)
  testing.md
  imports.md
  tools-registration.md
~/.reasonix/lessons/     # global scope (personal, cross-repo)
  general-preferences.md
```
Format mirrors existing skills.ts frontmatter parser:
```
---
name: testing
description: Test runner + file convention
fingerprint:
  - package.json#scripts.test
  - tests/
captured-at: 2026-05-05
---
- vitest, tests/**/*.test.ts, run via `npm test`
- import style: .js suffix even for .ts source (ESM-NodeNext)
```
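Since the format mirrors the existing `skills.ts` frontmatter parser, a lesson file can be parsed with a few lines of string handling. A minimal sketch, assuming the field names shown in the example above (the `Lesson` shape and `parseLesson` helper are illustrative, not existing code):

```typescript
// Hypothetical lesson shape, derived from the frontmatter example in this RFC.
interface Lesson {
  name: string;
  description: string;
  fingerprint: string[]; // files/fields this lesson depends on
  capturedAt: string;
  body: string;          // the bullet-point facts injected into the prompt
}

function parseLesson(raw: string): Lesson {
  // Split "---\n<frontmatter>\n---\n<body>".
  const m = raw.match(/^---\n([\s\S]*?)\n---\n([\s\S]*)$/);
  if (!m) throw new Error("missing frontmatter");
  const [, front, body] = m;
  const fields: Record<string, string> = {};
  const fingerprint: string[] = [];
  let inList = false;
  for (const line of front.split("\n")) {
    if (line.startsWith("fingerprint:")) { inList = true; continue; }
    if (inList && line.trim().startsWith("- ")) {
      fingerprint.push(line.trim().slice(2));
      continue;
    }
    inList = false;
    const i = line.indexOf(":");
    if (i > 0) fields[line.slice(0, i).trim()] = line.slice(i + 1).trim();
  }
  return {
    name: fields["name"] ?? "",
    description: fields["description"] ?? "",
    fingerprint,
    capturedAt: fields["captured-at"] ?? "",
    body: body.trim(),
  };
}
```

In practice this would reuse the existing parser rather than duplicate it; the sketch only shows that the format stays cheap to read and write.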
Capture (auto-extraction)
At session end, a reducer pass over events.jsonl:
- Identify successful turn patterns: tool sequence → final answer accepted by user (or session completed without rejection).
- Use a cheap extraction call (deepseek-v4-flash, thinking: "disabled", ~600 max tokens) — same pattern as the existing harvest.ts.
- Output candidate lessons to .reasonix/lessons/_candidates/<name>.md.
- Promotion path: user runs `reasonix lessons promote <name>`, OR auto-promote after 7 days with no rejection.
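The first step — finding successful turn patterns — can be sketched as a small reducer over the event stream. The `Event` shape below is an assumption for illustration (the real shapes live in the v0.14 event-log kernel), and the extraction call itself is elided:

```typescript
// Hypothetical event shape; the real events.jsonl schema may differ.
interface Event {
  type: string;       // e.g. "tool-call", "final-answer"
  tool?: string;      // tool name when type === "tool-call"
  accepted?: boolean; // user verdict when type === "final-answer"
}

// Collect tool-call sequences that ended in an accepted (or non-rejected)
// final answer. These are the candidates fed to the cheap extraction call.
function successfulToolSequences(events: Event[]): string[][] {
  const seqs: string[][] = [];
  let current: string[] = [];
  for (const e of events) {
    if (e.type === "tool-call" && e.tool) current.push(e.tool);
    if (e.type === "final-answer") {
      if (e.accepted !== false && current.length > 0) seqs.push(current);
      current = [];
    }
  }
  return seqs;
}
```

Note the asymmetry: an explicitly rejected answer discards the sequence, but an answer with no verdict counts as success, matching "session completed without rejection" above.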
Injection (selective, NOT always)
Critical decision from validation. Do NOT inject lessons unconditionally. T2/T3 (exploration) and T4 (cap-bound) showed lesson overhead with no upside.
Pre-flight classifier (single cheap call before the main loop starts):
- Read the user request.
- Classify into one of: code-gen | exploration | qa | other.
- Inject lessons ONLY for code-gen.
- If the classifier is unsure → default to NOT inject (the failure cost of a false skip is small; the cost of a false inject on an exploration task is wasted prompt budget every turn).
Injected lesson budget: cap at ~200 tokens. The validated LESSONS block was 297 chars (~75 tokens) — that's the right order of magnitude.
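The gate plus budget cap amounts to very little code. A minimal sketch, where `TaskKind` comes from this RFC and the ~4-chars-per-token heuristic is an assumption (the real classifier is a model call, not shown):

```typescript
// Classification labels from the RFC, plus an explicit "unsure" outcome.
type TaskKind = "code-gen" | "exploration" | "qa" | "other" | "unsure";

const LESSON_TOKEN_BUDGET = 200;

// Only code-gen tasks showed savings in validation. An unsure classifier
// defaults to NOT injecting: a false skip costs one rediscovery, a false
// inject costs prompt budget on every turn.
function shouldInject(kind: TaskKind): boolean {
  return kind === "code-gen";
}

// Pack lesson lines into the prompt until the budget is hit.
// Assumes a rough ~4 chars/token conversion; real code would use a tokenizer.
function buildLessonBlock(
  lessons: string[],
  budgetTokens = LESSON_TOKEN_BUDGET,
): string {
  const budgetChars = budgetTokens * 4;
  let out = "";
  for (const l of lessons) {
    if (out.length + l.length + 1 > budgetChars) break;
    out += l + "\n";
  }
  return out.trimEnd();
}
```

Packing whole lessons (rather than truncating mid-lesson) keeps each injected fact intact, at the cost of occasionally under-filling the budget.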
Fingerprint / staleness
Each lesson carries a fingerprint listing the files/fields it depends on. At injection time:
- Hash the fingerprint inputs.
- If the hash diverges from the stored one → mark the lesson stale, skip injection, fall back to discovery, and re-write the lesson with the new hash.
- `reasonix lessons stale` surfaces drift.
This is the load-bearing reliability mechanism: without it, the lesson library slowly turns into misinformation.
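The hash check itself is straightforward. A sketch, assuming fingerprint entries like `package.json#scripts.test` are resolved to their current content by a hypothetical `read` callback (not existing code):

```typescript
import { createHash } from "node:crypto";

// Hash the current content of every fingerprint input. Inputs are sorted so
// the hash is independent of the order entries are listed in the lesson file.
function fingerprintHash(
  inputs: string[],
  read: (entry: string) => string, // resolves e.g. "package.json#scripts.test"
): string {
  const h = createHash("sha256");
  for (const entry of [...inputs].sort()) {
    h.update(entry);
    h.update("\0"); // separator so adjacent values can't collide
    h.update(read(entry));
  }
  return h.digest("hex");
}

// A lesson is stale when its stored hash no longer matches reality.
function isStale(
  storedHash: string,
  inputs: string[],
  read: (entry: string) => string,
): boolean {
  return fingerprintHash(inputs, read) !== storedHash;
}
```

So if `package.json#scripts.test` changes from `vitest` to `jest`, the testing lesson goes stale on the next injection attempt instead of feeding the agent an outdated convention.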
CLI surface
```
reasonix lessons list            # all lessons across scopes
reasonix lessons show <name>
reasonix lessons promote <name>  # candidate → real
reasonix lessons rm <name>
reasonix lessons stale           # fingerprint mismatches
```
Architecture fit
The v0.14 event-log kernel covers ~80% of the infra:
| Need | Existing piece |
|---|---|
| Event capture | events.jsonl + reducers |
| Cheap extraction | harvest.ts (typed-plan-state pattern) |
| Frontmatter + per-scope storage | SkillStore is reusable |
| Token-savings UI | telemetry/usage.ts |
New code: ~3 modules (lessons/store.ts, lessons/capture.ts, lessons/classifier.ts) + 1 CLI subcommand.
Non-goals
- Not a replacement for skills. Distinct concept, parallel namespace.
- Not always-on injection. Lessons must justify their prompt cost per task type.
- Not cross-user lesson sharing. Per-repo only (with optional global scope for personal preferences).
- Not "cheaper across the board" — only on code-gen tasks where the model converges.
Open questions
- Classifier — separate cheap call before the loop, or inline as a tool-choice nudge inside the main loop's first turn? Separate call is cleaner; inline saves one round-trip.
- Auto-promotion delay — 7 days? Per-N-uses? Configurable?
- Global scope semantics — purely personal preferences ("I prefer functional > class"), or repo-agnostic facts (e.g. "this user is in the China region, prefer DeepSeek defaults")?
- Misclassification cost — worst case from validation was ~5% extra tokens on a wrongly-classified task. Acceptable.
- Versioning gate — v0.19 or v0.20+?
Decision: deferred
This RFC documents what's been validated. Whether to ship is a separate prioritization call — the data supports the feature for the right task types, and the pitch is narrower than the original framing. Revisit alongside the rest of the v0.19 roadmap.
If we move forward, the build is ~3-5 days given how much of the infra already exists.