RFC: Procedural Lessons — auto-captured per-repo memory #261

@esengine


Status: Draft, gated. Validated 2026-05-05 against deepseek-v4-flash on 6 real coding tasks. Whether to actually build is a separate call — see "Decision" at the end.

Motivation

Every Reasonix session today re-discovers the same repo conventions: test runner, import suffix, file layout, tool-registration pattern, where common types live. Each rediscovery burns tool calls and tokens.

OpenClaw / Hermes use procedural memory (skill capture) for general-purpose agents. The interesting application for a coding agent is narrower: capture environmental facts about this specific repo so subsequent sessions skip rediscovery.

The original framing — "越用越便宜 (gets cheaper as you use it)" — turns out to be too broad. The honest framing is narrower but still real, and the validation data backs it.

Validation summary

6 paired tasks, 3 runs each, cold (no lessons in prompt) vs warm (lessons in prompt). Same model, same temperature, same tool set.

| Task | Reduction | Win rate | Notes |
|---|---|---|---|
| Write a unit test for one method | −64% | 3/3 | Convention-heavy, simple scope |
| Create a new tool file matching pattern | −69% | 3/3 | Pattern-matching against existing tools |
| Targeted bug fix in known function | −32% | 2/3 | Clear scope |
| Add method to complex class | +30% | 0/3 | Task ceiling above flash; lessons add overhead |
| Cross-file investigation | ~0% | — | Exploration-shaped, both modes cap-bound |
| Tool-addition planning | ~0% | — | Exploration-shaped, both modes cap-bound |

3/6 tasks reproducibly save 32%–69%, ~55% mean — all on tasks where the agent generates or modifies code that must match repo conventions, and the model can converge inside the turn budget.

The 3 negatives have clear, separate reasons:

  • Add-method: task is harder than v4-flash can solve in 30 turns regardless of lessons. Lessons only add prompt overhead.
  • Cross-file / planning: exploration-shaped. Lesson hints don't help direction; the agent gets lost in tool calls regardless.

Honest pitch

Too broad (dropped): "Reasonix gets cheaper as you use it (across the board)."

What the data supports: "Reasonix learns your repo. From the second session, tasks that write or modify code following your conventions cost 30–70% fewer tokens."

This is the bulk of real coding-agent work (write test, add function, modify implementation, fix bug, add tool). The pitch is narrower than the original framing, but it's reproducible and demoable side-by-side.

Design

Naming

lessons — distinct from existing skills (= playbooks like /explore, /review).

| Concept | Source | Trigger | Example |
|---|---|---|---|
| skill (existing) | Pre-authored by user/builtin | User invokes via /skill or run_skill() | /review runs the review playbook |
| lesson (new) | Auto-extracted from session history | Auto-injected on classified code-gen tasks | "tests live in tests/**/*.test.ts" |

Storage

```
.reasonix/lessons/                   # project scope (committed)
  testing.md
  imports.md
  tools-registration.md
~/.reasonix/lessons/                 # global scope (personal, cross-repo)
  general-preferences.md
```

Format mirrors existing skills.ts frontmatter parser:

```markdown
---
name: testing
description: Test runner + file convention
fingerprint:
  - package.json#scripts.test
  - tests/
captured-at: 2026-05-05
---
- vitest, tests/**/*.test.ts, run via `npm test`
- import style: .js suffix even for .ts source (ESM-NodeNext)
```
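The RFC says the format mirrors the existing skills.ts frontmatter parser. As a sketch only (the real parser may differ; `parseLesson` and the `LessonFile` shape are hypothetical names), a minimal split of frontmatter from body could look like:

```typescript
// Minimal frontmatter parser sketch. Handles scalar fields (name, description,
// captured-at) and one level of YAML-style lists (fingerprint). Not a full
// YAML parser — just enough for the lesson format shown above.
interface LessonFile {
  meta: Record<string, string | string[]>;
  body: string;
}

function parseLesson(raw: string): LessonFile {
  const match = raw.match(/^---\n([\s\S]*?)\n---\n?([\s\S]*)$/);
  if (!match) return { meta: {}, body: raw }; // no frontmatter block
  const meta: Record<string, string | string[]> = {};
  let listKey: string | null = null;
  for (const line of match[1].split("\n")) {
    const item = line.match(/^\s+-\s+(.*)$/);
    if (item && listKey) {
      (meta[listKey] as string[]).push(item[1]); // list entry under current key
      continue;
    }
    const kv = line.match(/^([\w-]+):\s*(.*)$/);
    if (!kv) continue;
    if (kv[2] === "") {
      listKey = kv[1]; // bare "key:" opens a list
      meta[kv[1]] = [];
    } else {
      listKey = null;
      meta[kv[1]] = kv[2]; // scalar value
    }
  }
  return { meta, body: match[2] };
}
```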

Capture (auto-extraction)

At session end, a reducer pass over events.jsonl:

  1. Identify successful turn patterns: tool sequence → final answer accepted by user (or session completed without rejection).
  2. Use a cheap extraction call (deepseek-v4-flash, thinking: "disabled", ~600 max tokens) — same pattern as the existing harvest.ts.
  3. Output candidate lesson to .reasonix/lessons/_candidates/<name>.md.
  4. Promotion path: user runs reasonix lessons promote <name>, OR auto-promote after 7 days with no rejection.
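Step 1 above can be sketched as a pure pass over session events; the event shape here is an assumption (the real events.jsonl schema may differ), and the extraction call in step 2 is out of scope:

```typescript
// Sketch: collect tool sequences that ended in an accepted final answer.
// A user_rejection event discards the most recently completed sequence.
interface SessionEvent {
  type: "tool_call" | "final_answer" | "user_rejection";
  turn: number;
  detail: string; // tool name, answer text, or rejection reason
}

function successfulToolSequences(events: SessionEvent[]): string[][] {
  const sequences: string[][] = [];
  let current: string[] = [];
  for (const ev of events) {
    if (ev.type === "tool_call") {
      current.push(ev.detail);
    } else if (ev.type === "final_answer") {
      if (current.length > 0) sequences.push(current); // turn completed
      current = [];
    } else if (ev.type === "user_rejection") {
      sequences.pop(); // rejected turn: drop its sequence
      current = [];
    }
  }
  return sequences;
}
```

Each surviving sequence would then be handed to the cheap extraction call to draft a candidate lesson.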

Injection (selective, NOT always)

Critical decision from validation: do NOT inject lessons unconditionally. The two exploration tasks and the cap-bound add-method task showed lesson overhead with no upside.

Pre-flight classifier (single cheap call before the main loop starts):

  • Read user request
  • Classify into one of: code-gen | exploration | qa | other
  • Inject lessons ONLY for code-gen
  • If classifier is unsure → default to NOT inject (the failure cost of false-skip is small; the cost of false-inject on an exploration task is wasted prompt budget every turn)

Injected lesson budget: cap at ~200 tokens. The validated lessons block was 297 chars (~75 tokens) — that's the right order of magnitude.
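The injection decision reduces to a small pure function. The task classes and the "unsure → skip" default come from the RFC; the chars/4 token estimate is a rough stand-in, not a real tokenizer:

```typescript
// Sketch: decide which lessons (if any) to inject for a classified task.
type TaskClass = "code-gen" | "exploration" | "qa" | "other" | "unsure";

const LESSON_TOKEN_BUDGET = 200; // per-RFC cap

function lessonsToInject(taskClass: TaskClass, lessons: string[]): string[] {
  // Only code-gen tasks get lessons; "unsure" defaults to not injecting,
  // since a false-inject costs prompt budget on every turn.
  if (taskClass !== "code-gen") return [];
  const picked: string[] = [];
  let tokens = 0;
  for (const lesson of lessons) {
    const est = Math.ceil(lesson.length / 4); // ~4 chars/token heuristic
    if (tokens + est > LESSON_TOKEN_BUDGET) break;
    picked.push(lesson);
    tokens += est;
  }
  return picked;
}
```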

Fingerprint / staleness

Each lesson carries a fingerprint listing the files/fields it depends on. At injection time:

  1. Hash the fingerprint inputs.
  2. If the hash diverges → mark the lesson stale, skip injection, fall back to discovery, and re-write the lesson with the new hash.
  3. reasonix lessons stale surfaces drift.

This is the load-bearing reliability mechanism: without it, the lesson library slowly turns into misinformation.
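The staleness check above is just a content hash over the fingerprint inputs. A sketch, assuming Node's crypto (`readInput` and the stored-hash field are hypothetical stand-ins for the real lesson store):

```typescript
import { createHash } from "node:crypto";

// Hash fingerprint inputs in a stable order so hashes are comparable
// across runs. readInput resolves a fingerprint entry (a file path, or a
// path#field reference like package.json#scripts.test) to its content.
function fingerprintHash(inputs: string[], readInput: (ref: string) => string): string {
  const h = createHash("sha256");
  for (const ref of [...inputs].sort()) {
    h.update(ref);
    h.update("\0"); // separator so ref/content boundaries can't collide
    h.update(readInput(ref));
    h.update("\0");
  }
  return h.digest("hex");
}

function isStale(storedHash: string, inputs: string[], readInput: (ref: string) => string): boolean {
  return fingerprintHash(inputs, readInput) !== storedHash;
}
```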

CLI surface

```
reasonix lessons list              # all lessons across scopes
reasonix lessons show <name>
reasonix lessons promote <name>    # candidate → real
reasonix lessons rm <name>
reasonix lessons stale             # fingerprint mismatches
```

Architecture fit

The v0.14 event-log kernel covers ~80% of the infra:

| Need | Existing piece |
|---|---|
| Event capture | events.jsonl + reducers |
| Cheap extraction | harvest.ts (typed-plan-state pattern) |
| Frontmatter + per-scope storage | SkillStore is reusable |
| Token-savings UI | telemetry/usage.ts |

New code: ~3 modules (lessons/store.ts, lessons/capture.ts, lessons/classifier.ts) + 1 CLI subcommand.

Non-goals

  • Not a replacement for skills. Distinct concept, parallel namespace.
  • Not always-on injection. Lessons must justify their prompt cost per task type.
  • Not cross-user lesson sharing. Per-repo only (with optional global scope for personal preferences).
  • Not "cheaper across the board" — only on code-gen tasks where the model converges.

Open questions

  1. Classifier — separate cheap call before the loop, or inline as a tool-choice nudge inside the main loop's first turn? Separate call is cleaner; inline saves one round-trip.
  2. Auto-promotion delay — 7 days? Per-N-uses? Configurable?
  3. Global scope semantics — purely personal preferences ("I prefer functional > class"), or repo-agnostic facts (e.g. "this user is in the China region, prefer DeepSeek defaults")?
  4. Misclassification cost — the worst case from validation was ~5% extra tokens on a wrongly-classified task. That seems acceptable.
  5. Versioning gate — v0.19 or v0.20+?

Decision: deferred

This RFC documents what's been validated. Whether to ship is a separate prioritization call — the data supports the feature for the right task types, and the pitch is narrower than the original framing. Revisit alongside the rest of the v0.19 roadmap.

If we move forward, the build is ~3-5 days given how much of the infra already exists.

Labels: rfc (Architecture proposal / request for comments)