RFC: Procedural Lessons — auto-captured per-repo memory #261

@esengine


Status: Draft, gated. Validated 2026-05-05 against deepseek-v4-flash on 6 real coding tasks. Whether to actually build is a separate call — see "Decision" at the end.

Motivation

Every Reasonix session today re-discovers the same repo conventions: test runner, import suffix, file layout, tool-registration pattern, where common types live. Each rediscovery burns tool calls and tokens.

OpenClaw / Hermes use procedural memory (skill capture) for general-purpose agents. The interesting application for a coding agent is narrower: capture environmental facts about this specific repo so subsequent sessions skip rediscovery.

The original framing — "越用越便宜 (gets cheaper as you use it)" — turns out to be too broad. The honest framing is narrower but still real, and the validation data backs it.

Validation summary

6 paired tasks, 3 runs each, cold (no lessons in prompt) vs warm (lessons in prompt). Same model, same temperature, same tool set.

| Task | Reduction | Win rate | Notes |
|---|---|---|---|
| Write a unit test for one method | −64% | 3/3 | Convention-heavy, simple scope |
| Create a new tool file matching pattern | −69% | 3/3 | Pattern-matching against existing tools |
| Targeted bug fix in known function | −32% | 2/3 | Clear scope |
| Add method to complex class | +30% | 0/3 | Task ceiling above flash; lessons add overhead |
| Cross-file investigation | ~0% | — | Exploration-shaped, both modes cap-bound |
| Tool-addition planning | ~0% | — | Exploration-shaped, both modes cap-bound |

3/6 tasks reproducibly save 32%–69%, ~55% mean — all on tasks where the agent generates or modifies code that must match repo conventions, and the model can converge inside the turn budget.

The 3 negatives have clear, separate reasons:

  • Add-method: task is harder than v4-flash can solve in 30 turns regardless of lessons. Lessons only add prompt overhead.
  • Cross-file / planning: exploration-shaped. Lesson hints don't help direction; the agent gets lost in tool calls regardless.

Honest pitch

Too broad (dropped): "Reasonix gets cheaper as you use it (across the board)."

What the data supports: "Reasonix learns your repo. From the second session, tasks that write or modify code following your conventions cost 30–70% fewer tokens."

This is the bulk of real coding-agent work (write test, add function, modify implementation, fix bug, add tool). The pitch is narrower than the original framing, but it's reproducible and demoable side-by-side.

Design

Naming

lessons — distinct from existing skills (= playbooks like /explore, /review).

| Concept | Source | Trigger | Example |
|---|---|---|---|
| skill (existing) | Pre-authored by user/builtin | User invokes via /skill or run_skill() | /review runs the review playbook |
| lesson (new) | Auto-extracted from session history | Auto-injected on classified code-gen tasks | "tests live in tests/**/*.test.ts" |

Storage

```
.reasonix/lessons/                   # project scope (committed)
  testing.md
  imports.md
  tools-registration.md
~/.reasonix/lessons/                 # global scope (personal, cross-repo)
  general-preferences.md
```

Format mirrors existing skills.ts frontmatter parser:

```markdown
---
name: testing
description: Test runner + file convention
fingerprint:
  - package.json#scripts.test
  - tests/
captured-at: 2026-05-05
---
- vitest, tests/**/*.test.ts, run via `npm test`
- import style: .js suffix even for .ts source (ESM-NodeNext)
```
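The RFC says the format mirrors the existing skills.ts frontmatter parser. As a sketch only (the real parser may differ; `parseLesson` and the `LessonFile` shape are hypothetical names), a minimal split of frontmatter from body could look like:

```typescript
// Minimal frontmatter parser sketch. Handles scalar fields (name, description,
// captured-at) and one level of YAML-style lists (fingerprint). Not a full
// YAML parser — just enough for the lesson format shown above.
interface LessonFile {
  meta: Record<string, string | string[]>;
  body: string;
}

function parseLesson(raw: string): LessonFile {
  const match = raw.match(/^---\n([\s\S]*?)\n---\n?([\s\S]*)$/);
  if (!match) return { meta: {}, body: raw }; // no frontmatter block
  const meta: Record<string, string | string[]> = {};
  let listKey: string | null = null;
  for (const line of match[1].split("\n")) {
    const item = line.match(/^\s+-\s+(.*)$/);
    if (item && listKey) {
      (meta[listKey] as string[]).push(item[1]); // list entry under current key
      continue;
    }
    const kv = line.match(/^([\w-]+):\s*(.*)$/);
    if (!kv) continue;
    if (kv[2] === "") {
      listKey = kv[1]; // bare "key:" opens a list
      meta[kv[1]] = [];
    } else {
      listKey = null;
      meta[kv[1]] = kv[2]; // scalar value
    }
  }
  return { meta, body: match[2] };
}
```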

Capture (auto-extraction)

At session end, a reducer pass over events.jsonl:

  1. Identify successful turn patterns: tool sequence → final answer accepted by user (or session completed without rejection).
  2. Use a cheap extraction call (deepseek-v4-flash, thinking: "disabled", ~600 max tokens) — same pattern as the existing harvest.ts.
  3. Output candidate lesson to .reasonix/lessons/_candidates/<name>.md.
  4. Promotion path: user runs reasonix lessons promote <name>, OR auto-promote after 7 days with no rejection.
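Step 1 above can be sketched as a pure pass over session events; the event shape here is an assumption (the real events.jsonl schema may differ), and the extraction call in step 2 is out of scope:

```typescript
// Sketch: collect tool sequences that ended in an accepted final answer.
// A user_rejection event discards the most recently completed sequence.
interface SessionEvent {
  type: "tool_call" | "final_answer" | "user_rejection";
  turn: number;
  detail: string; // tool name, answer text, or rejection reason
}

function successfulToolSequences(events: SessionEvent[]): string[][] {
  const sequences: string[][] = [];
  let current: string[] = [];
  for (const ev of events) {
    if (ev.type === "tool_call") {
      current.push(ev.detail);
    } else if (ev.type === "final_answer") {
      if (current.length > 0) sequences.push(current); // turn completed
      current = [];
    } else if (ev.type === "user_rejection") {
      sequences.pop(); // rejected turn: drop its sequence
      current = [];
    }
  }
  return sequences;
}
```

Each surviving sequence would then be handed to the cheap extraction call to draft a candidate lesson.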

Injection (selective, NOT always)

Critical decision from validation: do NOT inject lessons unconditionally. The two exploration tasks and the cap-bound add-method task showed lesson overhead with no upside.

Pre-flight classifier (single cheap call before the main loop starts):

  • Read user request
  • Classify into one of: code-gen | exploration | qa | other
  • Inject lessons ONLY for code-gen
  • If classifier is unsure → default to NOT inject (the failure cost of false-skip is small; the cost of false-inject on an exploration task is wasted prompt budget every turn)

Injected lesson budget: cap at ~200 tokens. The validated lessons block was 297 chars (~75 tokens) — that's the right order of magnitude.
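The injection decision reduces to a small pure function. The task classes and the "unsure → skip" default come from the RFC; the chars/4 token estimate is a rough stand-in, not a real tokenizer:

```typescript
// Sketch: decide which lessons (if any) to inject for a classified task.
type TaskClass = "code-gen" | "exploration" | "qa" | "other" | "unsure";

const LESSON_TOKEN_BUDGET = 200; // per-RFC cap

function lessonsToInject(taskClass: TaskClass, lessons: string[]): string[] {
  // Only code-gen tasks get lessons; "unsure" defaults to not injecting,
  // since a false-inject costs prompt budget on every turn.
  if (taskClass !== "code-gen") return [];
  const picked: string[] = [];
  let tokens = 0;
  for (const lesson of lessons) {
    const est = Math.ceil(lesson.length / 4); // ~4 chars/token heuristic
    if (tokens + est > LESSON_TOKEN_BUDGET) break;
    picked.push(lesson);
    tokens += est;
  }
  return picked;
}
```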

Fingerprint / staleness

Each lesson carries a fingerprint listing the files/fields it depends on. At injection time:

  1. Hash the fingerprint inputs.
  2. If the hash diverges → mark the lesson stale, skip injection, fall back to discovery, and re-write the lesson with the new hash.
  3. reasonix lessons stale surfaces drift.

This is the load-bearing reliability mechanism: without it, the lesson library slowly turns into misinformation.
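The staleness check above is just a content hash over the fingerprint inputs. A sketch, assuming Node's crypto (`readInput` and the stored-hash field are hypothetical stand-ins for the real lesson store):

```typescript
import { createHash } from "node:crypto";

// Hash fingerprint inputs in a stable order so hashes are comparable
// across runs. readInput resolves a fingerprint entry (a file path, or a
// path#field reference like package.json#scripts.test) to its content.
function fingerprintHash(inputs: string[], readInput: (ref: string) => string): string {
  const h = createHash("sha256");
  for (const ref of [...inputs].sort()) {
    h.update(ref);
    h.update("\0"); // separator so ref/content boundaries can't collide
    h.update(readInput(ref));
    h.update("\0");
  }
  return h.digest("hex");
}

function isStale(storedHash: string, inputs: string[], readInput: (ref: string) => string): boolean {
  return fingerprintHash(inputs, readInput) !== storedHash;
}
```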

CLI surface

```
reasonix lessons list              # all lessons across scopes
reasonix lessons show <name>
reasonix lessons promote <name>    # candidate → real
reasonix lessons rm <name>
reasonix lessons stale             # fingerprint mismatches
```

Architecture fit

The v0.14 event-log kernel covers ~80% of the infra:

| Need | Existing piece |
|---|---|
| Event capture | events.jsonl + reducers |
| Cheap extraction | harvest.ts (typed-plan-state pattern) |
| Frontmatter + per-scope storage | SkillStore is reusable |
| Token-savings UI | telemetry/usage.ts |

New code: ~3 modules (lessons/store.ts, lessons/capture.ts, lessons/classifier.ts) + 1 CLI subcommand.

Non-goals

  • Not a replacement for skills. Distinct concept, parallel namespace.
  • Not always-on injection. Lessons must justify their prompt cost per task type.
  • Not cross-user lesson sharing. Per-repo only (with optional global scope for personal preferences).
  • Not "cheaper across the board" — only on code-gen tasks where the model converges.

Open questions

  1. Classifier — separate cheap call before the loop, or inline as a tool-choice nudge inside the main loop's first turn? Separate call is cleaner; inline saves one round-trip.
  2. Auto-promotion delay — 7 days? Per-N-uses? Configurable?
  3. Global scope semantics — purely personal preferences ("I prefer functional > class"), or repo-agnostic facts (e.g. "this user is in the China region, prefer DeepSeek defaults")?
  4. Misclassification cost — the worst case from validation was ~5% extra tokens on a wrongly-classified task. That seems acceptable.
  5. Versioning gate — v0.19 or v0.20+?

Decision: deferred

This RFC documents what's been validated. Whether to ship is a separate prioritization call — the data supports the feature for the right task types, and the pitch is narrower than the original framing. Revisit alongside the rest of the v0.19 roadmap.

If we move forward, the build is ~3-5 days given how much of the infra already exists.

Labels: rfc (Architecture proposal / request for comments)