agent-memory

Save 60-90% on LLM token costs with intelligent memory compression for multi-agent systems.

agent-memory compresses raw LLM tool output into structured observations, shares context across agents via a memory bus, and injects only relevant memory into each prompt — keeping your token budget under control.


The Problem

Running 5+ concurrent LLM agents burns tokens fast:

  • Each agent re-reads the same files, re-discovers the same context
  • Raw tool output (file reads, command results) consumes thousands of tokens
  • No shared memory means redundant API calls across agents
  • You hit rate limits and token budgets within minutes

The Solution

agent-memory sits between your agents and their context window:

Raw Tool Output (5,000 tokens)
  -> Observation Compression (500 tokens)
    -> Shared Memory Bus (SQLite + FTS5)
      -> Budget-Controlled Context Injection (8,000 token cap)

Tested results: 66-94% token savings, 3-74x compression ratio.


Quick Start

As a Python SDK

pip install agent-memory

from agent_memory import MemoryStore, ContextBuilder, Observation

# Initialize
memory = MemoryStore("./my_project.db")

# Store a compressed observation
memory.store_observation(Observation(
    agent_id="researcher-1",
    project="my-app",
    title="Found pagination bug in /users endpoint",
    narrative="The API returns 500 when page > 100 due to missing LIMIT clause",
    facts=["Max page size is 100", "No server-side validation"],
    concepts=["api", "bug", "pagination"],
))

# Build context for another agent (token-budgeted)
builder = ContextBuilder(memory)
context = builder.build(
    project="my-app",
    agent_id="coder-1",
    task_description="Fix the pagination bug",
)
# -> Returns compressed context within 8000 token budget
# -> Includes researcher-1's findings automatically

As a Claude Code Plugin

Step 1: Install

# Add the marketplace
claude plugin marketplace add Keshab0310/agent-memory

# Install the plugin
claude plugin install agent-memory@keshab-plugins

Verify it's installed:

claude plugin list
# Should show: agent-memory@keshab-plugins  v0.1.0  ✔ enabled

Step 2: Use It (It's Automatic)

Once installed, the plugin works silently in the background with zero configuration:

What happens automatically:

  • Every time Claude reads a file, runs a command, or edits code, the PostToolUse hook compresses that tool output into a structured observation and stores it locally
  • Every time you start a new session (or resume one), the SessionStart hook injects relevant past observations into Claude's context
  • Your plan is auto-detected (Pro/Max/API) and memory budgets adjust accordingly

You don't need to change how you use Claude Code. Just work normally — the plugin handles compression and recall behind the scenes.
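
To make the flow concrete, here is a minimal sketch of what a PostToolUse hook can do. It is not the plugin's actual hooks/post_tool_use.py, and the payload field names (tool_name, tool_response) are assumptions about the JSON Claude Code pipes to the hook over stdin:

import json
import sys

from agent_memory import MemoryStore, Observation

# Hook payload arrives as JSON on stdin (field names assumed for illustration)
event = json.load(sys.stdin)
raw_output = str(event.get("tool_response", ""))

memory = MemoryStore("./memory.db")  # hypothetical local path
memory.store_observation(Observation(
    agent_id="claude-code",
    project="my-app",
    title=f"{event.get('tool_name', 'tool')} output",
    narrative=raw_output[:500],      # naive truncation stands in for real compression
    facts=[],
    concepts=[event.get("tool_name", "tool").lower()],
))

The real hook replaces the naive truncation with the observation compressor described under "How It Works" below.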

Step 3: Search Past Work (MCP Tools)

The plugin exposes 4 MCP tools that Claude can use when you ask about past work:

You: "What did we find about the database schema yesterday?"
Claude: [uses memory_search tool] -> finds relevant observations
  -> "Yesterday we discovered the users table was missing an index
     on email, which caused the slow login query. We added a B-tree
     index and response time dropped from 2.3s to 45ms."
You: "How much have we saved on tokens?"
Claude: [uses memory_stats tool]
  -> "156 observations stored. Compression ratio: 18.2:1.
     Token savings: 94%. Estimated cost saved: $1.34."

The 4 tools available:

Tool            What It Does                                  When Claude Uses It
memory_search   Searches past observations by keyword/topic   When you ask "what did we find about X?"
memory_store    Manually stores an observation                When you say "remember this for later"
memory_stats    Shows the token savings dashboard             When you ask about costs or savings
memory_context  Builds a context summary for a task           When starting complex multi-step work
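
Under the hood these are ordinary MCP tools. As a conceptual sketch (not the project's actual plugin/mcp_server.py), memory_search could be exposed with the official mcp Python SDK roughly like this; the store's search method name is an assumption:

from mcp.server.fastmcp import FastMCP

from agent_memory import MemoryStore

mcp = FastMCP("agent-memory")
memory = MemoryStore("./memory.db")  # hypothetical path

@mcp.tool()
def memory_search(query: str, limit: int = 10) -> str:
    """Search past observations by keyword or topic."""
    results = memory.search(query, limit=limit)  # 'search' is assumed; the real store may differ
    return "\n".join(str(r) for r in results)

if __name__ == "__main__":
    mcp.run()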

Step 4: Check It's Working

After using Claude Code for a few tasks with the plugin installed:

You: "Show me my memory stats"
Claude: [uses memory_stats]
  Total observations: 23
  Compression ratio: 12.4:1
  Token savings: 91%

If you see observations being stored and savings > 0%, the plugin is working.

Uninstall

claude plugin uninstall agent-memory@keshab-plugins
claude plugin marketplace remove keshab-plugins

Plugin Data Location

All data is stored locally:

  • SQLite database: ~/.claude/plugins/data/agent-memory/memory.db
  • No data leaves your machine — see PRIVACY.md

Delete all plugin data:

rm -rf ~/.claude/plugins/data/agent-memory/

With Local LLMs (Ollama, LM Studio)

from agent_memory import LocalLLMAgent, AgentConfig, MemoryStore

memory = MemoryStore("./local.db")
agent = LocalLLMAgent(
    config=AgentConfig(agent_type="researcher", model="phi4:latest"),
    memory=memory,
    project="my-app",
    base_url="http://localhost:11434/v1",  # Ollama
)
result = agent.execute("What are the key design patterns in this codebase?")

How It Works

1. Observation Compression

Raw tool output (file reads, command results, API responses) gets compressed into structured observations:

[discovery] Found pagination bug in /users endpoint
  API returns 500 when page > 100 due to missing LIMIT clause
  - Max page size is 100
  - No server-side validation

A 5,000-token file read becomes a 200-token observation. That's a 25x compression ratio.
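
As a rough sketch of how the SDK's Observation fields map to that rendered block (format_observation is illustrative rather than a published SDK function, and attribute access on Observation is assumed):

from agent_memory import Observation

def format_observation(obs: Observation, kind: str = "discovery") -> str:
    # Render an observation in the compact bracketed form shown above
    lines = [f"[{kind}] {obs.title}", f"  {obs.narrative}"]
    lines += [f"  - {fact}" for fact in obs.facts]
    return "\n".join(lines)

obs = Observation(
    agent_id="researcher-1",
    project="my-app",
    title="Found pagination bug in /users endpoint",
    narrative="API returns 500 when page > 100 due to missing LIMIT clause",
    facts=["Max page size is 100", "No server-side validation"],
    concepts=["api", "bug", "pagination"],
)
print(format_observation(obs))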

2. Shared Memory Bus

All agents write to and read from a shared memory store:

SQLite (structured queries) + FTS5 (full-text search)
  |
  +-- Optional: ChromaDB (semantic vector search)

Agent B sees what Agent A discovered — no re-querying needed.
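
A conceptual illustration of the storage layer (not the project's actual schema), showing why SQLite plus FTS5 makes cross-agent recall a single query:

import sqlite3

conn = sqlite3.connect("shared.db")
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS obs_fts USING fts5(agent_id, title, narrative)")

# Agent A writes a finding
conn.execute(
    "INSERT INTO obs_fts VALUES (?, ?, ?)",
    ("researcher-1", "Found pagination bug", "500 error when page > 100, missing LIMIT clause"),
)
conn.commit()

# Agent B recalls it later with a full-text query instead of re-reading the source
rows = conn.execute(
    "SELECT agent_id, title FROM obs_fts WHERE obs_fts MATCH ?", ("pagination",)
).fetchall()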

3. Token-Budgeted Context Injection

Before each agent call, the ContextBuilder assembles a minimal context window:

Budget adapts to your plan automatically:

Plan           Total Budget   Own Obs   Cross-Agent   Why
Pro + Sonnet   8,000          4,000     2,400         Capped usage — stay lean
Pro + Opus     5,000          2,500     1,500         Opus burns limits fast — ultra-lean
Max + Sonnet   16,000         8,000     5,000         Unlimited — go wider
Max + Opus     50,000         25,000    18,000        Unlimited + 1M window — go deep
Local LLM      1,500          700       300           Small context windows
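
The per-plan rows above map onto the SDK's ContextBudget (see Configuration below). For example, the Pro + Sonnet row corresponds roughly to:

from agent_memory import ContextBudget

# Illustrative split for the "Pro + Sonnet" profile; the plugin's internal
# split may differ slightly
pro_sonnet = ContextBudget(
    total=8000,             # total token cap for injected context
    task_description=800,   # reserved for the task text
    own_observations=4000,  # the agent's own recent work
    cross_agent=2400,       # other agents' relevant findings
    summaries=800,          # condensed history
)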

4. Prompt Caching (Anthropic)

Static content gets cache breakpoints for 90% input cost reduction:

System prompt     [CACHED - 10% cost]
Shared context    [CACHED - 10% cost]
Agent memory      [dynamic - full cost]
User message      [dynamic - full cost]
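
At the raw Anthropic API level, those breakpoints look roughly like the sketch below. agent-memory's prompt_cache.py wrapper is what places them for you, so this is only to show where the caching happens:

import anthropic

SYSTEM_PROMPT = "You are the coder agent..."            # static across calls
SHARED_CONTEXT = "Project conventions, architecture"    # static across calls
AGENT_MEMORY = "Relevant observations for this task"    # changes per call
USER_MESSAGE = "Fix the pagination bug"

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-5",  # any cache-capable model
    max_tokens=1024,
    system=[
        {"type": "text", "text": SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},       # cached: ~10% input cost on hits
        {"type": "text", "text": SHARED_CONTEXT,
         "cache_control": {"type": "ephemeral"}},       # cached
    ],
    messages=[
        {"role": "user", "content": AGENT_MEMORY + "\n\n" + USER_MESSAGE},  # dynamic: full cost
    ],
)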

Real-World Use Cases

Use Case 1: Multi-File Refactoring Without Context Loss

The problem: You ask Claude Code to refactor authentication across 15 files. By file 8, it's forgotten the patterns established in files 1-3. It re-reads them, burning tokens. By file 12, you hit context limits.

How agent-memory solves it:

File 1-3: Claude reads and refactors auth code
  -> PostToolUse hook compresses each file read into an observation:
     [change] Refactored auth.py — replaced session tokens with JWT
     - New pattern: verify_jwt() middleware on all protected routes
     - Removed: legacy session_store dependency
     
File 4-15: Claude continues refactoring
  -> SessionStart hook injects compressed observations from files 1-3
  -> Claude sees the patterns (150 tokens) instead of re-reading files (5,000 tokens)
  -> 97% token savings on context recall

Before agent-memory: 15 files x 5,000 tokens each = 75,000 tokens re-read.
After agent-memory: 15 observations x 150 tokens = 2,250 tokens. 97% savings.


Use Case 2: Debugging Across Sessions

The problem: Yesterday you spent 2 hours debugging a race condition. You found the root cause, tried 3 approaches, and fixed it. Today, a related bug appears. Claude Code has zero memory of yesterday's work. You start from scratch.

How agent-memory solves it:

Yesterday's session (auto-captured by hooks):
  [discovery] Race condition in WebSocket handler
    - write_lock missing on shared_state dict
    - Reproduced with 5+ concurrent connections
    - Tried: asyncio.Lock (failed — wrong event loop)
    - Tried: threading.Lock (worked but caused deadlock in tests)
    - Fixed: threading.RLock with 5s timeout

Today's session:
  You: "There's another threading issue in the notification service"
  -> SessionStart hook injects yesterday's context automatically
  -> Claude sees the RLock pattern that worked
  -> Skips the 2 failed approaches
  -> Applies the proven fix in one shot

Without agent-memory: 45 minutes re-investigating the same threading patterns.
With agent-memory: 5 minutes — Claude already knows what works in your codebase.


Use Case 3: Multi-Agent Research Pipeline

The problem: You spawn 5 agents to research, code, review, test, and document a feature. Each agent works in isolation. The coder doesn't know what the researcher found. The reviewer doesn't know what the coder tried and rejected.

How agent-memory solves it:

from src.profiles import detect_profile
from src.agents.base import Agent
from src.agents.registry import get_agent_config
from src.memory.store import MemoryStore

memory = MemoryStore("./shared.db")

# Agent 1: Researcher finds the best approach
researcher = Agent(get_agent_config("researcher"), memory, "my-project")
researcher.execute("Research OAuth2 vs API keys for our B2B API")
# -> Stores: [discovery] OAuth2 better for B2B — supports scopes, token rotation

# Agent 2: Coder sees the researcher's findings automatically
coder = Agent(get_agent_config("coder"), memory, "my-project")
coder.execute("Implement the auth system")
# -> ContextBuilder injects: "Researcher found OAuth2 is better for B2B..."
# -> Coder builds OAuth2 without asking "which auth method?"

# Agent 3: Reviewer sees BOTH researcher reasoning AND coder implementation
reviewer = Agent(get_agent_config("reviewer"), memory, "my-project")
reviewer.execute("Review the auth implementation")
# -> Sees researcher's OAuth2 rationale + coder's implementation decisions
# -> Reviews against the original requirements, not just code syntax

Without shared memory: Reviewer says "why not API keys?" — coder explains — wastes 2 round trips.
With shared memory: Reviewer already has context. Zero redundant conversation.


Use Case 4: Pro Plan Token Budget Optimization

The problem: You're on the Pro Plan using Opus 4.6. Adaptive thinking on "High" burns through your daily limit in 10 messages. Each message costs ~$0.50+ in tokens because the context window fills with raw tool output.

How agent-memory solves it:

Without agent-memory (Opus on Pro):
  Message 1: Read 3 files (15,000 tokens) + Opus thinking (25,000 tokens) = 40,000 tokens
  Message 2: Re-reads same files + new query = 45,000 tokens  
  Message 3: Context growing, Opus thinking harder = 60,000 tokens
  Total after 3 messages: 145,000 tokens. Daily limit: approaching fast.

With agent-memory (auto-detects opus-pro profile):
  Message 1: Read 3 files -> compressed to 3 observations (450 tokens)
             Opus thinking capped at 10,000 tokens = 25,000 total
  Message 2: Observations injected (450 tokens, not 15,000)
             New query + thinking = 18,000 total
  Message 3: 5,000 token memory budget, lean injection = 20,000 total
  Total after 3 messages: 63,000 tokens. 57% savings.

The plugin auto-detects your plan:

# No configuration needed — it reads your environment
from src.profiles import detect_profile

profile = detect_profile()  # Returns opus-pro automatically
# -> 5,000 token memory budget (not 50,000)
# -> Thinking capped at 10,000 tokens
# -> Aggressive condensation every 3 observations
# -> Your Pro Plan lasts 3x longer

Use Case 5: Onboarding to a New Codebase

The problem: You join a new team and need to understand a 500-file codebase. You ask Claude Code to explore it. After reading 20 files, the context is full of raw file contents, and Claude can't synthesize what it learned.

How agent-memory solves it:

Session 1: "Help me understand this codebase"
  Claude reads package.json, README, key source files
  -> Each file read compressed into observations:
     [discovery] FastAPI backend with SQLAlchemy ORM
       - 3-layer architecture: routers/ -> services/ -> models/
       - PostgreSQL with Alembic migrations
     [discovery] React frontend with Redux state
       - Component library in src/ui/
       - API calls centralized in src/api/client.ts
     [discovery] Auth uses JWT with refresh tokens
       - Tokens stored in httpOnly cookies
       - 15-min access token, 7-day refresh

Session 2 (next day): "Add a new API endpoint for user preferences"
  -> SessionStart hook injects Session 1 observations
  -> Claude already knows: FastAPI + SQLAlchemy + JWT auth + 3-layer pattern
  -> Immediately creates: models/preferences.py, services/preferences.py,
     routers/preferences.py following the existing pattern
  -> No re-exploration needed

Without agent-memory: Re-read 10+ files every session to rebuild context.
With agent-memory: Instant recall of codebase architecture in ~2,000 tokens.


Use Case 6: Cost Monitoring Dashboard

The problem: You have no visibility into how many tokens your agents consume. You can't tell which agent is wasteful or whether your optimizations are working.

How agent-memory solves it:

from src.metrics.tracker import MetricsTracker
from src.memory.store import MemoryStore

memory = MemoryStore("./data/memory.db")
tracker = MetricsTracker(memory)
tracker.print_dashboard("my-project")

Output:

============================================================
MULTI-AGENT SYSTEM METRICS
============================================================
Total API calls:      47
Total tokens:         284,000
Cached tokens:        89,000
Cache hit rate:       31.3%
Compression ratio:    18.2:1
Token savings:        94%
Observations stored:  156
Avg latency:          3,200ms

Per-Agent Breakdown:
------------------------------------------------------------
Type            Calls    Cache%   Cost Ratio   Latency
coder           15       38.2%   0.68x       4,100ms
researcher      12       29.1%   0.74x       2,800ms
reviewer        10       33.5%   0.71x       2,400ms
planner          5       18.7%   0.85x       3,900ms
summarizer       5       44.2%   0.62x       1,200ms
============================================================

Or use the MCP tool directly in Claude Code:

You: "How much have we saved on tokens?"
Claude: [uses memory_stats tool]
  -> "156 observations, 18.2:1 compression ratio, 94% token savings.
      Estimated savings: ~267,000 tokens ($1.34 at Sonnet pricing)."

Use Case 7: Local LLM Development (Zero API Cost)

The problem: You want to develop and test multi-agent workflows but don't want to burn API credits during prototyping.

How agent-memory solves it:

# Start Ollama
ollama serve

# Run the full 3-agent pipeline locally
python run_local.py --model phi4:latest

AGENT: RESEARCHER
  Task: Research token optimization strategies...
  Generating... done (114.3s)
  Observations: 1 [discovery] Three Core Optimization Strategies

AGENT: CODER  
  Task: Implement context builder...
  Context injected: researcher's findings (automatic)
  Generating... done (154.0s)
  Observations: 1 [feature] build_context Function

AGENT: REVIEWER
  Task: Review the implementation...
  Context injected: researcher + coder findings (automatic)
  Generating... done (90.2s)
  Observations: 1 [discovery] Greedy Algorithm Review

Token Economics:
  Compression: 3.0:1
  Savings: 66%
  Cost: $0.00

The entire memory pipeline (compression, shared memory, context injection) works identically with local models. When you're ready, switch to Anthropic with zero code changes:

# Local development
agent = LocalLLMAgent(config, memory, project, base_url="http://localhost:11434/v1")

# Production — just swap the class
agent = Agent(config, memory, project)  # Uses Anthropic API

Use Case 8: Plugin Auto-Adapts to Your Plan

The problem: You shouldn't have to configure anything. The plugin should just work optimally whether you're on Pro, Max, or using a local LLM.

How agent-memory solves it:

The plugin auto-detects your model and plan at startup:

Your Setup                  Auto-Detected Profile   Memory Budget   Thinking Cap
Pro + Sonnet (default)      sonnet-pro              8,000 tokens    none
Pro + Opus (/extra-usage)   opus-pro                5,000 tokens    10,000
Max + Sonnet                sonnet-max              16,000 tokens   none
Max + Opus                  opus-max                50,000 tokens   none
API key (direct)            sonnet-api              8,000 tokens    none
Ollama / LM Studio          local                   1,500 tokens    none

No configuration files. No environment variables to set. It reads CLAUDE_MODEL, CLAUDE_CODE_MAX_PLAN, and ANTHROPIC_API_KEY from your environment and picks the optimal profile.

Override if needed:

# Force Max Plan profile (if auto-detection gets it wrong)
export AGENT_MEMORY_PLAN=max
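
For intuition, the sketch below is one way to map those environment variables to the profile names in the table above. The real detect_profile() in src/profiles.py may weigh more signals:

import os

def detect_profile() -> str:
    # Manual override wins, otherwise infer the plan from Claude Code's environment
    plan = os.getenv("AGENT_MEMORY_PLAN") or ("max" if os.getenv("CLAUDE_CODE_MAX_PLAN") else "pro")
    model = os.getenv("CLAUDE_MODEL", "")

    if not model and not os.getenv("ANTHROPIC_API_KEY"):
        return "local"       # no Anthropic signal: assume Ollama / LM Studio
    if not model:
        return "sonnet-api"  # direct API key without Claude Code
    family = "opus" if "opus" in model.lower() else "sonnet"
    return f"{family}-{plan}"  # e.g. "opus-pro", "sonnet-max"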

Architecture

src/
  memory/
    store.py            # SQLite + ChromaDB + FTS5 memory layer
    context_builder.py  # Token-budgeted context injection
    condenser.py        # Periodic summarization pipeline
  agents/
    base.py             # Base agent with memory integration
    local_llm.py        # Ollama/LM Studio adapter
    registry.py         # Agent type definitions
  cache/
    prompt_cache.py     # Anthropic prompt caching wrapper
    rate_limiter.py     # Token bucket RPM/TPM limiter
  orchestrator/
    router.py           # DAG-based multi-agent task router
  metrics/
    tracker.py          # Token/cache/latency tracking
plugin/
  mcp_server.py         # MCP server for Claude Code
  hooks/
    post_tool_use.py    # Auto-compress tool output
    session_start.py    # Inject memory at session start
  plugin.json           # Claude Code plugin manifest
tests/
  test_memory.py        # Memory store + search + condensation tests
  test_orchestrator.py  # DAG execution + cache structure tests

Key Features

Feature                                      Status
Observation compression (XML/auto-extract)   Working
SQLite + FTS5 search                         Working
ChromaDB semantic search (optional)          Working
Token-budgeted context injection             Working
Anthropic prompt caching                     Working
Rate limiter (RPM + TPM)                     Working
Periodic condensation                        Working
Local LLM support (Ollama/LM Studio)         Working
Multi-agent shared memory                    Working
DAG-based orchestrator                       Working
Metrics dashboard                            Working
Claude Code plugin (MCP + hooks)             Beta

Running the Demos

Dry-run validation (no API key needed)

python run_demo.py

Live with Anthropic API

export ANTHROPIC_API_KEY=sk-ant-...
python run_demo.py --live

Local LLM (Ollama)

ollama serve  # In another terminal
python run_local.py --model phi4:latest

Testing

pip install -e ".[dev]"
pytest tests/ -v

All 14 tests cover: memory CRUD, agent isolation, condensation, semantic search, token budgets, cache breakpoints, DAG execution, and metrics logging.


Configuration

Memory Budget

from agent_memory import ContextBudget

budget = ContextBudget(
    total=8000,           # Total token budget for context
    task_description=800, # Reserved for task text
    own_observations=4000,# Agent's own recent work
    cross_agent=2400,     # Other agents' relevant findings
    summaries=800,        # Condensed history
)

Rate Limiting

from agent_memory.cache import RateLimiter

limiter = RateLimiter(
    requests_per_minute=50,
    tokens_per_minute=80_000,
)
limiter.acquire_sync(estimated_tokens=4000)  # Blocks until slot available

Agent Types

Built-in agent types: researcher, coder, reviewer, summarizer, planner. To define a custom one:

config = AgentConfig(
    agent_type="my-agent",
    model="claude-sonnet-4-6-20250514",
    max_output_tokens=2000,
    system_prompt="You are a specialized agent for...",
)

Benchmarks

Measured on real workloads (not synthetic):

Metric                        Result
Token savings (compression)   66-94%
Compression ratio             3:1 to 74:1
Prompt cache hit rate         23-35%
Cache cost reduction          0.71x
Context budget utilization    43% avg
Cross-agent memory sharing    100% (all agents see the shared pool)

API Reference

See API_REFERENCE.md for the complete SDK documentation.


Contributing

  1. Fork the repo
  2. Create a feature branch
  3. Run tests: pytest tests/ -v
  4. Submit a PR

License

MIT - see LICENSE
