agent-memory

Save 60-90% on LLM token costs with intelligent memory compression for multi-agent systems.

agent-memory compresses raw LLM tool output into structured observations, shares context across agents via a memory bus, and injects only relevant memory into each prompt — keeping your token budget under control.


The Problem

Running 5+ concurrent LLM agents burns tokens fast:

  • Each agent re-reads the same files, re-discovers the same context
  • Raw tool output (file reads, command results) consumes thousands of tokens
  • No shared memory means redundant API calls across agents
  • You hit rate limits and token budgets within minutes

The Solution

agent-memory sits between your agents and their context window:

Raw Tool Output (5,000 tokens)
  -> Observation Compression (500 tokens)
    -> Shared Memory Bus (SQLite + FTS5)
      -> Budget-Controlled Context Injection (8,000 token cap)

Tested results: 66-94% token savings, 3-74x compression ratio.


Quick Start

As a Python SDK

pip install agent-memory

from agent_memory import MemoryStore, ContextBuilder, Observation

# Initialize
memory = MemoryStore("./my_project.db")

# Store a compressed observation
memory.store_observation(Observation(
    agent_id="researcher-1",
    project="my-app",
    title="Found pagination bug in /users endpoint",
    narrative="The API returns 500 when page > 100 due to missing LIMIT clause",
    facts=["Max page size is 100", "No server-side validation"],
    concepts=["api", "bug", "pagination"],
))

# Build context for another agent (token-budgeted)
builder = ContextBuilder(memory)
context = builder.build(
    project="my-app",
    agent_id="coder-1",
    task_description="Fix the pagination bug",
)
# -> Returns compressed context within 8000 token budget
# -> Includes researcher-1's findings automatically

As a Claude Code Plugin

Step 1: Install

# Add the marketplace
claude plugin marketplace add Keshab0310/agent-memory

# Install the plugin
claude plugin install agent-memory@keshab-plugins

Verify it's installed:

claude plugin list
# Should show: agent-memory@keshab-plugins  v0.1.0  ✔ enabled

Step 2: Use It (It's Automatic)

Once installed, the plugin works silently in the background with zero configuration:

What happens automatically:

  • Every time Claude reads a file, runs a command, or edits code, the PostToolUse hook compresses that tool output into a structured observation and stores it locally
  • Every time you start a new session (or resume one), the SessionStart hook injects relevant past observations into Claude's context
  • Your plan is auto-detected (Pro/Max/API) and memory budgets adjust accordingly

You don't need to change how you use Claude Code. Just work normally — the plugin handles compression and recall behind the scenes.
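
To make the flow concrete, here is a minimal sketch of what a PostToolUse hook can do. It is not the plugin's actual hooks/post_tool_use.py, and the payload field names (tool_name, tool_response) are assumptions about the JSON Claude Code pipes to the hook over stdin:

import json
import sys

from agent_memory import MemoryStore, Observation

# Hook payload arrives as JSON on stdin (field names assumed for illustration)
event = json.load(sys.stdin)
raw_output = str(event.get("tool_response", ""))

memory = MemoryStore("./memory.db")  # hypothetical local path
memory.store_observation(Observation(
    agent_id="claude-code",
    project="my-app",
    title=f"{event.get('tool_name', 'tool')} output",
    narrative=raw_output[:500],      # naive truncation stands in for real compression
    facts=[],
    concepts=[event.get("tool_name", "tool").lower()],
))

The real hook replaces the naive truncation with the observation compressor described under "How It Works" below.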

Step 3: Search Past Work (MCP Tools)

The plugin exposes 4 MCP tools that Claude can use when you ask about past work:

You: "What did we find about the database schema yesterday?"
Claude: [uses memory_search tool] -> finds relevant observations
  -> "Yesterday we discovered the users table was missing an index
     on email, which caused the slow login query. We added a B-tree
     index and response time dropped from 2.3s to 45ms."
You: "How much have we saved on tokens?"
Claude: [uses memory_stats tool]
  -> "156 observations stored. Compression ratio: 18.2:1.
     Token savings: 94%. Estimated cost saved: $1.34."

The 4 tools available:

Tool            What It Does                                  When Claude Uses It
memory_search   Searches past observations by keyword/topic   When you ask "what did we find about X?"
memory_store    Manually stores an observation                When you say "remember this for later"
memory_stats    Shows the token savings dashboard             When you ask about costs or savings
memory_context  Builds a context summary for a task           When starting complex multi-step work
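
Under the hood these are ordinary MCP tools. As a conceptual sketch (not the project's actual plugin/mcp_server.py), memory_search could be exposed with the official mcp Python SDK roughly like this; the store's search method name is an assumption:

from mcp.server.fastmcp import FastMCP

from agent_memory import MemoryStore

mcp = FastMCP("agent-memory")
memory = MemoryStore("./memory.db")  # hypothetical path

@mcp.tool()
def memory_search(query: str, limit: int = 10) -> str:
    """Search past observations by keyword or topic."""
    results = memory.search(query, limit=limit)  # 'search' is assumed; the real store may differ
    return "\n".join(str(r) for r in results)

if __name__ == "__main__":
    mcp.run()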

Step 4: Check It's Working

After using Claude Code for a few tasks with the plugin installed:

You: "Show me my memory stats"
Claude: [uses memory_stats]
  Total observations: 23
  Compression ratio: 12.4:1
  Token savings: 91%

If you see observations being stored and savings > 0%, the plugin is working.

Uninstall

claude plugin uninstall agent-memory@keshab-plugins
claude plugin marketplace remove keshab-plugins

Plugin Data Location

All data is stored locally:

  • SQLite database: ~/.claude/plugins/data/agent-memory/memory.db
  • No data leaves your machine — see PRIVACY.md

Delete all plugin data:

rm -rf ~/.claude/plugins/data/agent-memory/

With Local LLMs (Ollama, LM Studio)

from agent_memory import LocalLLMAgent, AgentConfig, MemoryStore

memory = MemoryStore("./local.db")
agent = LocalLLMAgent(
    config=AgentConfig(agent_type="researcher", model="phi4:latest"),
    memory=memory,
    project="my-app",
    base_url="http://localhost:11434/v1",  # Ollama
)
result = agent.execute("What are the key design patterns in this codebase?")

How It Works

1. Observation Compression

Raw tool output (file reads, command results, API responses) gets compressed into structured observations:

[discovery] Found pagination bug in /users endpoint
  API returns 500 when page > 100 due to missing LIMIT clause
  - Max page size is 100
  - No server-side validation

A 5,000-token file read becomes a 200-token observation. That's a 25x compression ratio.
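
As a rough sketch of how the SDK's Observation fields map to that rendered block (format_observation is illustrative rather than a published SDK function, and attribute access on Observation is assumed):

from agent_memory import Observation

def format_observation(obs: Observation, kind: str = "discovery") -> str:
    # Render an observation in the compact bracketed form shown above
    lines = [f"[{kind}] {obs.title}", f"  {obs.narrative}"]
    lines += [f"  - {fact}" for fact in obs.facts]
    return "\n".join(lines)

obs = Observation(
    agent_id="researcher-1",
    project="my-app",
    title="Found pagination bug in /users endpoint",
    narrative="API returns 500 when page > 100 due to missing LIMIT clause",
    facts=["Max page size is 100", "No server-side validation"],
    concepts=["api", "bug", "pagination"],
)
print(format_observation(obs))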

2. Shared Memory Bus

All agents write to and read from a shared memory store:

SQLite (structured queries) + FTS5 (full-text search)
  |
  +-- Optional: ChromaDB (semantic vector search)

Agent B sees what Agent A discovered — no re-querying needed.
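
A conceptual illustration of the storage layer (not the project's actual schema), showing why SQLite plus FTS5 makes cross-agent recall a single query:

import sqlite3

conn = sqlite3.connect("shared.db")
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS obs_fts USING fts5(agent_id, title, narrative)")

# Agent A writes a finding
conn.execute(
    "INSERT INTO obs_fts VALUES (?, ?, ?)",
    ("researcher-1", "Found pagination bug", "500 error when page > 100, missing LIMIT clause"),
)
conn.commit()

# Agent B recalls it later with a full-text query instead of re-reading the source
rows = conn.execute(
    "SELECT agent_id, title FROM obs_fts WHERE obs_fts MATCH ?", ("pagination",)
).fetchall()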

3. Token-Budgeted Context Injection

Before each agent call, the ContextBuilder assembles a minimal context window:

Budget adapts to your plan automatically:

Plan           Total Budget   Own Obs   Cross-Agent   Why
Pro + Sonnet   8,000          4,000     2,400         Capped usage — stay lean
Pro + Opus     5,000          2,500     1,500         Opus burns limits fast — ultra-lean
Max + Sonnet   16,000         8,000     5,000         Unlimited — go wider
Max + Opus     50,000         25,000    18,000        Unlimited + 1M window — go deep
Local LLM      1,500          700       300           Small context windows
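
The per-plan rows above map onto the SDK's ContextBudget (see Configuration below). For example, the Pro + Sonnet row corresponds roughly to:

from agent_memory import ContextBudget

# Illustrative split for the "Pro + Sonnet" profile; the plugin's internal
# split may differ slightly
pro_sonnet = ContextBudget(
    total=8000,             # total token cap for injected context
    task_description=800,   # reserved for the task text
    own_observations=4000,  # the agent's own recent work
    cross_agent=2400,       # other agents' relevant findings
    summaries=800,          # condensed history
)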

4. Prompt Caching (Anthropic)

Static content gets cache breakpoints for 90% input cost reduction:

System prompt     [CACHED - 10% cost]
Shared context    [CACHED - 10% cost]
Agent memory      [dynamic - full cost]
User message      [dynamic - full cost]
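
At the raw Anthropic API level, those breakpoints look roughly like the sketch below. agent-memory's prompt_cache.py wrapper is what places them for you, so this is only to show where the caching happens:

import anthropic

SYSTEM_PROMPT = "You are the coder agent..."            # static across calls
SHARED_CONTEXT = "Project conventions, architecture"    # static across calls
AGENT_MEMORY = "Relevant observations for this task"    # changes per call
USER_MESSAGE = "Fix the pagination bug"

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-5",  # any cache-capable model
    max_tokens=1024,
    system=[
        {"type": "text", "text": SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},       # cached: ~10% input cost on hits
        {"type": "text", "text": SHARED_CONTEXT,
         "cache_control": {"type": "ephemeral"}},       # cached
    ],
    messages=[
        {"role": "user", "content": AGENT_MEMORY + "\n\n" + USER_MESSAGE},  # dynamic: full cost
    ],
)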

Real-World Use Cases

Use Case 1: Multi-File Refactoring Without Context Loss

The problem: You ask Claude Code to refactor authentication across 15 files. By file 8, it's forgotten the patterns established in files 1-3. It re-reads them, burning tokens. By file 12, you hit context limits.

How agent-memory solves it:

File 1-3: Claude reads and refactors auth code
  -> PostToolUse hook compresses each file read into an observation:
     [change] Refactored auth.py — replaced session tokens with JWT
     - New pattern: verify_jwt() middleware on all protected routes
     - Removed: legacy session_store dependency
     
File 4-15: Claude continues refactoring
  -> SessionStart hook injects compressed observations from files 1-3
  -> Claude sees the patterns (150 tokens) instead of re-reading files (5,000 tokens)
  -> 97% token savings on context recall

Before agent-memory: 15 files x 5,000 tokens each = 75,000 tokens re-read.
After agent-memory: 15 observations x 150 tokens = 2,250 tokens. 97% savings.


Use Case 2: Debugging Across Sessions

The problem: Yesterday you spent 2 hours debugging a race condition. You found the root cause, tried 3 approaches, and fixed it. Today, a related bug appears. Claude Code has zero memory of yesterday's work. You start from scratch.

How agent-memory solves it:

Yesterday's session (auto-captured by hooks):
  [discovery] Race condition in WebSocket handler
    - write_lock missing on shared_state dict
    - Reproduced with 5+ concurrent connections
    - Tried: asyncio.Lock (failed — wrong event loop)
    - Tried: threading.Lock (worked but caused deadlock in tests)
    - Fixed: threading.RLock with 5s timeout

Today's session:
  You: "There's another threading issue in the notification service"
  -> SessionStart hook injects yesterday's context automatically
  -> Claude sees the RLock pattern that worked
  -> Skips the 2 failed approaches
  -> Applies the proven fix in one shot

Without agent-memory: 45 minutes re-investigating the same threading patterns.
With agent-memory: 5 minutes — Claude already knows what works in your codebase.


Use Case 3: Multi-Agent Research Pipeline

The problem: You spawn 5 agents to research, code, review, test, and document a feature. Each agent works in isolation. The coder doesn't know what the researcher found. The reviewer doesn't know what the coder tried and rejected.

How agent-memory solves it:

from src.profiles import detect_profile
from src.agents.base import Agent
from src.agents.registry import get_agent_config
from src.memory.store import MemoryStore

memory = MemoryStore("./shared.db")

# Agent 1: Researcher finds the best approach
researcher = Agent(get_agent_config("researcher"), memory, "my-project")
researcher.execute("Research OAuth2 vs API keys for our B2B API")
# -> Stores: [discovery] OAuth2 better for B2B — supports scopes, token rotation

# Agent 2: Coder sees the researcher's findings automatically
coder = Agent(get_agent_config("coder"), memory, "my-project")
coder.execute("Implement the auth system")
# -> ContextBuilder injects: "Researcher found OAuth2 is better for B2B..."
# -> Coder builds OAuth2 without asking "which auth method?"

# Agent 3: Reviewer sees BOTH researcher reasoning AND coder implementation
reviewer = Agent(get_agent_config("reviewer"), memory, "my-project")
reviewer.execute("Review the auth implementation")
# -> Sees researcher's OAuth2 rationale + coder's implementation decisions
# -> Reviews against the original requirements, not just code syntax

Without shared memory: Reviewer says "why not API keys?" — coder explains — wastes 2 round trips.
With shared memory: Reviewer already has context. Zero redundant conversation.


Use Case 4: Pro Plan Token Budget Optimization

The problem: You're on the Pro Plan using Opus 4.6. Adaptive thinking on "High" burns through your daily limit in 10 messages. Each message costs ~$0.50+ in tokens because the context window fills with raw tool output.

How agent-memory solves it:

Without agent-memory (Opus on Pro):
  Message 1: Read 3 files (15,000 tokens) + Opus thinking (25,000 tokens) = 40,000 tokens
  Message 2: Re-reads same files + new query = 45,000 tokens  
  Message 3: Context growing, Opus thinking harder = 60,000 tokens
  Total after 3 messages: 145,000 tokens. Daily limit: approaching fast.

With agent-memory (auto-detects opus-pro profile):
  Message 1: Read 3 files -> compressed to 3 observations (450 tokens)
             Opus thinking capped at 10,000 tokens = 25,000 total
  Message 2: Observations injected (450 tokens, not 15,000)
             New query + thinking = 18,000 total
  Message 3: 5,000 token memory budget, lean injection = 20,000 total
  Total after 3 messages: 63,000 tokens. 57% savings.

The plugin auto-detects your plan:

# No configuration needed — it reads your environment
from src.profiles import detect_profile

profile = detect_profile()  # Returns opus-pro automatically
# -> 5,000 token memory budget (not 50,000)
# -> Thinking capped at 10,000 tokens
# -> Aggressive condensation every 3 observations
# -> Your Pro Plan lasts 3x longer

Use Case 5: Onboarding to a New Codebase

The problem: You join a new team and need to understand a 500-file codebase. You ask Claude Code to explore it. After reading 20 files, the context is full of raw file contents, and Claude can't synthesize what it learned.

How agent-memory solves it:

Session 1: "Help me understand this codebase"
  Claude reads package.json, README, key source files
  -> Each file read compressed into observations:
     [discovery] FastAPI backend with SQLAlchemy ORM
       - 3-layer architecture: routers/ -> services/ -> models/
       - PostgreSQL with Alembic migrations
     [discovery] React frontend with Redux state
       - Component library in src/ui/
       - API calls centralized in src/api/client.ts
     [discovery] Auth uses JWT with refresh tokens
       - Tokens stored in httpOnly cookies
       - 15-min access token, 7-day refresh

Session 2 (next day): "Add a new API endpoint for user preferences"
  -> SessionStart hook injects Session 1 observations
  -> Claude already knows: FastAPI + SQLAlchemy + JWT auth + 3-layer pattern
  -> Immediately creates: models/preferences.py, services/preferences.py,
     routers/preferences.py following the existing pattern
  -> No re-exploration needed

Without agent-memory: Re-read 10+ files every session to rebuild context.
With agent-memory: Instant recall of codebase architecture in ~2,000 tokens.


Use Case 6: Cost Monitoring Dashboard

The problem: You have no visibility into how many tokens your agents consume. You can't tell which agent is wasteful or whether your optimizations are working.

How agent-memory solves it:

from src.metrics.tracker import MetricsTracker
from src.memory.store import MemoryStore

memory = MemoryStore("./data/memory.db")
tracker = MetricsTracker(memory)
tracker.print_dashboard("my-project")

Output:

============================================================
MULTI-AGENT SYSTEM METRICS
============================================================
Total API calls:      47
Total tokens:         284,000
Cached tokens:        89,000
Cache hit rate:       31.3%
Compression ratio:    18.2:1
Token savings:        94%
Observations stored:  156
Avg latency:          3,200ms

Per-Agent Breakdown:
------------------------------------------------------------
Type            Calls    Cache%   Cost Ratio   Latency
coder           15       38.2%   0.68x       4,100ms
researcher      12       29.1%   0.74x       2,800ms
reviewer        10       33.5%   0.71x       2,400ms
planner          5       18.7%   0.85x       3,900ms
summarizer       5       44.2%   0.62x       1,200ms
============================================================

Or use the MCP tool directly in Claude Code:

You: "How much have we saved on tokens?"
Claude: [uses memory_stats tool]
  -> "156 observations, 18.2:1 compression ratio, 94% token savings.
      Estimated savings: ~267,000 tokens ($1.34 at Sonnet pricing)."

Use Case 7: Local LLM Development (Zero API Cost)

The problem: You want to develop and test multi-agent workflows but don't want to burn API credits during prototyping.

How agent-memory solves it:

# Start Ollama
ollama serve

# Run the full 3-agent pipeline locally
python run_local.py --model phi4:latest

AGENT: RESEARCHER
  Task: Research token optimization strategies...
  Generating... done (114.3s)
  Observations: 1 [discovery] Three Core Optimization Strategies

AGENT: CODER  
  Task: Implement context builder...
  Context injected: researcher's findings (automatic)
  Generating... done (154.0s)
  Observations: 1 [feature] build_context Function

AGENT: REVIEWER
  Task: Review the implementation...
  Context injected: researcher + coder findings (automatic)
  Generating... done (90.2s)
  Observations: 1 [discovery] Greedy Algorithm Review

Token Economics:
  Compression: 3.0:1
  Savings: 66%
  Cost: $0.00

The entire memory pipeline (compression, shared memory, context injection) works identically with local models. When you're ready, switch to Anthropic with zero code changes:

# Local development
agent = LocalLLMAgent(config, memory, project, base_url="http://localhost:11434/v1")

# Production — just swap the class
agent = Agent(config, memory, project)  # Uses Anthropic API

Use Case 8: Plugin Auto-Adapts to Your Plan

The problem: You shouldn't have to configure anything. The plugin should just work optimally whether you're on Pro, Max, or using a local LLM.

How agent-memory solves it:

The plugin auto-detects your model and plan at startup:

Your Setup                  Auto-Detected Profile   Memory Budget   Thinking Cap
Pro + Sonnet (default)      sonnet-pro              8,000 tokens    none
Pro + Opus (/extra-usage)   opus-pro                5,000 tokens    10,000
Max + Sonnet                sonnet-max              16,000 tokens   none
Max + Opus                  opus-max                50,000 tokens   none
API key (direct)            sonnet-api              8,000 tokens    none
Ollama / LM Studio          local                   1,500 tokens    none

No configuration files. No environment variables to set. It reads CLAUDE_MODEL, CLAUDE_CODE_MAX_PLAN, and ANTHROPIC_API_KEY from your environment and picks the optimal profile.

Override if needed:

# Force Max Plan profile (if auto-detection gets it wrong)
export AGENT_MEMORY_PLAN=max
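
For intuition, the sketch below is one way to map those environment variables to the profile names in the table above. The real detect_profile() in src/profiles.py may weigh more signals:

import os

def detect_profile() -> str:
    # Manual override wins, otherwise infer the plan from Claude Code's environment
    plan = os.getenv("AGENT_MEMORY_PLAN") or ("max" if os.getenv("CLAUDE_CODE_MAX_PLAN") else "pro")
    model = os.getenv("CLAUDE_MODEL", "")

    if not model and not os.getenv("ANTHROPIC_API_KEY"):
        return "local"       # no Anthropic signal: assume Ollama / LM Studio
    if not model:
        return "sonnet-api"  # direct API key without Claude Code
    family = "opus" if "opus" in model.lower() else "sonnet"
    return f"{family}-{plan}"  # e.g. "opus-pro", "sonnet-max"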

Architecture

src/
  memory/
    store.py            # SQLite + ChromaDB + FTS5 memory layer
    context_builder.py  # Token-budgeted context injection
    condenser.py        # Periodic summarization pipeline
  agents/
    base.py             # Base agent with memory integration
    local_llm.py        # Ollama/LM Studio adapter
    registry.py         # Agent type definitions
  cache/
    prompt_cache.py     # Anthropic prompt caching wrapper
    rate_limiter.py     # Token bucket RPM/TPM limiter
  orchestrator/
    router.py           # DAG-based multi-agent task router
  metrics/
    tracker.py          # Token/cache/latency tracking
plugin/
  mcp_server.py         # MCP server for Claude Code
  hooks/
    post_tool_use.py    # Auto-compress tool output
    session_start.py    # Inject memory at session start
  plugin.json           # Claude Code plugin manifest
tests/
  test_memory.py        # Memory store + search + condensation tests
  test_orchestrator.py  # DAG execution + cache structure tests

Key Features

Feature                                      Status
Observation compression (XML/auto-extract)   Working
SQLite + FTS5 search                         Working
ChromaDB semantic search (optional)          Working
Token-budgeted context injection             Working
Anthropic prompt caching                     Working
Rate limiter (RPM + TPM)                     Working
Periodic condensation                        Working
Local LLM support (Ollama/LM Studio)         Working
Multi-agent shared memory                    Working
DAG-based orchestrator                       Working
Metrics dashboard                            Working
Claude Code plugin (MCP + hooks)             Beta

Running the Demos

Dry-run validation (no API key needed)

python run_demo.py

Live with Anthropic API

export ANTHROPIC_API_KEY=sk-ant-...
python run_demo.py --live

Local LLM (Ollama)

ollama serve  # In another terminal
python run_local.py --model phi4:latest

Testing

pip install -e ".[dev]"
pytest tests/ -v

All 14 tests cover: memory CRUD, agent isolation, condensation, semantic search, token budgets, cache breakpoints, DAG execution, and metrics logging.


Configuration

Memory Budget

from agent_memory import ContextBudget

budget = ContextBudget(
    total=8000,           # Total token budget for context
    task_description=800, # Reserved for task text
    own_observations=4000,# Agent's own recent work
    cross_agent=2400,     # Other agents' relevant findings
    summaries=800,        # Condensed history
)

Rate Limiting

from agent_memory.cache import RateLimiter

limiter = RateLimiter(
    requests_per_minute=50,
    tokens_per_minute=80_000,
)
limiter.acquire_sync(estimated_tokens=4000)  # Blocks until slot available

Agent Types

Built-in agent types: researcher, coder, reviewer, summarizer, planner. To define a custom one:

config = AgentConfig(
    agent_type="my-agent",
    model="claude-sonnet-4-6-20250514",
    max_output_tokens=2000,
    system_prompt="You are a specialized agent for...",
)

Benchmarks

Measured on real workloads (not synthetic):

Metric                        Result
Token savings (compression)   66-94%
Compression ratio             3:1 to 74:1
Prompt cache hit rate         23-35%
Cache cost reduction          0.71x
Context budget utilization    43% avg
Cross-agent memory sharing    100% (all agents see the shared pool)

API Reference

See API_REFERENCE.md for the complete SDK documentation.


Contributing

  1. Fork the repo
  2. Create a feature branch
  3. Run tests: pytest tests/ -v
  4. Submit a PR

License

MIT - see LICENSE
