Save 60-90% on LLM token costs with intelligent memory compression for multi-agent systems.
agent-memory compresses raw LLM tool output into structured observations, shares context across agents via a memory bus, and injects only relevant memory into each prompt — keeping your token budget under control.
Running 5+ concurrent LLM agents burns tokens fast:
- Each agent re-reads the same files, re-discovers the same context
- Raw tool output (file reads, command results) consumes thousands of tokens
- No shared memory means redundant API calls across agents
- You hit rate limits and token budgets within minutes
agent-memory sits between your agents and their context window:
Raw Tool Output (5,000 tokens)
-> Observation Compression (500 tokens)
-> Shared Memory Bus (SQLite + FTS5)
-> Budget-Controlled Context Injection (8,000 token cap)
Tested results: 66-94% token savings, 3-74x compression ratio.
pip install agent-memory

from agent_memory import MemoryStore, ContextBuilder, Observation
# Initialize
memory = MemoryStore("./my_project.db")
# Store a compressed observation
memory.store_observation(Observation(
    agent_id="researcher-1",
    project="my-app",
    title="Found pagination bug in /users endpoint",
    narrative="The API returns 500 when page > 100 due to missing LIMIT clause",
    facts=["Max page size is 100", "No server-side validation"],
    concepts=["api", "bug", "pagination"],
))

# Build context for another agent (token-budgeted)
builder = ContextBuilder(memory)
context = builder.build(
    project="my-app",
    agent_id="coder-1",
    task_description="Fix the pagination bug",
)
# -> Returns compressed context within 8000 token budget
# -> Includes researcher-1's findings automatically

# Add the marketplace
claude plugin marketplace add Keshab0310/agent-memory
# Install the plugin
claude plugin install agent-memory@keshab-plugins

Verify it's installed:
claude plugin list
# Should show: agent-memory@keshab-plugins v0.1.0 ✔ enabled

Once installed, the plugin works silently in the background with zero configuration:
What happens automatically:
- Every time Claude reads a file, runs a command, or edits code, the PostToolUse hook compresses that tool output into a structured observation and stores it locally
- Every time you start a new session (or resume one), the SessionStart hook injects relevant past observations into Claude's context
- Your plan is auto-detected (Pro/Max/API) and memory budgets adjust accordingly
You don't need to change how you use Claude Code. Just work normally — the plugin handles compression and recall behind the scenes.
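For a sense of what happens inside the PostToolUse hook, here is a minimal sketch. The stdin JSON fields follow Claude Code's hook contract, but the compression step and the storage-path handling are illustrative assumptions, not the plugin's actual code:

```python
# Hypothetical PostToolUse hook sketch: Claude Code pipes the tool event to
# the hook as JSON on stdin; we compress it and store it locally.
import json
import os
import sys

from agent_memory import MemoryStore, Observation

DB_PATH = os.path.expanduser("~/.claude/plugins/data/agent-memory/memory.db")

def main() -> None:
    event = json.load(sys.stdin)               # hook payload from Claude Code
    raw = str(event.get("tool_response", ""))  # raw tool output (often huge)
    lines = [ln for ln in raw.splitlines() if ln.strip()]

    # Toy compression: the real plugin does structured XML/auto-extraction.
    MemoryStore(DB_PATH).store_observation(Observation(
        agent_id="claude-code",
        project=os.path.basename(os.getcwd()),  # assume project = cwd name
        title=f"{event.get('tool_name', 'tool')}: {lines[0][:60] if lines else 'empty'}",
        narrative=" ".join(lines)[:400],        # tight summary budget
        facts=[],
        concepts=[],
    ))

if __name__ == "__main__":
    main()
```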
The plugin exposes 4 MCP tools that Claude can use when you ask about past work:
You: "What did we find about the database schema yesterday?"
Claude: [uses memory_search tool] -> finds relevant observations
-> "Yesterday we discovered the users table was missing an index
on email, which caused the slow login query. We added a B-tree
index and response time dropped from 2.3s to 45ms."
You: "How much have we saved on tokens?"
Claude: [uses memory_stats tool]
-> "156 observations stored. Compression ratio: 18.2:1.
Token savings: 94%. Estimated cost saved: $1.34."
The 4 tools available:
| Tool | What It Does | When Claude Uses It |
|---|---|---|
| `memory_search` | Searches past observations by keyword/topic | When you ask "what did we find about X?" |
| `memory_store` | Manually stores an observation | When you say "remember this for later" |
| `memory_stats` | Shows token savings dashboard | When you ask about costs or savings |
| `memory_context` | Builds a context summary for a task | When starting complex multi-step work |
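These tools live in plugin/mcp_server.py. As a rough sketch of how such a server can expose one of them with the official MCP Python SDK's FastMCP helper (the MemoryStore.search method and the return formatting are assumptions, not the plugin's actual internals):

```python
# Hypothetical MCP server sketch exposing memory_search over stdio.
from mcp.server.fastmcp import FastMCP

from agent_memory import MemoryStore

mcp = FastMCP("agent-memory")
memory = MemoryStore("./my_project.db")

@mcp.tool()
def memory_search(query: str, project: str) -> str:
    """Search past observations by keyword or topic."""
    results = memory.search(query, project=project)  # assumed method
    return "\n".join(str(obs) for obs in results)

if __name__ == "__main__":
    mcp.run()  # stdio transport, as Claude Code expects
```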
After using Claude Code for a few tasks with the plugin installed:
You: "Show me my memory stats"
Claude: [uses memory_stats]
Total observations: 23
Compression ratio: 12.4:1
Token savings: 91%
If you see observations being stored and savings > 0%, the plugin is working.
claude plugin uninstall agent-memory@keshab-plugins
claude plugin marketplace remove keshab-plugins

All data is stored locally:
- SQLite database: ~/.claude/plugins/data/agent-memory/memory.db
- No data leaves your machine — see PRIVACY.md

Delete all plugin data:

rm -rf ~/.claude/plugins/data/agent-memory/

from agent_memory import LocalLLMAgent, AgentConfig, MemoryStore
memory = MemoryStore("./local.db")
agent = LocalLLMAgent(
    config=AgentConfig(agent_type="researcher", model="phi4:latest"),
    memory=memory,
    project="my-app",
    base_url="http://localhost:11434/v1",  # Ollama
)
result = agent.execute("What are the key design patterns in this codebase?")

Raw tool output (file reads, command results, API responses) gets compressed into structured observations:
[discovery] Found pagination bug in /users endpoint
API returns 500 when page > 100 due to missing LIMIT clause
- Max page size is 100
- No server-side validation
A 5,000-token file read becomes a 200-token observation. That's a 25x compression ratio.
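A minimal sketch of that compression step — the heuristics below are illustrative assumptions, not the library's actual XML/auto-extract compressor:

```python
# Toy compressor: reduce raw tool output to a structured Observation.
from agent_memory import Observation

def compress(agent_id: str, project: str, raw: str) -> Observation:
    lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
    return Observation(
        agent_id=agent_id,
        project=project,
        title=lines[0][:80] if lines else "empty output",
        narrative=" ".join(lines)[:400],                      # tight summary budget
        facts=[ln for ln in lines[1:] if len(ln) < 100][:5],  # a few short facts
        concepts=[],                                          # real extractor tags topics
    )
```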
All agents write to and read from a shared memory store:
SQLite (structured queries) + FTS5 (full-text search)
|
+-- Optional: ChromaDB (semantic vector search)
Agent B sees what Agent A discovered — no re-querying needed.
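In SQLite terms, that lookup can be as simple as the sketch below. The table and column names are assumptions about the schema, shown only to illustrate the FTS5 pattern:

```python
# Hypothetical FTS5 search over the shared memory bus.
import sqlite3

def search_observations(db_path: str, query: str, limit: int = 5) -> list[tuple]:
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            """
            SELECT o.agent_id, o.title, o.narrative
            FROM observations_fts f
            JOIN observations o ON o.id = f.rowid
            WHERE observations_fts MATCH ?  -- FTS5 full-text match
            ORDER BY rank                   -- FTS5 relevance ranking
            LIMIT ?
            """,
            (query, limit),
        ).fetchall()
    finally:
        conn.close()
```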
Before each agent call, the ContextBuilder assembles a minimal context window:
Budget adapts to your plan automatically:
| Plan | Total Budget | Own Obs | Cross-Agent | Why |
|---|---|---|---|---|
| Pro + Sonnet | 8,000 | 4,000 | 2,400 | Capped usage — stay lean |
| Pro + Opus | 5,000 | 2,500 | 1,500 | Opus burns limits fast — ultra-lean |
| Max + Sonnet | 16,000 | 8,000 | 5,000 | Unlimited — go wider |
| Max + Opus | 50,000 | 25,000 | 18,000 | Unlimited + 1M window — go deep |
| Local LLM | 1,500 | 700 | 300 | Small context windows |
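The slicing itself amounts to greedy packing within per-slice caps. A simplified sketch (the token estimator and slice handling are assumptions, not the actual ContextBuilder):

```python
# Greedy budget packer: fill each slice until its cap would be exceeded.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough 4-characters-per-token heuristic

def pack(items: list[str], budget: int) -> list[str]:
    picked, used = [], 0
    for text in items:  # assumed pre-sorted by relevance
        cost = estimate_tokens(text)
        if used + cost <= budget:  # skip anything that would blow the cap
            picked.append(text)
            used += cost
    return picked

# Example with the Pro + Sonnet slices from the table above
own = ["[discovery] Found pagination bug in /users endpoint ..."]
cross = ["[change] Refactored auth.py — replaced session tokens with JWT ..."]
context = pack(own, budget=4000) + pack(cross, budget=2400)
```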
Static content gets cache breakpoints for 90% input cost reduction:
System prompt [CACHED - 10% cost]
Shared context [CACHED - 10% cost]
Agent memory [dynamic - full cost]
User message [dynamic - full cost]
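With the Anthropic Python SDK, those breakpoints map to cache_control markers on the static blocks. A minimal sketch (the model name and prompt text are placeholders):

```python
# Prompt-caching sketch: static blocks carry cache_control breakpoints,
# dynamic content (agent memory, user message) stays at full input price.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a coding agent...",     # static system prompt
            "cache_control": {"type": "ephemeral"},  # CACHED - ~10% cost on hits
        },
        {
            "type": "text",
            "text": "Shared project context...",     # static shared context
            "cache_control": {"type": "ephemeral"},  # CACHED - ~10% cost on hits
        },
    ],
    messages=[
        {"role": "user", "content": "Agent memory + user message here"},  # dynamic
    ],
)
```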
The problem: You ask Claude Code to refactor authentication across 15 files. By file 8, it's forgotten the patterns established in files 1-3. It re-reads them, burning tokens. By file 12, you hit context limits.
How agent-memory solves it:
Files 1-3: Claude reads and refactors auth code
-> PostToolUse hook compresses each file read into an observation:
[change] Refactored auth.py — replaced session tokens with JWT
- New pattern: verify_jwt() middleware on all protected routes
- Removed: legacy session_store dependency
Files 4-15: Claude continues refactoring
-> SessionStart hook injects compressed observations from files 1-3
-> Claude sees the patterns (150 tokens) instead of re-reading files (5,000 tokens)
-> 97% token savings on context recall
Before agent-memory: 15 files x 5,000 tokens each = 75,000 tokens re-read. After agent-memory: 15 observations x 150 tokens = 2,250 tokens. 97% savings.
The problem: Yesterday you spent 2 hours debugging a race condition. You found the root cause, tried 3 approaches, and fixed it. Today, a related bug appears. Claude Code has zero memory of yesterday's work. You start from scratch.
How agent-memory solves it:
Yesterday's session (auto-captured by hooks):
[discovery] Race condition in WebSocket handler
- write_lock missing on shared_state dict
- Reproduced with 5+ concurrent connections
- Tried: asyncio.Lock (failed — wrong event loop)
- Tried: threading.Lock (worked but caused deadlock in tests)
- Fixed: threading.RLock with 5s timeout
Today's session:
You: "There's another threading issue in the notification service"
-> SessionStart hook injects yesterday's context automatically
-> Claude sees the RLock pattern that worked
-> Skips the 2 failed approaches
-> Applies the proven fix in one shot
Without agent-memory: 45 minutes re-investigating the same threading patterns. With agent-memory: 5 minutes — Claude already knows what works in your codebase.
The problem: You spawn 5 agents to research, code, review, test, and document a feature. Each agent works in isolation. The coder doesn't know what the researcher found. The reviewer doesn't know what the coder tried and rejected.
How agent-memory solves it:
from src.agents.base import Agent
from src.agents.registry import get_agent_config
from src.memory.store import MemoryStore
memory = MemoryStore("./shared.db")
# Agent 1: Researcher finds the best approach
researcher = Agent(get_agent_config("researcher"), memory, "my-project")
researcher.execute("Research OAuth2 vs API keys for our B2B API")
# -> Stores: [discovery] OAuth2 better for B2B — supports scopes, token rotation
# Agent 2: Coder sees the researcher's findings automatically
coder = Agent(get_agent_config("coder"), memory, "my-project")
coder.execute("Implement the auth system")
# -> ContextBuilder injects: "Researcher found OAuth2 is better for B2B..."
# -> Coder builds OAuth2 without asking "which auth method?"
# Agent 3: Reviewer sees BOTH researcher reasoning AND coder implementation
reviewer = Agent(get_agent_config("reviewer"), memory, "my-project")
reviewer.execute("Review the auth implementation")
# -> Sees researcher's OAuth2 rationale + coder's implementation decisions
# -> Reviews against the original requirements, not just code syntax

Without shared memory: Reviewer says "why not API keys?" — coder explains — wastes 2 round trips. With shared memory: Reviewer already has context. Zero redundant conversation.
The problem: You're on the Pro Plan using Opus 4.6. Adaptive thinking on "High" burns through your daily limit in 10 messages. Each message costs ~$0.50+ in tokens because the context window fills with raw tool output.
How agent-memory solves it:
Without agent-memory (Opus on Pro):
Message 1: Read 3 files (15,000 tokens) + Opus thinking (25,000 tokens) = 40,000 tokens
Message 2: Re-reads same files + new query = 45,000 tokens
Message 3: Context growing, Opus thinking harder = 60,000 tokens
Total after 3 messages: 145,000 tokens. Daily limit: approaching fast.
With agent-memory (auto-detects opus-pro profile):
Message 1: Read 3 files -> compressed to 3 observations (450 tokens)
Opus thinking capped at 10,000 tokens = 25,000 total
Message 2: Observations injected (450 tokens, not 15,000)
New query + thinking = 18,000 total
Message 3: 5,000 token memory budget, lean injection = 20,000 total
Total after 3 messages: 63,000 tokens. 57% savings.
The plugin auto-detects your plan:
# No configuration needed — it reads your environment
from src.profiles import detect_profile
profile = detect_profile() # Returns opus-pro automatically
# -> 5,000 token memory budget (not 50,000)
# -> Thinking capped at 10,000 tokens
# -> Aggressive condensation every 3 observations
# -> Your Pro Plan lasts 3x longer

The problem: You join a new team and need to understand a 500-file codebase. You ask Claude Code to explore it. After reading 20 files, the context is full of raw file contents, and Claude can't synthesize what it learned.
How agent-memory solves it:
Session 1: "Help me understand this codebase"
Claude reads package.json, README, key source files
-> Each file read compressed into observations:
[discovery] FastAPI backend with SQLAlchemy ORM
- 3-layer architecture: routers/ -> services/ -> models/
- PostgreSQL with Alembic migrations
[discovery] React frontend with Redux state
- Component library in src/ui/
- API calls centralized in src/api/client.ts
[discovery] Auth uses JWT with refresh tokens
- Tokens stored in httpOnly cookies
- 15-min access token, 7-day refresh
Session 2 (next day): "Add a new API endpoint for user preferences"
-> SessionStart hook injects Session 1 observations
-> Claude already knows: FastAPI + SQLAlchemy + JWT auth + 3-layer pattern
-> Immediately creates: models/preferences.py, services/preferences.py,
routers/preferences.py following the existing pattern
-> No re-exploration needed
Without agent-memory: Re-read 10+ files every session to rebuild context. With agent-memory: Instant recall of codebase architecture in ~2,000 tokens.
The problem: You have no visibility into how many tokens your agents consume. You can't tell which agent is wasteful or whether your optimizations are working.
How agent-memory solves it:
from src.metrics.tracker import MetricsTracker
from src.memory.store import MemoryStore
memory = MemoryStore("./data/memory.db")
tracker = MetricsTracker(memory)
tracker.print_dashboard("my-project")

Output:
============================================================
MULTI-AGENT SYSTEM METRICS
============================================================
Total API calls: 47
Total tokens: 284,000
Cached tokens: 89,000
Cache hit rate: 31.3%
Compression ratio: 18.2:1
Token savings: 94%
Observations stored: 156
Avg latency: 3,200ms
Per-Agent Breakdown:
------------------------------------------------------------
Type Calls Cache% Cost Ratio Latency
coder 15 38.2% 0.68x 4,100ms
researcher 12 29.1% 0.74x 2,800ms
reviewer 10 33.5% 0.71x 2,400ms
planner 5 18.7% 0.85x 3,900ms
summarizer 5 44.2% 0.62x 1,200ms
============================================================
Or use the MCP tool directly in Claude Code:
You: "How much have we saved on tokens?"
Claude: [uses memory_stats tool]
-> "156 observations, 18.2:1 compression ratio, 94% token savings.
Estimated savings: ~267,000 tokens ($1.34 at Sonnet pricing)."
The problem: You want to develop and test multi-agent workflows but don't want to burn API credits during prototyping.
How agent-memory solves it:
# Start Ollama
ollama serve
# Run the full 3-agent pipeline locally
python run_local.py --model phi4:latest

AGENT: RESEARCHER
Task: Research token optimization strategies...
Generating... done (114.3s)
Observations: 1 [discovery] Three Core Optimization Strategies
AGENT: CODER
Task: Implement context builder...
Context injected: researcher's findings (automatic)
Generating... done (154.0s)
Observations: 1 [feature] build_context Function
AGENT: REVIEWER
Task: Review the implementation...
Context injected: researcher + coder findings (automatic)
Generating... done (90.2s)
Observations: 1 [discovery] Greedy Algorithm Review
Token Economics:
Compression: 3.0:1
Savings: 66%
Cost: $0.00
The entire memory pipeline (compression, shared memory, context injection) works identically with local models. When you're ready, switch to Anthropic with zero code changes:
# Local development
agent = LocalLLMAgent(config, memory, project, base_url="http://localhost:11434/v1")
# Production — just swap the class
agent = Agent(config, memory, project)  # Uses Anthropic API

The problem: You shouldn't have to configure anything. The plugin should just work optimally whether you're on Pro, Max, or using a local LLM.
How agent-memory solves it:
The plugin auto-detects your model and plan at startup:
| Your Setup | Auto-Detected Profile | Memory Budget | Thinking Cap |
|---|---|---|---|
| Pro + Sonnet (default) | `sonnet-pro` | 8,000 tokens | none |
| Pro + Opus (/extra-usage) | `opus-pro` | 5,000 tokens | 10,000 |
| Max + Sonnet | `sonnet-max` | 16,000 tokens | none |
| Max + Opus | `opus-max` | 50,000 tokens | none |
| API key (direct) | `sonnet-api` | 8,000 tokens | none |
| Ollama / LM Studio | `local` | 1,500 tokens | none |
No configuration files. No environment variables to set. It reads CLAUDE_MODEL, CLAUDE_CODE_MAX_PLAN, and ANTHROPIC_API_KEY from your environment and picks the optimal profile.
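A sketch of how that detection can work — the branching, flag values, and defaults here are illustrative assumptions, not the library's actual detect_profile (local-LLM detection is omitted for brevity):

```python
# Hypothetical profile detection from the environment variables named above.
import os

def detect_profile() -> str:
    model = os.environ.get("CLAUDE_MODEL", "").lower()
    on_max = os.environ.get("CLAUDE_CODE_MAX_PLAN") == "1"  # assumed flag value
    has_key = bool(os.environ.get("ANTHROPIC_API_KEY"))

    if os.environ.get("AGENT_MEMORY_PLAN") == "max":  # manual override, see below
        on_max = True

    if "opus" in model:
        return "opus-max" if on_max else "opus-pro"
    if has_key and not model:
        return "sonnet-api"  # direct API use, no Claude Code model hint
    return "sonnet-max" if on_max else "sonnet-pro"
```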
Override if needed:
# Force Max Plan profile (if auto-detection gets it wrong)
export AGENT_MEMORY_PLAN=max

src/
  memory/
    store.py            # SQLite + ChromaDB + FTS5 memory layer
    context_builder.py  # Token-budgeted context injection
    condenser.py        # Periodic summarization pipeline
  agents/
    base.py             # Base agent with memory integration
    local_llm.py        # Ollama/LM Studio adapter
    registry.py         # Agent type definitions
  cache/
    prompt_cache.py     # Anthropic prompt caching wrapper
    rate_limiter.py     # Token bucket RPM/TPM limiter
  orchestrator/
    router.py           # DAG-based multi-agent task router
  metrics/
    tracker.py          # Token/cache/latency tracking
plugin/
  mcp_server.py         # MCP server for Claude Code
  hooks/
    post_tool_use.py    # Auto-compress tool output
    session_start.py    # Inject memory at session start
  plugin.json           # Claude Code plugin manifest
tests/
  test_memory.py        # Memory store + search + condensation tests
  test_orchestrator.py  # DAG execution + cache structure tests
| Feature | Status |
|---|---|
| Observation compression (XML/auto-extract) | Working |
| SQLite + FTS5 search | Working |
| ChromaDB semantic search (optional) | Working |
| Token-budgeted context injection | Working |
| Anthropic prompt caching | Working |
| Rate limiter (RPM + TPM) | Working |
| Periodic condensation | Working |
| Local LLM support (Ollama/LM Studio) | Working |
| Multi-agent shared memory | Working |
| DAG-based orchestrator | Working |
| Metrics dashboard | Working |
| Claude Code plugin (MCP + hooks) | Beta |
python run_demo.py

export ANTHROPIC_API_KEY=sk-ant-...
python run_demo.py --live

ollama serve  # In another terminal
python run_local.py --model phi4:latest

pip install -e ".[dev]"
pytest tests/ -v

All 14 tests cover: memory CRUD, agent isolation, condensation, semantic search, token budgets, cache breakpoints, DAG execution, and metrics logging.
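As a flavor of that coverage, a memory-CRUD test could look like the sketch below (the search method and result shape are assumptions; the real tests in tests/test_memory.py may differ):

```python
# Hypothetical memory CRUD test in the style of tests/test_memory.py.
from agent_memory import MemoryStore, Observation

def test_store_and_search(tmp_path):
    memory = MemoryStore(str(tmp_path / "test.db"))
    memory.store_observation(Observation(
        agent_id="researcher-1",
        project="demo",
        title="Found pagination bug in /users endpoint",
        narrative="API returns 500 when page > 100",
        facts=["Max page size is 100"],
        concepts=["api", "bug"],
    ))
    results = memory.search("pagination", project="demo")  # assumed method
    assert len(results) == 1
    assert "pagination" in results[0].title                # assumed result shape
```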
from agent_memory import ContextBudget
budget = ContextBudget(
    total=8000,             # Total token budget for context
    task_description=800,   # Reserved for task text
    own_observations=4000,  # Agent's own recent work
    cross_agent=2400,       # Other agents' relevant findings
    summaries=800,          # Condensed history
)

from agent_memory.cache import RateLimiter
limiter = RateLimiter(
    requests_per_minute=50,
    tokens_per_minute=80_000,
)
limiter.acquire_sync(estimated_tokens=4000)  # Blocks until slot available

Built-in: researcher, coder, reviewer, summarizer, planner. Custom:
config = AgentConfig(
    agent_type="my-agent",
    model="claude-sonnet-4-6-20250514",
    max_output_tokens=2000,
    system_prompt="You are a specialized agent for...",
)

Measured on real workloads (not synthetic):
| Metric | Result |
|---|---|
| Token savings (compression) | 66-94% |
| Compression ratio | 3:1 to 74:1 |
| Prompt cache hit rate | 23-35% |
| Cache cost reduction | 0.71x |
| Context budget utilization | 43% avg |
| Cross-agent memory sharing | 100% (all agents see shared pool) |
See API_REFERENCE.md for the complete SDK documentation.
- Fork the repo
- Create a feature branch
- Run tests: pytest tests/ -v
- Submit a PR
MIT - see LICENSE