
Enable prompt caching for agent calls (closes #611)#612

Open
kiranandcode wants to merge 3 commits into master from kg-prompt-caching

Conversation

@kiranandcode
Contributor

Updates completion.py so that for system messages (call_system), content is now a list annotated with cache_control: {"type": "ephemeral"}, which caches the system prompt across all turns.
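A minimal sketch of the message shape this produces (the helper name and exact dict layout are my assumption based on the description above, not the PR's actual code):

```python
# Sketch only: helper name and structure are assumptions, not the PR's code.
def cached_system_message(text):
    """Build a system message whose content is a list of blocks, carrying an
    ephemeral cache_control marker so Anthropic caches the system prompt."""
    return {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": text,
                "cache_control": {"type": "ephemeral"},
            }
        ],
    }
```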

Agent user messages (LiteLLMProvider._call): _add_cache_control_to_history() annotates the last user/tool message with cache_control before each call_assistant round, but only for templates with a history (i.e. Agent subclasses). Non-agent template calls are unaffected.

OpenAI calls are unaffected: OpenAI enables prompt caching by default, and litellm strips cache_control from OpenAI requests automatically.

Cost impact: Anthropic charges 25% more for cache writes, but cached reads cost 90% less than regular input tokens.
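A back-of-envelope check of those multipliers (relative input-token cost only, using just the 1.25x write and 0.10x read factors stated above):

```python
# Relative input-token cost of reading one shared prefix across n turns,
# using only the multipliers from the description: cache writes cost 1.25x
# base input, cache reads 0.1x (90% less).
def relative_cost(turns, cached):
    if cached:
        return 1.25 + 0.10 * (turns - 1)  # one cache write, then cheap reads
    return 1.0 * turns                    # full-price input every turn

assert relative_cost(1, cached=True) > relative_cost(1, cached=False)  # one-shot: slightly pricier
assert relative_cost(2, cached=True) < relative_cost(2, cached=False)  # pays off by the second turn
```

So caching is a small loss on a one-shot call and a growing win for multi-turn agents, which is consistent with gating it on templates that keep a history.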

@kiranandcode
Contributor Author

@naiimic can you try running the MARA agent with this branch? it should be a lot faster.

@kiranandcode kiranandcode linked an issue Mar 13, 2026 that may be closed by this pull request

@eb8680 left a comment


What happens if we add cache_control to each message at the time it is constructed? Is that incorrect vs the final-message-only behavior here?

```python
if msg["role"] not in ("user", "tool", "assistant"):
    continue
content = msg.get("content")
if isinstance(content, list) and content:
```
Contributor


Ah, I see this applies to all Messages. Would it be easier to have this live in _make_message, our Message constructor? Or maybe in completions?

Contributor Author


I think that makes more sense, will update.

Contributor Author


Ah, no, I guess the key difference is that we only apply this if the template has a history, so we don't cache every template, only agent ones.



Development

Successfully merging this pull request may close these issues.

Effectful/litellm does not enable prompt_caching for anthropic
