
Effectful/litellm does not enable prompt_caching for anthropic #611

@kiranandcode

Description


The Anthropic API has a mechanism for prompt caching, where successive calls to LLMs that share a prefix reuse the previously computed state over that prefix instead of recomputing it from scratch. This is fundamentally necessary for long-running agentic interactions: without it, as an agent's history grows, the cost of each call increases linearly with the length of that history.

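To make the cost growth concrete, here is a toy cost model (illustrative numbers, not measurements): each turn appends a fixed number of tokens, and without caching every turn re-processes the full prefix, so total input tokens grow quadratically; with caching, cached prefix tokens are billed at a reduced rate (assumed here to be 10% of the full rate).

```python
# Toy cost model for n agent turns. Assumptions (not from the issue):
# each turn adds ~200 tokens, and cached tokens cost 10% of uncached ones.
def total_input_tokens(turns, per_turn=200, cached_rate=0.1, caching=False):
    total = 0.0
    for t in range(1, turns + 1):
        prefix = (t - 1) * per_turn  # history accumulated so far
        new = per_turn               # tokens added this turn
        if caching:
            total += prefix * cached_rate + new
        else:
            total += prefix + new
    return total

for n in (10, 50, 100):
    print(n, total_input_tokens(n), total_input_tokens(n, caching=True))
```

With these assumptions, 100 turns costs roughly 8–9x more in input tokens without caching than with it.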

Minimal reproducer:

from effectful.handlers.llm.completions import LiteLLMProvider, completion
from effectful.handlers.llm.template import Template
from effectful.ops.semantics import handler
from effectful.ops.syntax import ObjectInterpretation, implements

captured = []

class Interceptor(ObjectInterpretation):
    """Capture the messages effectful hands to litellm.completion(), then abort."""

    @implements(completion)
    def _completion(self, *args, **kwargs):
        captured.append(kwargs.get("messages") or args[1])
        raise StopIteration  # abort before any network call is made

@Template.define
def ask(question: str) -> str:
    """You are a helpful assistant. Answer concisely.\n\n{question}"""

provider = LiteLLMProvider(model="anthropic/claude-sonnet-4-20250514")
try:
    with handler(provider):
        with handler(Interceptor()):
            ask("What is 2+2?")
except StopIteration:
    pass

def has_cache_control(msg):
    c = msg.get("content", "")
    if isinstance(c, list):
        return any("cache_control" in b for b in c if isinstance(b, dict))
    return "cache_control" in msg

print("=== Effectful Prompt Caching Report ===\n")
print(f"Messages sent to litellm.completion(): {len(captured[0])}")
for i, msg in enumerate(captured[0]):
    cached = has_cache_control(msg)
    print(f"  [{msg['role']:>9}] cache_control={cached}")

Outputs:

=== Effectful Prompt Caching Report ===

Messages sent to litellm.completion(): 2
  [   system] cache_control=False
  [     user] cache_control=False
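For comparison, a message list that opts into Anthropic caching through litellm marks individual content blocks with a cache_control field (a sketch of the expected shape, which litellm passes through to the Anthropic API; the text values are just the reproducer's prompts):

```python
# Sketch: the message shape litellm accepts for Anthropic prompt caching.
# The system prompt becomes a list of content blocks, and the block to be
# cached carries a cache_control marker of type "ephemeral".
messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a helpful assistant. Answer concisely.",
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    {"role": "user", "content": "What is 2+2?"},
]

def has_cache_control(msg):
    c = msg.get("content", "")
    if isinstance(c, list):
        return any("cache_control" in b for b in c if isinstance(b, dict))
    return "cache_control" in msg

print([has_cache_control(m) for m in messages])  # [True, False]
```

If effectful emitted messages of this shape (or exposed a hook for the provider to add the markers), the report above would show cache_control=True for the system message.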

Relevant docs:

Prompt caching optimizes your API usage by allowing resuming from specific prefixes in your prompts. This significantly reduces processing time and costs for repetitive tasks or prompts with consistent elements.

Prompt caching stores KV cache representations and cryptographic hashes of cached content, but does not store the raw text of prompts or responses. This may be suitable for customers who require ZDR-type data retention commitments. See cache lifetime for details.

There are two ways to enable prompt caching:

  • Automatic caching: Add a single cache_control field at the top level of your request. The system automatically applies the cache breakpoint to the last cacheable block and moves it forward as conversations grow. Best for multi-turn conversations where the growing message history should be cached automatically.
  • Explicit cache breakpoints: Place cache_control directly on individual content blocks for fine-grained control over exactly what gets cached.
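The two options above can be sketched as request bodies against the Anthropic Messages API (field names follow the quoted docs; the model name and prompts are taken from the reproducer, and the dicts are illustrative, not a tested integration):

```python
# 1. Automatic caching: a single top-level cache_control field; the API
#    moves the cache breakpoint forward as the conversation grows.
automatic = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 256,
    "cache_control": {"type": "ephemeral"},
    "messages": [{"role": "user", "content": "What is 2+2?"}],
}

# 2. Explicit breakpoints: cache_control placed on individual content
#    blocks, here on the system prompt, for fine-grained control.
explicit = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 256,
    "system": [
        {
            "type": "text",
            "text": "You are a helpful assistant. Answer concisely.",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "What is 2+2?"}],
}
```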

Caching is enabled automatically for prompts that are 1024 tokens or longer. When you make an API request, the following steps occur:

  • Cache Routing: Requests are routed to a machine based on a hash of the initial prefix of the prompt. The hash typically uses the first 256 tokens, though the exact length varies depending on the model. If you provide the prompt_cache_key parameter, it is combined with the prefix hash, allowing you to influence routing and improve cache hit rates; this is especially beneficial when many requests share long, common prefixes. If requests for the same prefix and prompt_cache_key combination exceed a certain rate (approximately 15 requests per minute), some may overflow and get routed to additional machines, reducing cache effectiveness.
  • Cache Lookup: The system checks if the initial portion (prefix) of your prompt exists in the cache on the selected machine.
  • Cache Hit: If a matching prefix is found, the system uses the cached result. This significantly decreases latency and reduces costs.
  • Cache Miss: If no matching prefix is found, the system processes your full prompt, caching the prefix afterward on that machine for future requests.
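The routing step described above can be sketched as a toy model (the real hash function, token counts, and machine pool are internal to Anthropic; this only illustrates why identical prefixes land on the same cache machine):

```python
import hashlib

N_MACHINES = 8       # hypothetical pool size, for illustration only
PREFIX_TOKENS = 256  # docs: roughly the first 256 tokens, model-dependent

def route(prompt_tokens, prompt_cache_key=None):
    """Pick a cache machine from a hash of the prompt's initial prefix.

    Folding prompt_cache_key into the hash lets callers that share a
    long common prefix steer their requests toward the same machine.
    """
    prefix = " ".join(prompt_tokens[:PREFIX_TOKENS])
    key = prefix if prompt_cache_key is None else f"{prompt_cache_key}:{prefix}"
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % N_MACHINES

long_prompt = [f"tok{i}" for i in range(300)]
same_machine = route(long_prompt) == route(long_prompt + ["more", "tokens"])
print(same_machine)  # True: the first 256 tokens are identical
```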
