Strategies for optimizing agent performance and reducing costs.
Batch embedding generation is 5x faster than individual calls:

```typescript
// ❌ Slow - individual calls
for (const text of texts) {
  const embedding = await generateEmbeddingAI({ model, value: text });
}

// ✅ Fast - batch processing
const result = await generateEmbeddingsAI({ model, values: texts });
const embeddings = result.embeddings;
```

Embedding generation includes automatic LRU caching:
```typescript
// First call - API request
const embedding1 = await generateEmbeddingAI({ model, value: 'text' });

// Second call - cached (1000x faster)
const embedding2 = await generateEmbeddingAI({ model, value: 'text' });
```

The SDK automatically executes multiple tools in parallel:
```typescript
// Agent calls multiple tools - executed in parallel
const agent = new Agent({
  tools: {
    tool1: { /* ... */ },
    tool2: { /* ... */ },
    tool3: { /* ... */ }
  }
});

// All three tools execute simultaneously
await run(agent, 'Use all tools');
```

Automatic TOON encoding (recommended - 18-33% token reduction):
Enable automatic TOON encoding for all tool results:

```typescript
const agent = new Agent({
  name: 'Data Agent',
  instructions: 'You analyze data.',
  tools: {
    getUsers: tool({
      description: 'Get user list',
      inputSchema: z.object({}),
      execute: async () => {
        const users = await db.users.find().toArray();
        // Automatically encoded to TOON (no manual encoding needed)
        return users;
      }
    })
  },
  useTOON: true // ✅ Enable automatic TOON encoding
});
```

Benefits:
- 18-33% token reduction in most scenarios
- 10-20% faster latency in most scenarios
- Zero code changes - automatic encoding/decoding
- Transfer-safe - transfer markers preserved
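To build intuition for where these savings come from, here is an illustrative comparison. Note that `encodeTabular` below is NOT the real TOON format - it is a minimal sketch showing why tabular encodings shrink uniform object arrays: field names are emitted once instead of once per row.

```typescript
// Illustrative sketch only - NOT the actual TOON encoding.
type Row = Record<string, string | number>;

function encodeTabular(rows: Row[]): string {
  if (rows.length === 0) return '[]';
  const keys = Object.keys(rows[0]);
  const header = keys.join(',');
  // Each row repeats only values, never field names.
  const body = rows.map((row) => keys.map((k) => String(row[k])).join(','));
  return [header, ...body].join('\n');
}

const users = Array.from({ length: 50 }, (_, i) => ({
  id: i,
  name: `user${i}`,
  role: 'member'
}));

const jsonChars = JSON.stringify(users).length;
const tabularChars = encodeTabular(users).length;
console.log(`JSON: ${jsonChars} chars, tabular: ${tabularChars} chars`);
```

The larger and more uniform the array, the bigger the win - which matches the "data-heavy agents" guidance below.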
Manual TOON encoding (for custom use cases):

```typescript
import { encodeTOON } from '../../src';

const agent = new Agent({
  tools: {
    getUsers: tool({
      description: 'Get user list',
      inputSchema: z.object({}),
      execute: async () => {
        const users = await db.users.find().toArray();
        // Manual TOON encoding (42% smaller than JSON)
        return encodeTOON(users);
      }
    })
  }
});
```

When to use TOON:
- ✅ Large tool responses (arrays/objects with many fields)
- ✅ Data-heavy agents (RAG, analytics, reporting)
- ✅ Agents that return structured data repeatedly
- ✅ Cost-sensitive applications
Performance results:
- RAG systems: 18-33% token reduction
- Best case: 33% token savings (React query example)
- Average: 20-25% token reduction
- Latency: 10-20% faster in most scenarios
- MemorySession: Fastest, but not persistent
- RedisSession: Fast, persistent, good for production
- DatabaseSession: Slower, but more durable
- HybridSession: Best of both worlds
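Whichever backend you choose, capping stored history is what keeps context small. As a sketch of the idea - this is a plain in-memory buffer, not the SDK's Session API - a message cap might work like this:

```typescript
// Sketch of message-capped history; the SDK's sessions may differ.
interface Message {
  role: 'user' | 'assistant';
  content: string;
}

class CappedHistory {
  private messages: Message[] = [];
  constructor(private maxMessages: number) {}

  add(message: Message): void {
    this.messages.push(message);
    // Drop the oldest messages once the cap is exceeded,
    // mirroring what a maxMessages option would do.
    if (this.messages.length > this.maxMessages) {
      this.messages.splice(0, this.messages.length - this.maxMessages);
    }
  }

  getHistory(): Message[] {
    return [...this.messages];
  }
}

const history = new CappedHistory(3);
for (let i = 1; i <= 5; i++) {
  history.add({ role: 'user', content: `message ${i}` });
}
console.log(history.getHistory().map((m) => m.content));
// keeps only 'message 3' through 'message 5'
```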
```typescript
// For high-traffic production
const session = new RedisSession('user-123', {
  redis: redisClient,
  ttl: 3600 // Cache for 1 hour
});
```

Limit stored messages to reduce context size:
```typescript
const session = new RedisSession('user-123', {
  redis: redisClient,
  maxMessages: 20 // Only keep the last 20 messages
});
```

Match the model to the complexity of the task:

```typescript
// ✅ Fast and cheap for simple tasks
const fastAgent = new Agent({
  model: openai('gpt-4o-mini'),
  instructions: 'Quick responses'
});

// ✅ Powerful for complex tasks
const smartAgent = new Agent({
  model: openai('gpt-4o'),
  instructions: 'Detailed analysis'
});
```

Run multiple agents in parallel and use the fastest:
```typescript
const result = await raceAgents(
  [fastAgent, smartAgent],
  'Simple question',
  { timeoutMs: 3000 }
);
```

Use smaller models for guardrails:
```typescript
// ✅ Efficient
const result = await run(agent, input, {
  inputGuardrails: [
    contentSafetyGuardrail({
      model: openai('gpt-4o-mini') // Smaller, faster model
    })
  ]
});
```

Guardrails execute in parallel automatically:
```typescript
// All guardrails execute simultaneously
const result = await run(agent, input, {
  inputGuardrails: [
    guardrail1,
    guardrail2,
    guardrail3
  ]
});
```

Streaming provides better perceived performance:
```typescript
// ✅ Better UX - the user sees the response immediately
const stream = await runStream(agent, 'Tell a long story');
for await (const chunk of stream.textStream) {
  process.stdout.write(chunk);
}
```

Embeddings are automatically cached, but you can also cache at the application level:
```typescript
const embeddingCache = new Map<string, number[]>();

async function getCachedEmbedding(text: string) {
  if (embeddingCache.has(text)) {
    return embeddingCache.get(text)!;
  }
  const result = await generateEmbeddingAI({ model, value: text });
  embeddingCache.set(text, result.embedding);
  return result.embedding;
}
```

For deterministic queries, cache agent responses:
```typescript
const responseCache = new Map<string, string>();

async function getCachedResponse(query: string) {
  const cacheKey = hashQuery(query);
  if (responseCache.has(cacheKey)) {
    return responseCache.get(cacheKey)!;
  }
  const result = await run(agent, query);
  responseCache.set(cacheKey, result.finalOutput);
  return result.finalOutput;
}
```

Monitor token usage to optimize costs:
```typescript
const result = await run(agent, 'Hello');
console.log('Tokens used:', result.metadata.usage.totalTokens);
console.log('Cost:', calculateCost(result.metadata.usage));
```

Track execution times:
```typescript
const start = Date.now();
const result = await run(agent, 'Hello');
const duration = Date.now() - start;
console.log(`Execution time: ${duration}ms`);
```

The SDK automatically handles message format conversion for compatibility:
Problem: Sessions return UIMessage[], but the AI SDK's generateText() requires ModelMessage[].
Solution: The SDK automatically converts UIMessage[] to ModelMessage[] using convertToModelMessages().

```typescript
// Sessions return UIMessage[]
const session = new MemorySession('user-123');
const history = await session.getHistory(); // UIMessage[]

// The SDK converts to ModelMessage[] before calling generateText()
const result = await run(agent, 'Hello', { session });
// ✅ No errors - conversion happens automatically
```

What gets converted:
- ✅ Session history (UIMessage[] → ModelMessage[])
- ✅ Tool results (already ModelMessage[]-compatible)
- ✅ Transfer messages (preserved as ModelMessage[])
- ✅ User input (already ModelMessage[])
No action required - conversion happens automatically in prepareMessages().
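The cost-monitoring snippet earlier calls a calculateCost helper without defining it. One possible sketch - the per-token prices below are placeholders, not real model pricing, and the usage shape (inputTokens/outputTokens) is an assumption:

```typescript
interface Usage {
  inputTokens: number;
  outputTokens: number;
}

// Placeholder prices in USD per 1M tokens - NOT real rates.
// Substitute your provider's current pricing.
const PRICE_PER_MILLION = { input: 0.15, output: 0.6 };

function calculateCost(usage: Usage): number {
  return (
    (usage.inputTokens * PRICE_PER_MILLION.input +
      usage.outputTokens * PRICE_PER_MILLION.output) /
    1_000_000
  );
}

console.log(calculateCost({ inputTokens: 1000, outputTokens: 500 }).toFixed(6));
```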
- Enable TOON for data-heavy agents (18-33% token reduction)
- Batch operations when possible
- Use caching for repeated operations
- Choose appropriate models for each task
- Limit context size with message limits
- Use streaming for better UX
- Monitor token usage to optimize costs
- Run independent operations in parallel
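The application-level caches shown earlier use unbounded Maps, which grow without limit in a long-running process. A small LRU wrapper - a sketch, not an SDK API - keeps memory bounded by evicting the least recently used entry:

```typescript
// Minimal LRU cache built on Map's insertion-order iteration.
class LRUCache<K, V> {
  private map = new Map<K, V>();
  constructor(private maxSize: number) {}

  get(key: K): V | undefined {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key)!;
    // Re-insert to mark this key as most recently used.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxSize) {
      // The first key in iteration order is the least recently used.
      const oldest = this.map.keys().next().value as K;
      this.map.delete(oldest);
    }
  }
}

const cache = new LRUCache<string, number[]>(2);
cache.set('a', [1]);
cache.set('b', [2]);
cache.get('a');      // touch 'a' so 'b' becomes least recently used
cache.set('c', [3]); // evicts 'b'
console.log(cache.get('b')); // undefined - evicted
```

The same wrapper works for both the embedding cache and the response cache above.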
For more details, see:
- TOON Optimization Guide - Complete TOON guide
- API Reference
- Architecture
- Getting Started
- Core Concepts