Strategies for optimizing agent performance and reducing costs.
Batch embedding generation is 5x faster than individual calls:

```typescript
// ❌ Slow - individual calls
for (const text of texts) {
  const embedding = await generateEmbeddingAI({ model, value: text });
}

// ✅ Fast - batch processing
const result = await generateEmbeddingsAI({ model, values: texts });
const embeddings = result.embeddings;
```

Embedding generation includes automatic LRU caching:
```typescript
// First call - API request
const embedding1 = await generateEmbeddingAI({ model, value: 'text' });

// Second call - cached (1000x faster)
const embedding2 = await generateEmbeddingAI({ model, value: 'text' });
```

The SDK automatically executes multiple tools in parallel:
```typescript
// Agent calls multiple tools - executed in parallel
const agent = new Agent({
  tools: {
    tool1: { /* ... */ },
    tool2: { /* ... */ },
    tool3: { /* ... */ }
  }
});

// All three tools execute simultaneously
await run(agent, 'Use all tools');
```

Automatic TOON encoding (recommended - 18-33% token reduction):
Enable automatic TOON encoding for all tool results:

```typescript
const agent = new Agent({
  name: 'Data Agent',
  instructions: 'You analyze data.',
  tools: {
    getUsers: tool({
      description: 'Get user list',
      inputSchema: z.object({}),
      execute: async () => {
        const users = await db.users.find().toArray();
        // Automatically encoded to TOON (no manual encoding needed)
        return users;
      }
    })
  },
  useTOON: true // ✅ Enable automatic TOON encoding
});
```

Benefits:
- 18-33% token reduction in most scenarios
- 10-20% faster latency in most scenarios
- Zero code changes - automatic encoding/decoding
- Transfer-safe - transfer markers preserved
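To build intuition for where these savings come from, here is an illustrative comparison. Note that `encodeTabular` below is NOT the real TOON format - it is a minimal sketch showing why tabular encodings shrink uniform object arrays: field names are emitted once instead of once per row.

```typescript
// Illustrative sketch only - NOT the actual TOON encoding.
type Row = Record<string, string | number>;

function encodeTabular(rows: Row[]): string {
  if (rows.length === 0) return '[]';
  const keys = Object.keys(rows[0]);
  const header = keys.join(',');
  // Each row repeats only values, never field names.
  const body = rows.map((row) => keys.map((k) => String(row[k])).join(','));
  return [header, ...body].join('\n');
}

const users = Array.from({ length: 50 }, (_, i) => ({
  id: i,
  name: `user${i}`,
  role: 'member'
}));

const jsonChars = JSON.stringify(users).length;
const tabularChars = encodeTabular(users).length;
console.log(`JSON: ${jsonChars} chars, tabular: ${tabularChars} chars`);
```

The larger and more uniform the array, the bigger the win - which matches the "data-heavy agents" guidance below.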
Manual TOON encoding (for custom use cases):

```typescript
import { encodeTOON } from '../../src';

const agent = new Agent({
  tools: {
    getUsers: tool({
      description: 'Get user list',
      inputSchema: z.object({}),
      execute: async () => {
        const users = await db.users.find().toArray();
        // Manual TOON encoding (42% smaller than JSON)
        return encodeTOON(users);
      }
    })
  }
});
```

When to use TOON:
- ✅ Large tool responses (arrays/objects with many fields)
- ✅ Data-heavy agents (RAG, analytics, reporting)
- ✅ Agents that return structured data repeatedly
- ✅ Cost-sensitive applications
Performance results:
- RAG systems: 18-33% token reduction
- Best case: 33% token savings (React query example)
- Average: 20-25% token reduction
- Latency: 10-20% faster in most scenarios
- MemorySession: Fastest, but not persistent
- RedisSession: Fast, persistent, good for production
- DatabaseSession: Slower, but more durable
- HybridSession: Best of both worlds
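Whichever backend you choose, capping stored history is what keeps context small. As a sketch of the idea - this is a plain in-memory buffer, not the SDK's Session API - a message cap might work like this:

```typescript
// Sketch of message-capped history; the SDK's sessions may differ.
interface Message {
  role: 'user' | 'assistant';
  content: string;
}

class CappedHistory {
  private messages: Message[] = [];
  constructor(private maxMessages: number) {}

  add(message: Message): void {
    this.messages.push(message);
    // Drop the oldest messages once the cap is exceeded,
    // mirroring what a maxMessages option would do.
    if (this.messages.length > this.maxMessages) {
      this.messages.splice(0, this.messages.length - this.maxMessages);
    }
  }

  getHistory(): Message[] {
    return [...this.messages];
  }
}

const history = new CappedHistory(3);
for (let i = 1; i <= 5; i++) {
  history.add({ role: 'user', content: `message ${i}` });
}
console.log(history.getHistory().map((m) => m.content));
// keeps only 'message 3' through 'message 5'
```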
```typescript
// For high-traffic production
const session = new RedisSession('user-123', {
  redis: redisClient,
  ttl: 3600 // Cache for 1 hour
});
```

Limit stored messages to reduce context size:
```typescript
const session = new RedisSession('user-123', {
  redis: redisClient,
  maxMessages: 20 // Only keep the last 20 messages
});
```

Match the model to the complexity of the task:

```typescript
// ✅ Fast and cheap for simple tasks
const fastAgent = new Agent({
  model: openai('gpt-4o-mini'),
  instructions: 'Quick responses'
});

// ✅ Powerful for complex tasks
const smartAgent = new Agent({
  model: openai('gpt-4o'),
  instructions: 'Detailed analysis'
});
```

Run multiple agents in parallel and use the fastest:
```typescript
const result = await raceAgents(
  [fastAgent, smartAgent],
  'Simple question',
  { timeoutMs: 3000 }
);
```

Use smaller models for guardrails:
```typescript
// ✅ Efficient
const result = await run(agent, input, {
  inputGuardrails: [
    contentSafetyGuardrail({
      model: openai('gpt-4o-mini') // Smaller, faster model
    })
  ]
});
```

Guardrails execute in parallel automatically:
```typescript
// All guardrails execute simultaneously
const result = await run(agent, input, {
  inputGuardrails: [
    guardrail1,
    guardrail2,
    guardrail3
  ]
});
```

Streaming provides better perceived performance:
```typescript
// ✅ Better UX - the user sees the response immediately
const stream = await runStream(agent, 'Tell a long story');
for await (const chunk of stream.textStream) {
  process.stdout.write(chunk);
}
```

Embeddings are automatically cached, but you can also cache at the application level:
```typescript
const embeddingCache = new Map<string, number[]>();

async function getCachedEmbedding(text: string) {
  if (embeddingCache.has(text)) {
    return embeddingCache.get(text)!;
  }
  const result = await generateEmbeddingAI({ model, value: text });
  embeddingCache.set(text, result.embedding);
  return result.embedding;
}
```

For deterministic queries, cache agent responses:
```typescript
const responseCache = new Map<string, string>();

async function getCachedResponse(query: string) {
  const cacheKey = hashQuery(query);
  if (responseCache.has(cacheKey)) {
    return responseCache.get(cacheKey)!;
  }
  const result = await run(agent, query);
  responseCache.set(cacheKey, result.finalOutput);
  return result.finalOutput;
}
```

Monitor token usage to optimize costs:
```typescript
const result = await run(agent, 'Hello');
console.log('Tokens used:', result.metadata.usage.totalTokens);
console.log('Cost:', calculateCost(result.metadata.usage));
```

Track execution times:
```typescript
const start = Date.now();
const result = await run(agent, 'Hello');
const duration = Date.now() - start;
console.log(`Execution time: ${duration}ms`);
```

The SDK automatically handles message format conversion for compatibility:
Problem: Sessions return UIMessage[], but the AI SDK's generateText() requires ModelMessage[].
Solution: The SDK automatically converts UIMessage[] to ModelMessage[] using convertToModelMessages().

```typescript
// Sessions return UIMessage[]
const session = new MemorySession('user-123');
const history = await session.getHistory(); // UIMessage[]

// The SDK converts to ModelMessage[] before calling generateText()
const result = await run(agent, 'Hello', { session });
// ✅ No errors - conversion happens automatically
```

What gets converted:
- ✅ Session history (UIMessage[] → ModelMessage[])
- ✅ Tool results (already ModelMessage[]-compatible)
- ✅ Transfer messages (preserved as ModelMessage[])
- ✅ User input (already ModelMessage[])
No action required - conversion happens automatically in prepareMessages().
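The cost-monitoring snippet earlier calls a calculateCost helper without defining it. One possible sketch - the per-token prices below are placeholders, not real model pricing, and the usage shape (inputTokens/outputTokens) is an assumption:

```typescript
interface Usage {
  inputTokens: number;
  outputTokens: number;
}

// Placeholder prices in USD per 1M tokens - NOT real rates.
// Substitute your provider's current pricing.
const PRICE_PER_MILLION = { input: 0.15, output: 0.6 };

function calculateCost(usage: Usage): number {
  return (
    (usage.inputTokens * PRICE_PER_MILLION.input +
      usage.outputTokens * PRICE_PER_MILLION.output) /
    1_000_000
  );
}

console.log(calculateCost({ inputTokens: 1000, outputTokens: 500 }).toFixed(6));
```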
- Enable TOON for data-heavy agents (18-33% token reduction)
- Batch operations when possible
- Use caching for repeated operations
- Choose appropriate models for each task
- Limit context size with message limits
- Use streaming for better UX
- Monitor token usage to optimize costs
- Run independent operations in parallel
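The application-level caches shown earlier use unbounded Maps, which grow without limit in a long-running process. A small LRU wrapper - a sketch, not an SDK API - keeps memory bounded by evicting the least recently used entry:

```typescript
// Minimal LRU cache built on Map's insertion-order iteration.
class LRUCache<K, V> {
  private map = new Map<K, V>();
  constructor(private maxSize: number) {}

  get(key: K): V | undefined {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key)!;
    // Re-insert to mark this key as most recently used.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxSize) {
      // The first key in iteration order is the least recently used.
      const oldest = this.map.keys().next().value as K;
      this.map.delete(oldest);
    }
  }
}

const cache = new LRUCache<string, number[]>(2);
cache.set('a', [1]);
cache.set('b', [2]);
cache.get('a');      // touch 'a' so 'b' becomes least recently used
cache.set('c', [3]); // evicts 'b'
console.log(cache.get('b')); // undefined - evicted
```

The same wrapper works for both the embedding cache and the response cache above.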
For more details, see:
- TOON Optimization Guide - Complete TOON guide
- API Reference
- Architecture
- Getting Started
- Core Concepts