FI7: Evaluator / A-B framework (sister crate)

**Status:** Future idea, not scheduled. Lives in the v0.5+ "maybe" bucket of the roadmap.

## Summary

Benchmark agents across prompts, models, and tools. This could live in a sister crate, `temporal-agent-rs-eval`.

## Why this matters

Once you have agents in production, you want to test new prompts, models, and tools against historical traces before rolling them out. `autoagents` already has an `evaluator` module that could be the foundation.

## Why not scheduled

Tangential to the core durability mission. Better suited to a separate crate so it can have its own release cadence and dependencies.

## Open questions

- Sister crate vs. feature flag?
- Trace-replay-based eval vs. live A/B?
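
To make the trace-replay option more concrete, here is a minimal sketch of what replaying historical traces against two variants might look like. Everything in it is hypothetical: the `Trace` struct, `sample_traces`, and `replay_score` are illustration-only names, variants are modeled as plain closures, and scoring is exact string match. It does not use the `autoagents` `evaluator` module or any existing API of this crate.

```rust
/// Hypothetical record of one historical run: the input the agent received
/// and the output that was accepted in production.
#[derive(Debug, Clone)]
struct Trace {
    input: String,
    expected: String,
}

/// Hypothetical fixture standing in for traces loaded from workflow history.
fn sample_traces() -> Vec<Trace> {
    [("  Hello ", "hello"), ("World", "world"), ("ok", "ok")]
        .iter()
        .map(|(input, expected)| Trace {
            input: input.to_string(),
            expected: expected.to_string(),
        })
        .collect()
}

/// Replays every trace through `variant` and returns the fraction of traces
/// whose output exactly matches the recorded one. Returns 0.0 when there are
/// no traces. Exact match is an assumption made for brevity; a real
/// evaluator would likely need a softer comparison.
fn replay_score(variant: impl Fn(&str) -> String, traces: &[Trace]) -> f64 {
    if traces.is_empty() {
        return 0.0;
    }
    let hits = traces
        .iter()
        .filter(|t| variant(&t.input) == t.expected)
        .count();
    hits as f64 / traces.len() as f64
}
```

An A/B comparison under this model is two calls to `replay_score` over the same traces, one per variant, followed by a comparison of the two scores. Live A/B would instead route real traffic to both variants, which this sketch does not attempt to cover.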
