Status: Future idea, not scheduled. Lives in the v0.5+ ""maybe"" bucket of the roadmap.
Summary
Benchmark agents across prompts / models / tools. Could live in a sister crate temporal-agent-rs-eval.
Why this matters
Once you have agents in production, you want to test new prompts / models / tools against historical traces. autoagents already has an evaluator module that could be the foundation.
Why not scheduled
Tangential to the core durability mission. Better suited to a separate crate so it can have its own release cadence and dependencies.
Open questions
- Sister crate vs. feature flag?
- Trace-replay-based eval vs. live A/B?
Status: Future idea, not scheduled. Lives in the v0.5+ ""maybe"" bucket of the roadmap.
Summary
Benchmark agents across prompts / models / tools. Could live in a sister crate
temporal-agent-rs-eval.Why this matters
Once you have agents in production, you want to test new prompts / models / tools against historical traces.
autoagentsalready has anevaluatormodule that could be the foundation.Why not scheduled
Tangential to the core durability mission. Better suited to a separate crate so it can have its own release cadence and dependencies.
Open questions