
FI7: Evaluator / A-B framework (sister crate) #23

@frisbeeman


Status: Future idea, not scheduled. Lives in the v0.5+ "maybe" bucket of the roadmap.

Summary

Benchmark agents across prompts / models / tools. Could live in a sister crate temporal-agent-rs-eval.

Why this matters

Once you have agents in production, you want to test new prompts / models / tools against historical traces. autoagents already has an evaluator module that could be the foundation.
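To make the trace-replay idea concrete, here is a minimal sketch of what the core loop of such a crate might look like. Every name in it (`Trace`, `Candidate`, `Report`, `replay`, `compare`) is hypothetical and invented for illustration; none of it is the autoagents evaluator API or anything that exists in temporal-agent-rs today. It also assumes exact string match as the scoring rule, which a real evaluator would likely replace with something richer.

```rust
/// One recorded interaction from production (hypothetical shape).
#[derive(Debug, Clone)]
pub struct Trace {
    pub input: String,
    /// The output considered correct for this input.
    pub expected: String,
}

/// A configuration under test: some prompt / model / tool combination,
/// reduced here to "takes an input, returns an output".
pub trait Candidate {
    fn name(&self) -> &str;
    fn run(&self, input: &str) -> String;
}

/// Result of replaying a set of traces against one candidate.
#[derive(Debug, PartialEq)]
pub struct Report {
    pub name: String,
    pub passed: usize,
    pub total: usize,
}

impl Report {
    /// Fraction of traces passed; 0.0 when there were no traces.
    pub fn pass_rate(&self) -> f64 {
        if self.total == 0 {
            0.0
        } else {
            self.passed as f64 / self.total as f64
        }
    }
}

/// Replay every trace through the candidate and count exact matches.
pub fn replay(candidate: &dyn Candidate, traces: &[Trace]) -> Report {
    let passed = traces
        .iter()
        .filter(|t| candidate.run(&t.input) == t.expected)
        .count();
    Report {
        name: candidate.name().to_string(),
        passed,
        total: traces.len(),
    }
}

/// Offline A/B: run two candidates over the same historical traces.
pub fn compare(a: &dyn Candidate, b: &dyn Candidate, traces: &[Trace]) -> (Report, Report) {
    (replay(a, traces), replay(b, traces))
}

/// Stand-in candidates so the sketch has something to compare.
pub struct Echo;
pub struct Upper;

impl Candidate for Echo {
    fn name(&self) -> &str {
        "echo"
    }
    fn run(&self, input: &str) -> String {
        input.to_string()
    }
}

impl Candidate for Upper {
    fn name(&self) -> &str {
        "upper"
    }
    fn run(&self, input: &str) -> String {
        input.to_uppercase()
    }
}
```

Calling `compare(&Echo, &Upper, &traces)` on a slice of recorded traces yields one `Report` per candidate, and `pass_rate()` gives a single number to rank them by. Keeping `Candidate` as a trait is one way to let prompts, models, and tools vary independently of the replay loop, though a real design might need async execution and non-deterministic scoring.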

Why not scheduled

Tangential to the core durability mission. Better suited to a separate crate so it can have its own release cadence and dependencies.

Open questions

  • Sister crate vs. feature flag?
  • Trace-replay-based eval vs. live A/B?


    Labels

    enhancement (New feature or request)
    future (v0.5+ future idea, uncommitted)
