audio example tests retain brittle judge criteria (audio-to-audio + py audio-to-text) — parity with #612's generic rewrite #655

@drewdrewthis

Description

Background

PR #612's review found that the LLM-judge criteria in the audio example tests are brittle: they assert over-specific model behaviors that a correct agent might not exhibit, which risks false failures on live runs. #612 rewrote the TS multimodal-audio-to-text criteria to be model-behavior-generic but left two siblings (audio-to-audio and Python audio-to-text) with the old brittle criteria, so the examples are now inconsistent and the untouched ones remain prone to flaky failures.
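To make the brittle-versus-generic distinction concrete, here is a minimal sketch. Every criteria string and the `render_rubric` helper are hypothetical illustrations; none are copied from the repository's tests or from #612.

```python
# All criteria strings below are hypothetical illustrations,
# not copied from the repository's tests.

# Brittle: pins exact wording or format that a correct agent may not produce.
BRITTLE_CRITERIA = [
    "Agent responds with the exact phrase 'I heard your audio message'",
    "Agent answers in exactly two sentences",
]

# Generic: describes the outcome and leaves wording and format to the model.
GENERIC_CRITERIA = [
    "Agent's reply shows it understood the content of the user's audio",
    "Agent's reply is relevant to the question asked in the audio",
]


def render_rubric(criteria: list[str]) -> str:
    """Format criteria as a numbered list for a judge prompt (hypothetical helper)."""
    return "\n".join(f"{i}. {c}" for i, c in enumerate(criteria, start=1))


print(render_rubric(GENERIC_CRITERIA))
```

A judge given the first list could fail a perfectly good reply for phrasing alone, while the second list only fails replies that miss the point of the audio.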

Current-state violations

Why now

Surfaced by #612's review; the TS audio-to-text fix already landed, so the parity gap (audio-to-audio + Python audio-to-text still brittle) is fresh and the generic-criteria pattern is available to copy.


Classification: Refactor
Status: stub — tracking follow-up from PR #612's review. not-ready until investigated.

Labels

P3 - low: Low priority, nice to have
needs-plan: Issue needs investigation and plan before implementation
not-ready: Investigation pending; not ready for implementation
refactor: Code restructuring, no behavior change
