audio example tests retain brittle judge criteria (audio-to-audio + py audio-to-text): parity with #612's generic rewrite #655

## Background

PR #612's review found that the LLM-judge criteria in the audio example tests are **brittle**: they assert over-specific model behaviors that a correct agent might not exhibit, which risks false failures on live runs. #612 rewrote the TS `multimodal-audio-to-text` criteria to be model-behavior-generic but left two sibling tests with the old brittle criteria, so the examples are now inconsistent and prone to flaky failures.

## Current-state violations

- `javascript/examples/vitest/tests/multimodal-audio-to-audio.test.ts` retains brittle criteria: *"The agent correctly guesses it's a male voice"*, *"The agent repeats the question"*, *"The agent says what format the input was in (audio or text)"*.
- `python/examples/test_audio_to_text.py` retains the same brittle criteria, while its TS audio-to-text counterpart was already made generic in #612.
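
To make the target state concrete, here is a minimal sketch contrasting the brittle criteria quoted above with generic ones. The strings in `GENERIC_CRITERIA` and the `find_brittle` helper are hypothetical illustrations, not the wording or code that landed in #612; the real replacements should be copied from the TS audio-to-text test.

```python
# Criteria quoted in this issue as brittle: each asserts a specific model
# behavior that a correct agent might not exhibit on a live run.
BRITTLE_CRITERIA = [
    "The agent correctly guesses it's a male voice",
    "The agent repeats the question",
    "The agent says what format the input was in (audio or text)",
]

# Hypothetical generic replacements (assumed wording, NOT taken from #612).
GENERIC_CRITERIA = [
    "The agent responds to the content of the audio input",
    "The agent's response is relevant to the question that was asked",
]


def find_brittle(criteria: list[str]) -> list[str]:
    """Return the criteria that exactly match a known-brittle string,
    ignoring case and surrounding whitespace."""
    known = {c.casefold() for c in BRITTLE_CRITERIA}
    return [c for c in criteria if c.strip().casefold() in known]
```

A check like `find_brittle(...)` could be run over each test's criteria list during the follow-up to confirm that none of the three quoted strings remain.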

## Why now

Surfaced by #612's review; the TS audio-to-text fix already landed, so the parity gap (audio-to-audio + Python audio-to-text still brittle) is fresh and the generic-criteria pattern is available to copy.

---
*Classification: Refactor*
*Status: stub — tracking follow-up from PR #612's review. not-ready until investigated.*

