Background
PR #612's review found that the LLM-judge criteria in the audio example tests are brittle: they assert over-specific model behaviors that a correct agent might not exhibit, risking false failures on live runs. #612 rewrote the TS multimodal-audio-to-text criteria to be model-behavior-generic but left two siblings with the old brittle criteria, an inconsistent, flake-prone state.
Current-state violations
javascript/examples/vitest/tests/multimodal-audio-to-audio.test.ts retains brittle criteria: "The agent correctly guesses it's a male voice", "The agent repeats the question", "The agent says what format the input was in (audio or text)".
python/examples/test_audio_to_text.py retains the same brittle criteria, while its TS audio-to-text counterpart was already made generic in #612 ("chore(examples/voice/#486): retire legacy gpt-4o-audio-preview surface, migrate supported audio examples to gpt-audio-mini").
Why now
Surfaced by #612's review; the TS audio-to-text fix already landed, so the parity gap (audio-to-audio + Python audio-to-text still brittle) is fresh and the generic-criteria pattern is available to copy.
Classification: Refactor
Status: stub — tracking follow-up from PR #612's review. not-ready until investigated.