MyButtermilk

Summary

  • Add Gemini evaluation flow aligned with repo template and canary patterns.
  • Standardize CLI flags; improve robustness; document end-to-end usage.
  • Clean scripts to be cross-platform and avoid hardcoded paths or codec
    pitfalls.

Key Changes

  • Alignment: Refactor gemini/run_eval.py and run_eval_ml.py to follow shared
    data loading, normalization, manifest writing, and WER/RTFx computation
    patterns.
  • CLI Standardization: Replace --model_name with --model_id; use
    --max_eval_samples; add --no-streaming with sensible defaults to match the repo.
  • Torchcodec Avoidance: Load audio via soundfile from bytes/path (no torchcodec
    needed); cache WAVs under gemini/audio_cache/ (see the sketch after this list).
  • Scripts:
    • run_gemini.sh resolves Python from PATH by default and auto-loads .env.
    • run_gemini.ps1 resolves Python from PATH/PYTHON_CMD, auto-loads .env,
      sets PYTHONPATH.
    • Remove hardcoded absolute paths in all scripts.
  • Documentation: Overhaul gemini/README.md with Quick Start, How It Works,
    single-run examples, full-suite runners, scoring, env vars, troubleshooting, and
    file overview.
  • Security: Ensure .env is respected (ignored by git); scrub leaked key from
    run_temp.ps1 (now blank).
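
As referenced above, a minimal sketch of the soundfile-based loading and caching; the function name, cache layout, and sample fields are illustrative assumptions, not the PR's exact implementation:

```python
import io
import os

import soundfile as sf

CACHE_DIR = os.path.join("gemini", "audio_cache")  # assumed cache location

def load_and_cache_audio(sample, sample_id):
    """Read audio from raw bytes or a file path with soundfile and cache it as WAV."""
    audio = sample["audio"]
    if isinstance(audio, dict) and audio.get("bytes"):
        data, sr = sf.read(io.BytesIO(audio["bytes"]))
    else:
        data, sr = sf.read(audio["path"])

    os.makedirs(CACHE_DIR, exist_ok=True)
    wav_path = os.path.join(CACHE_DIR, f"{sample_id}.wav")
    if not os.path.exists(wav_path):
        sf.write(wav_path, data, sr)
    # return the cached path plus duration in seconds (useful later for RTFx)
    return wav_path, len(data) / sr
```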

Files Touched

  • gemini/run_eval.py: Align to template; manual audio read; normalized outputs;
    manifest + metrics.
  • gemini/run_eval_ml.py: Standardized flags; minor cleanups; consistent manifest
    writing.
  • gemini/run_gemini.sh: Generic Python resolution; .env auto-loading;
    standardized flags.
  • gemini/run_gemini.ps1: Generic Python resolution; .env auto-loading;
    standardized flags.
  • gemini/run_temp.ps1: Remove key; use script-relative paths; generic Python
    resolution.
  • gemini/README.md: Full documentation of the flow.

Why

  • Consistency with existing libraries in the repo.
  • Reliability across diverse environments (no absolute paths, no torchcodec
    dependency).
  • Easier setup and repeatability with .env and documented workflow.

How It Works

  • Loads datasets without decoding the audio column (English sets: reads raw
    bytes/paths with soundfile) and caches WAVs under audio_cache/.
  • Transcribes with Gemini (google-generativeai) using retries and exponential
    backoff (sketched after this list).
  • Normalizes references/predictions (English or multilingual).
  • Writes JSONL manifests to gemini/results/; computes WER/RTFx.
  • Optional scoring across files via normalizer/eval_utils.score_results.
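
A rough sketch of the retry-with-backoff pattern around the Gemini call, using the google-generativeai package named above; the prompt text, default model name, and helper names are assumptions rather than the exact code in this PR:

```python
import os
import time

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])  # key variable name is an assumption

def transcribe_with_retries(wav_path, model_name="gemini-2.5-pro", max_retries=5):
    """Send a cached WAV to Gemini, retrying with exponential backoff on failures."""
    model = genai.GenerativeModel(model_name)
    for attempt in range(max_retries):
        try:
            audio_file = genai.upload_file(wav_path)
            response = model.generate_content(
                [audio_file, "Transcribe this audio verbatim."]
            )
            return response.text.strip()
        except Exception:
            # back off 1s, 2s, 4s, ... before the next attempt
            time.sleep(2 ** attempt)
    raise RuntimeError(f"Transcription failed after {max_retries} retries: {wav_path}")
```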

Usage

  • Single run (English):
    • python run_eval.py --model_id "gemini/gemini-2.5-pro" --dataset_path
      "hf-audio/esb-datasets-test-only-sorted" --dataset "ami" --split "test"
      --max_eval_samples 2
  • Single run (Multilingual):
    • python run_eval_ml.py --model_id "gemini/gemini-2.5-pro" --dataset
      "nithinraok/asr-leaderboard-datasets" --config_name "fleurs_en" --language "en"
      --split "test" --max_eval_samples 2
  • Full suite:
    • Bash: ./run_gemini.sh
    • PowerShell: ./run_gemini.ps1
  • Scoring:
    • python -c "import normalizer.eval_utils as e; e.score_results('gemini/results', 'gemini/gemini-2.5-pro')"
      (a standalone sanity-check sketch follows this list)

Testing

  • Smoke-tested both Pro and Flash across core English datasets with small
    samples; manifests written; scoring prints per-dataset and composite metrics.
  • Verified no torchcodec import needed; ensured .env picked up automatically.
  • Breaking: Scripts now expect --model_id and --max_eval_samples instead of
    previous flags.
  • Non-breaking for other libraries; changes are contained to gemini/.

google-labs-jules bot and others added 3 commits August 20, 2025 11:24
…t to run both English and multilingual benchmarks for the Gemini models. The script also includes the logic to score the results after each model's evaluation is complete. I have added `--max_samples 2` for testing.
… torchcodec decode; load .env in run script
@MyButtermilk (Author)

According to Google's own tests, Gemini should deliver excellent transcription quality. This is why I think it should be represented in the OpenASR Benchmark, to convince people of its quality compared to other models.

Could a Googler (@google-gemini, @logankilpatrick, @ammaarreshi, @hapticdata, @markmcd) please run it, or provide credits to @Deep-unlearning so he can run it to confirm your results? Thank you very much in advance.
