MyButtermilk

Summary

  • Add Gemini evaluation flow aligned with repo template and canary patterns.
  • Standardize CLI flags; improve robustness; document end-to-end usage.
  • Clean scripts to be cross-platform and avoid hardcoded paths or codec
    pitfalls.

Key Changes

  • Alignment: Refactor gemini/run_eval.py and run_eval_ml.py to follow shared
    data loading, normalization, manifest writing, and WER/RTFx computation
    patterns.
  • CLI Standardization: Replace --model_name with --model_id; use
    --max_eval_samples; add --no-streaming with sensible defaults to match the repo.
  • Torchcodec Avoidance: Load audio via soundfile from bytes/path (no torchcodec
    needed); cache WAVs under gemini/audio_cache/ (see the sketch after this list).
  • Scripts:
    • run_gemini.sh resolves Python from PATH by default and auto-loads .env.
    • run_gemini.ps1 resolves Python from PATH/PYTHON_CMD, auto-loads .env,
      sets PYTHONPATH.
    • Remove hardcoded absolute paths in all scripts.
  • Documentation: Overhaul gemini/README.md with Quick Start, How It Works,
    single-run examples, full-suite runners, scoring, env vars, troubleshooting, and
    file overview.
  • Security: Ensure .env is respected (ignored by git); scrub leaked key from
    run_temp.ps1 (now blank).
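
As referenced above, a minimal sketch of the soundfile-based loading and caching; the function name, cache layout, and sample fields are illustrative assumptions, not the PR's exact implementation:

```python
import io
import os

import soundfile as sf

CACHE_DIR = os.path.join("gemini", "audio_cache")  # assumed cache location

def load_and_cache_audio(sample, sample_id):
    """Read audio from raw bytes or a file path with soundfile and cache it as WAV."""
    audio = sample["audio"]
    if isinstance(audio, dict) and audio.get("bytes"):
        data, sr = sf.read(io.BytesIO(audio["bytes"]))
    else:
        data, sr = sf.read(audio["path"])

    os.makedirs(CACHE_DIR, exist_ok=True)
    wav_path = os.path.join(CACHE_DIR, f"{sample_id}.wav")
    if not os.path.exists(wav_path):
        sf.write(wav_path, data, sr)
    # return the cached path plus duration in seconds (useful later for RTFx)
    return wav_path, len(data) / sr
```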

Files Touched

  • gemini/run_eval.py: Align to template; manual audio read; normalized outputs;
    manifest + metrics.
  • gemini/run_eval_ml.py: Standardized flags; minor cleanups; consistent manifest
    writing.
  • gemini/run_gemini.sh: Generic Python resolution; .env auto-loading;
    standardized flags.
  • gemini/run_gemini.ps1: Generic Python resolution; .env auto-loading;
    standardized flags.
  • gemini/run_temp.ps1: Remove key; use script-relative paths; generic Python
    resolution.
  • gemini/README.md: Full documentation of the flow.

Why

  • Consistency with existing libraries in the repo.
  • Reliability across diverse environments (no absolute paths, no torchcodec
    dependency).
  • Easier setup and repeatability with .env and documented workflow.

How It Works

  • Loads datasets without decoding the audio column (English sets: reads raw
    bytes/paths with soundfile) and caches WAVs under audio_cache/.
  • Transcribes with Gemini (google-generativeai) using retries and exponential
    backoff (sketched after this list).
  • Normalizes references/predictions (English or multilingual).
  • Writes JSONL manifests to gemini/results/; computes WER/RTFx.
  • Optional scoring across files via normalizer/eval_utils.score_results.
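
A rough sketch of the retry-with-backoff pattern around the Gemini call, using the google-generativeai package named above; the prompt text, default model name, and helper names are assumptions rather than the exact code in this PR:

```python
import os
import time

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])  # key variable name is an assumption

def transcribe_with_retries(wav_path, model_name="gemini-2.5-pro", max_retries=5):
    """Send a cached WAV to Gemini, retrying with exponential backoff on failures."""
    model = genai.GenerativeModel(model_name)
    for attempt in range(max_retries):
        try:
            audio_file = genai.upload_file(wav_path)
            response = model.generate_content(
                [audio_file, "Transcribe this audio verbatim."]
            )
            return response.text.strip()
        except Exception:
            # back off 1s, 2s, 4s, ... before the next attempt
            time.sleep(2 ** attempt)
    raise RuntimeError(f"Transcription failed after {max_retries} retries: {wav_path}")
```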

Usage

  • Single run (English):
    • python run_eval.py --model_id "gemini/gemini-2.5-pro" --dataset_path
      "hf-audio/esb-datasets-test-only-sorted" --dataset "ami" --split "test"
      --max_eval_samples 2
  • Single run (Multilingual):
    • python run_eval_ml.py --model_id "gemini/gemini-2.5-pro" --dataset
      "nithinraok/asr-leaderboard-datasets" --config_name "fleurs_en" --language "en"
      --split "test" --max_eval_samples 2
  • Full suite:
    • Bash: ./run_gemini.sh
    • PowerShell: ./run_gemini.ps1
  • Scoring:
    • python -c "import normalizer.eval_utils as e; e.score_results('gemini/results', 'gemini/gemini-2.5-pro')"
      (a standalone sanity-check sketch follows this list)

Testing

  • Smoke-tested both Pro and Flash across core English datasets with small
    samples; manifests written; scoring prints per-dataset and composite metrics.
  • Verified no torchcodec import needed; ensured .env picked up automatically.
  • Breaking: Scripts now expect --model_id and --max_eval_samples instead of
    previous flags.
  • Non-breaking for other libraries; changes are contained to gemini/.

google-labs-jules bot and others added 3 commits August 20, 2025 11:24
…t to run both English and multilingual benchmarks for the Gemini models. The script also includes the logic to score the results after each model's evaluation is complete. I have added `--max_samples 2` for testing.
… torchcodec decode; load .env in run script
@MyButtermilk (Author)

According to Google's own tests, Gemini should deliver excellent transcription quality. This is why I think it should be represented in the OpenASR Benchmark, to convince people of its quality compared to other models.

Could a Googler (@google-gemini, @logankilpatrick, @ammaarreshi, @hapticdata, @markmcd) please run it, or provide credits to @Deep-unlearning so he can run it to confirm your results? Thank you very much in advance.
