FLEURS Audio Dataset Downloader

A Python tool for downloading audio samples from the FLEURS dataset. It uses UV for dependency management.

Overview

FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) is a speech dataset of parallel sentences covering 102 languages. This tool downloads audio samples for the languages you choose.

✅ Status: This tool successfully downloads real FLEURS audio data directly from the Hugging Face dataset repository. It downloads the compressed archives, extracts the audio files, and provides them with complete metadata including transcriptions, speaker information, and audio characteristics.

Supported Languages

The tool dynamically discovers all available languages from the FLEURS dataset on Hugging Face using the official API. This ensures access to the complete dataset of 102 languages.

Use uv run fleurs-download --list to see all currently available languages.
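
As a rough illustration of the discovery step, language codes can be derived from the file paths in the dataset repository. This is a minimal sketch, not the tool's actual code: the function name is hypothetical, the data/<lang_code>/... layout is an assumption, and the file list is passed in rather than fetched from Hugging Face.

```python
def language_codes_from_paths(paths):
    """Collect sorted, unique language codes from repository file paths.

    Assumption: files are laid out as data/<lang_code>/..., which may not
    match the real repository layout exactly.
    """
    codes = set()
    for path in paths:
        parts = path.split("/")
        # Keep only entries nested under data/<lang_code>/
        if len(parts) >= 3 and parts[0] == "data" and parts[1]:
            codes.add(parts[1])
    return sorted(codes)


# Hypothetical file listing used only for demonstration.
example_paths = [
    "README.md",
    "data/en_us/train.tsv",
    "data/en_us/audio/train.tar.gz",
    "data/fr_fr/audio/dev.tar.gz",
]
print(language_codes_from_paths(example_paths))
```

Deriving the list from the repository contents, rather than hard-coding it, is what lets a tool pick up every language the dataset currently offers.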

Installation

This project uses UV for dependency management. Make sure you have UV installed:

# Install UV (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

Usage

Basic Usage

Download 3 samples for English to a specified directory:

uv run fleurs-download --lang en_us --samples 3 ./output_dir

Multiple Languages

Download samples for multiple languages:

uv run fleurs-download --lang en_us --lang fr_fr --lang hi_in --samples 3 ./output_dir

All Supported Languages

When no --lang is given, the tool downloads samples for every supported language:

uv run fleurs-download --samples 3 ./output_dir

Different Dataset Splits

Choose from different dataset splits:

# Use validation split
uv run fleurs-download --lang en_us --samples 5 --split validation ./output_dir

# Use test split
uv run fleurs-download --lang fr_fr --samples 2 --split test ./output_dir

Help and Language Listing

# Show help (both work)
uv run fleurs-download -h
uv run fleurs-download --help

# List available languages
uv run fleurs-download --list
uv run fleurs-download -L

Complete Help Output

Usage: fleurs-download [OPTIONS] [OUTPUT_DIR]

  Download audio samples from the FLEURS dataset.

  OUTPUT_DIR: Directory where audio files will be saved

  Examples:     uv run fleurs-download --lang en_us --lang fr_fr --samples 3
  ./audio_samples     uv run fleurs-download --lang hi_in --samples 5 --split
  validation ./data

Options:
  -l, --lang TEXT            Language codes to download (e.g., en_us, fr_fr,
                             he_il). Can be specified multiple times.
  -s, --samples INTEGER      Number of samples to download per language
                             (default: 3)
  -p, --split TEXT           Dataset split to use (train, dev, test). Default:
                             train
  -r, --random-seed INTEGER  Random seed for reproducible sampling. If not
                             specified, uses random sampling.
  -R, --reset                Clear output directory before downloading.
                             Otherwise, append new samples to existing data.
  -n, --normalize            Normalize audio volume to -20dB RMS for
                             consistent loudness across samples.
  -L, --list                 List all available language codes and exit.
  -h, --help                 Show this message and exit.

Available Languages Output

🔍 Fetching available languages from Hugging Face...
✅ Found 102 available languages from Hugging Face
📋 Available FLEURS language codes:
==================================================
  af_za           - Afrikaans (South Africa)
  am_et           - Amharic (Ethiopia)
  ar_eg           - Arabic (Egypt)
  as_in           - As (India)
  ast_es          - Ast (Spain)
  az_az           - Azerbaijani (Azerbaijan)
  be_by           - Belarusian (Belarus)
  bg_bg           - Bulgarian (Bulgaria)
  bn_in           - Bengali (India)
  bs_ba           - Bosnian (Bosnia and Herzegovina)
  ca_es           - Catalan (Spain)
  ceb_ph          - Ceb (Philippines)
  ckb_iq          - Ckb (Iraq)
  cmn_hans_cn     - Mandarin Chinese (Simplified)
  cs_cz           - Czech (Czech Republic)
  cy_gb           - Welsh (GB)
  da_dk           - Danish (Denmark)
  de_de           - German (Germany)
  el_gr           - El (Greece)
  en_us           - English (US)
  ... and 82 more languages

Total: 102 languages available

Usage examples:
  uv run fleurs-download -l en_us -s 3 ./output
  uv run fleurs-download -l fr_fr -l de_de -s 5 ./multi_lang

Random Sampling

By default, the tool randomly selects samples from the available dataset and displays the seed used:

# Random sampling (different samples each time)
uv run fleurs-download --lang en_us --samples 5 ./random_samples
# 🎲 Generated random seed: 860593063 (use --random-seed 860593063 to reproduce)

# Reproducible sampling with specified seed
uv run fleurs-download --lang en_us --samples 5 --random-seed 42 ./reproducible_samples
# 🎲 Using random seed: 42

Key Features:

Automatic Seed Display: Shows the seed used for random sampling
Reproducibility: Copy the displayed seed to reproduce exact results
Diverse Sampling: Draws a varied set of samples instead of always returning the first N in the dataset
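
The seed behaviour described above can be sketched in a few lines. This is an illustrative assumption about how such sampling might work, not the tool's implementation; the function name and the seed range are hypothetical.

```python
import random


def pick_sample_indices(total, count, seed=None):
    """Return (seed, indices) for a reproducible random selection."""
    if seed is None:
        # No seed given: generate one so it can be reported and reused.
        seed = random.randrange(2**31)
    rng = random.Random(seed)
    # Never ask for more samples than exist.
    indices = sorted(rng.sample(range(total), min(count, total)))
    return seed, indices


# A run without a seed, then a second run that reuses the reported seed.
seed, first = pick_sample_indices(total=1000, count=5)
_, second = pick_sample_indices(total=1000, count=5, seed=seed)
print(seed, first == second)
```

Using a dedicated random.Random instance keeps the selection independent of any other random calls in the program, which is what makes reusing a seed reliable.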

Volume Normalization

The --normalize option ensures consistent audio levels across all samples:

# Download with volume normalization
uv run fleurs-download -l en_us -s 5 -n ./normalized_audio

# Using shorthand options
uv run fleurs-download -l fr_fr -s 3 -n -r 42 ./data

Benefits:

Consistent Volume: All samples normalized to -20dB RMS level
ML-Ready: Uniform audio levels improve training consistency
Quality Control: Prevents overly quiet or loud samples
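
To make the -20 dB RMS target concrete, here is a minimal sketch of RMS normalization on floating-point samples. It assumes the level is measured relative to a full scale of 1.0 and that peaks are clipped after scaling; the function name is hypothetical and the tool may handle these details differently.

```python
import math


def normalize_rms(samples, target_db=-20.0):
    """Scale float samples (range -1.0..1.0) to a target RMS level in dB."""
    if not samples:
        return []
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return list(samples)  # silence: nothing to scale
    target_rms = 10 ** (target_db / 20)  # -20 dB corresponds to 0.1
    gain = target_rms / rms
    # Clip to the valid range so loud peaks cannot overflow.
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


# One second of a very quiet 440 Hz tone at 16000 Hz.
quiet_tone = [0.01 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
normalized = normalize_rms(quiet_tone)
rms_after = math.sqrt(sum(s * s for s in normalized) / len(normalized))
print(round(20 * math.log10(rms_after), 2))
```

The gain is a single multiplier for the whole clip, so the relative dynamics within a sample are preserved; only the overall level changes.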

Reset vs Append Mode

By default, the tool appends new samples to existing data:

# First run - downloads 3 samples
uv run fleurs-download --lang en_us --samples 3 ./my_data

# Second run - adds 2 more samples (total: 5)
uv run fleurs-download --lang en_us --samples 2 ./my_data

# Reset mode - clears directory and downloads fresh
uv run fleurs-download --lang en_us --samples 5 --reset ./my_data

The tool automatically:

Overwrites duplicates: Re-downloads and overwrites samples with the same ID
Updates metadata: Combines existing and new sample information
Preserves unique data: Existing samples with different IDs remain unless --reset is used
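
The three behaviours listed above amount to merging two metadata lists keyed by sample ID. The following sketch shows one way to do that; it assumes each entry carries an "id" field as in the metadata file, and the function name is hypothetical.

```python
def merge_metadata(existing, new):
    """Combine two metadata lists; entries in `new` win on matching ids."""
    by_id = {entry["id"]: entry for entry in existing}
    for entry in new:
        by_id[entry["id"]] = entry  # overwrite duplicates, add the rest
    return [by_id[key] for key in sorted(by_id)]


existing = [
    {"id": 1, "filename": "en_us_000001.wav"},
    {"id": 2, "filename": "en_us_000002.wav"},
]
new = [
    {"id": 2, "filename": "en_us_000002.wav", "duration_seconds": 7.5},
    {"id": 3, "filename": "en_us_000003.wav"},
]
merged = merge_metadata(existing, new)
print([entry["id"] for entry in merged])
```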

Output Structure

The tool creates the following directory structure:

output_dir/
├── en_us/
│   ├── en_us_000001.wav
│   ├── en_us_000002.wav
│   ├── en_us_000003.wav
│   └── en_us_metadata.json
├── fr_fr/
│   ├── fr_fr_000001.wav
│   ├── fr_fr_000002.wav
│   ├── fr_fr_000003.wav
│   └── fr_fr_metadata.json
└── hi_in/
    ├── hi_in_000001.wav
    ├── hi_in_000002.wav
    ├── hi_in_000003.wav
    └── hi_in_metadata.json

Each language directory contains:

Audio files: Real WAV files from the FLEURS dataset, 16 kHz sampling rate
Metadata file: JSON file with actual transcriptions, speaker information, and audio metadata

Real FLEURS Data

The tool downloads authentic FLEURS dataset content:

Audio: Real speech recordings from native speakers, converted to PCM 16-bit mono 16000 Hz
Transcriptions: Actual parallel sentences from the FLoRes benchmark
Metadata: Complete information including speaker gender, audio duration, and sampling details
Quality: Professional-grade speech data suitable for research and development
Caching: Downloaded archives are cached in .cache/fleurs/ to avoid re-downloading
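
If you want to confirm that a downloaded file really is PCM 16-bit mono at 16000 Hz, the standard library's wave module can read the header. This is a small verification sketch with hypothetical function names; the demo file is generated on the spot so the example runs without any downloaded data.

```python
import os
import tempfile
import wave


def wav_format(path):
    """Return (channels, bits_per_sample, sample_rate) for a WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth() * 8, wav.getframerate()


def is_expected_format(path):
    """True if the file is 16-bit mono PCM at 16000 Hz."""
    return wav_format(path) == (1, 16, 16000)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    with wave.open(demo_path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav.setframerate(16000)  # 16000 Hz
        wav.writeframes(b"\x00\x00" * 16000)  # one second of silence
    demo_ok = is_expected_format(demo_path)
print(demo_ok)
```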

Metadata Format

The metadata JSON file contains detailed information about each audio sample:

[
  {
    "id": 151,
    "filename": "en_us_000151.wav",
    "transcription": "sir richard branson's virgin group had a bid for the bank rejected prior to the bank's nationalisation",
    "raw_transcription": "Sir Richard Branson's Virgin Group had a bid for the bank rejected prior to the bank's nationalisation.",
    "language": "English",
    "gender": 1,
    "num_samples": 120000,
    "sampling_rate": 16000,
    "duration_seconds": 7.5,
    "original_filename": "11559549184357409250.wav",
    "format": "PCM 16-bit mono"
  }
]
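
The fields in an entry are related: duration_seconds equals num_samples divided by sampling_rate (120000 / 16000 = 7.5), and in this example the filename matches the language code plus the zero-padded ID. The sketch below checks both. Treat the filename rule as an assumption inferred from the example rather than a documented guarantee; the function name is hypothetical.

```python
import json

metadata_json = """
[
  {
    "id": 151,
    "filename": "en_us_000151.wav",
    "num_samples": 120000,
    "sampling_rate": 16000,
    "duration_seconds": 7.5
  }
]
"""


def check_entry(entry, lang_code):
    """Return a list of problems found in one metadata entry."""
    problems = []
    # Assumed naming rule: <lang_code>_<id padded to 6 digits>.wav
    expected_name = f"{lang_code}_{entry['id']:06d}.wav"
    if entry["filename"] != expected_name:
        problems.append(f"unexpected filename: {entry['filename']}")
    expected_duration = entry["num_samples"] / entry["sampling_rate"]
    if abs(entry["duration_seconds"] - expected_duration) > 0.01:
        problems.append(f"duration mismatch: {entry['duration_seconds']}")
    return problems


entries = json.loads(metadata_json)
for entry in entries:
    print(entry["filename"], check_entry(entry, "en_us"))
```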

Examples

Example 1: Download samples for specific languages

uv run fleurs-download --lang en_us --lang de_de --lang it_it --samples 5 ./my_audio_data

Example 2: Download validation samples

uv run fleurs-download --lang hi_in --lang th_th --samples 10 --split validation ./validation_data

Example 3: Download all languages with fewer samples

uv run fleurs-download --samples 1 ./quick_test

Example 4: Error handling for invalid languages

uv run fleurs-download --lang invalid_code --samples 3 ./test
# ❌ Invalid language codes: invalid_code
# Available codes: af_za, am_et, ar_eg, as_in, ast_es, az_az, be_by, bg_bg, bn_in, bs_ba...
# Use --list to see all available languages
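
A check like the one above can be sketched as a comparison between the requested codes and the discovered ones. The function name and the shortened list of available codes are hypothetical; this is not the tool's actual validation code.

```python
def find_invalid_codes(requested, available):
    """Return requested codes that are not available, in request order."""
    available_set = set(available)
    return [code for code in requested if code not in available_set]


# Shortened, illustrative list of available codes.
available = ["af_za", "am_et", "en_us", "fr_fr"]
invalid = find_invalid_codes(["en_us", "invalid_code"], available)
if invalid:
    print("Invalid language codes:", ", ".join(invalid))
```

Validating every code before starting any download avoids fetching large archives for a request that would fail partway through.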

Caching

The tool automatically caches downloaded archives in the .cache/fleurs/ directory to improve performance:

First download: Downloads and caches the archive (can be large, roughly 1-2 GB per language)
Subsequent downloads: Uses cached archive, much faster
Cache location: .cache/fleurs/ in the project directory
Cache management: Archives are reused across different sample counts and output directories

To clear the cache:

rm -rf .cache/fleurs/
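
The download-once behaviour follows a common pattern: look for the archive in the cache directory and fetch it only when it is missing. Below is a minimal sketch under stated assumptions: the <cache_dir>/<lang_code>/<split>.tar.gz layout and all names are hypothetical, and the download is replaced by a stand-in function so the example runs offline.

```python
import os
import tempfile


def cached_archive(cache_dir, lang_code, split, download):
    """Return the path to a cached archive, downloading it only if missing.

    `download` is a caller-supplied function that writes the archive to the
    given path; the cache layout used here is an assumption.
    """
    path = os.path.join(cache_dir, lang_code, f"{split}.tar.gz")
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        download(path)
    return path


calls = []


def fake_download(path):
    """Stand-in for a real network download."""
    calls.append(path)
    with open(path, "wb") as handle:
        handle.write(b"placeholder archive bytes")


with tempfile.TemporaryDirectory() as cache_dir:
    first = cached_archive(cache_dir, "en_us", "train", fake_download)
    second = cached_archive(cache_dir, "en_us", "train", fake_download)
print(len(calls), first == second)
```

Because the cache key depends only on language and split, the same archive serves any sample count or output directory, as described above.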

Development

Code Formatting

uv run black .
uv run isort .

Dataset Information

Source: Google FLEURS Dataset
Audio Format: WAV, 16 kHz sampling rate
Languages: 102 languages total (all available via API)
Splits: Train (~1000 samples), Validation (~400 samples), Test (~400 samples)
Random Sampling: Automatic seed display for reproducibility

License

This tool is provided as-is. Please refer to the FLEURS dataset license for usage terms of the downloaded data.
