A Python tool to download audio samples from the FLEURS dataset using UV for dependency management.
FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) is a speech dataset covering 102 languages with parallel sentences. This tool downloads audio samples for the languages you specify.
Status: This tool successfully downloads real FLEURS audio data directly from the Hugging Face dataset repository. It downloads the compressed archives, extracts the audio files, and provides them with complete metadata including transcriptions, speaker information, and audio characteristics.
The tool dynamically discovers all available languages from the FLEURS dataset on Hugging Face using the official API. This ensures access to the complete dataset of 102 languages.
Use uv run fleurs-download --list to see all currently available languages.
This project uses UV for dependency management. Make sure you have UV installed:
# Install UV (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
Download 3 samples for English to a specified directory:
uv run fleurs-download --lang en_us --samples 3 ./output_dir
Download samples for multiple languages:
uv run fleurs-download --lang en_us --lang fr_fr --lang hi_in --samples 3 ./output_dir
Download samples for all supported languages (if no --lang specified):
uv run fleurs-download --samples 3 ./output_dir
Choose from different dataset splits:
# Use validation split
uv run fleurs-download --lang en_us --samples 5 --split validation ./output_dir
# Use test split
uv run fleurs-download --lang fr_fr --samples 2 --split test ./output_dir
# Show help (both work)
uv run fleurs-download -h
uv run fleurs-download --help
# List available languages
uv run fleurs-download --list
uv run fleurs-download -L
Usage: fleurs-download [OPTIONS] [OUTPUT_DIR]

  Download audio samples from the FLEURS dataset.

  OUTPUT_DIR: Directory where audio files will be saved

  Examples:
    uv run fleurs-download --lang en_us --lang fr_fr --samples 3 ./audio_samples
    uv run fleurs-download --lang hi_in --samples 5 --split validation ./data

Options:
  -l, --lang TEXT            Language codes to download (e.g., en_us, fr_fr,
                             he_il). Can be specified multiple times.
  -s, --samples INTEGER      Number of samples to download per language
                             (default: 3)
  -p, --split TEXT           Dataset split to use (train, dev, test). Default:
                             train
  -r, --random-seed INTEGER  Random seed for reproducible sampling. If not
                             specified, uses random sampling.
  -R, --reset                Clear output directory before downloading.
                             Otherwise, append new samples to existing data.
  -n, --normalize            Normalize audio volume to -20dB RMS for
                             consistent loudness across samples.
  -L, --list                 List all available language codes and exit.
  -h, --help                 Show this message and exit.
Fetching available languages from Hugging Face...
Found 102 available languages from Hugging Face
Available FLEURS language codes:
==================================================
af_za - Afrikaans (South Africa)
am_et - Amharic (Ethiopia)
ar_eg - Arabic (Egypt)
as_in - As (India)
ast_es - Ast (Spain)
az_az - Azerbaijani (Azerbaijan)
be_by - Belarusian (Belarus)
bg_bg - Bulgarian (Bulgaria)
bn_in - Bengali (India)
bs_ba - Bosnian (Bosnia and Herzegovina)
ca_es - Catalan (Spain)
ceb_ph - Ceb (Philippines)
ckb_iq - Ckb (Iraq)
cmn_hans_cn - Mandarin Chinese (Simplified)
cs_cz - Czech (Czech Republic)
cy_gb - Welsh (GB)
da_dk - Danish (Denmark)
de_de - German (Germany)
el_gr - El (Greece)
en_us - English (US)
... and 82 more languages
Total: 102 languages available
Usage examples:
uv run fleurs-download -l en_us -s 3 ./output
uv run fleurs-download -l fr_fr -l de_de -s 5 ./multi_lang
By default, the tool randomly selects samples from the available dataset and displays the seed used:
# Random sampling (different samples each time)
uv run fleurs-download --lang en_us --samples 5 ./random_samples
# Generated random seed: 860593063 (use --random-seed 860593063 to reproduce)
# Reproducible sampling with specified seed
uv run fleurs-download --lang en_us --samples 5 --random-seed 42 ./reproducible_samples
# Using random seed: 42
Key Features:
- Automatic Seed Display: Shows the seed used for random sampling
- Reproducibility: Copy the displayed seed to reproduce exact results
- Diverse Sampling: Draws a varied set of samples rather than always the first N samples in the dataset
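The seed behaviour described above can be illustrated with a minimal sketch. This is not the tool's actual code: the function name `pick_sample_indices` and the seed range are assumptions made for illustration, and only the standard `random` module is used.

```python
import random


def pick_sample_indices(total, count, seed=None):
    """Return (seed, indices): `count` distinct indices drawn from range(total).

    Hypothetical helper; the real tool may select samples differently.
    """
    if seed is None:
        # No seed supplied: generate one so it can be displayed and reused later.
        seed = random.SystemRandom().randrange(2**31)
    rng = random.Random(seed)
    count = min(count, total)  # never ask for more samples than exist
    return seed, sorted(rng.sample(range(total), count))


# First call generates a seed; passing that seed back reproduces the selection.
seed, first = pick_sample_indices(total=1000, count=5)
_, second = pick_sample_indices(total=1000, count=5, seed=seed)
print(f"seed={seed} reproducible={first == second}")
```

Using a dedicated `random.Random(seed)` instance keeps the selection independent of any other code that touches the global random state.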
The --normalize option ensures consistent audio levels across all samples:
# Download with volume normalization
uv run fleurs-download -l en_us -s 5 -n ./normalized_audio
# Using shorthand options
uv run fleurs-download -l fr_fr -s 3 -n -r 42 ./data
Benefits:
- Consistent Volume: All samples normalized to -20 dB RMS level
- ML-Ready: Uniform audio levels improve training consistency
- Quality Control: Prevents overly quiet or loud samples
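To make the normalization target concrete, here is a rough sketch of RMS normalization in plain Python. It assumes samples are floats in [-1.0, 1.0] and that "-20 dB RMS" is measured relative to full scale; the README does not state either, and the function names are hypothetical rather than the tool's own.

```python
import math


def rms_db(samples):
    """RMS level in dB relative to full scale (assumes floats in [-1.0, 1.0])."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")


def normalize_rms(samples, target_db=-20.0):
    """Scale samples so their RMS level matches target_db, clipping to [-1.0, 1.0]."""
    current = rms_db(samples)
    if current == float("-inf"):
        return list(samples)  # pure silence: nothing to scale
    gain = 10 ** ((target_db - current) / 20)
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


# A quiet one-second 440 Hz tone at 16 kHz, then the same tone normalized.
quiet = [0.01 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
normalized = normalize_rms(quiet)
print(round(rms_db(quiet), 1), round(rms_db(normalized), 1))
```

The clipping step means very peaky signals can end up slightly below the target level; a production implementation might use a limiter instead.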
By default, the tool appends new samples to existing data:
# First run - downloads 3 samples
uv run fleurs-download --lang en_us --samples 3 ./my_data
# Second run - adds 2 more samples (total: 5)
uv run fleurs-download --lang en_us --samples 2 ./my_data
# Reset mode - clears directory and downloads fresh
uv run fleurs-download --lang en_us --samples 5 --reset ./my_data
The tool automatically:
- Overwrites duplicates: Re-downloads and overwrites samples with the same ID
- Updates metadata: Combines existing and new sample information
- Preserves unique data: Existing samples with different IDs remain unless --reset is used
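The append rules above amount to a merge keyed on sample ID. The following sketch shows one way such a merge could work; `merge_metadata` is a hypothetical name, and the only field assumed is the `id` key that appears in the metadata records.

```python
def merge_metadata(existing, new):
    """Combine two lists of metadata records keyed by "id".

    Records in `new` replace records in `existing` that share an ID;
    all other existing records are kept. Illustrative only.
    """
    merged = {entry["id"]: entry for entry in existing}
    for entry in new:
        merged[entry["id"]] = entry  # same ID: the newer record wins
    return [merged[key] for key in sorted(merged)]


existing = [
    {"id": 1, "filename": "en_us_000001.wav"},
    {"id": 2, "filename": "en_us_000002.wav"},
]
new = [
    {"id": 2, "filename": "en_us_000002.wav", "refreshed": True},
    {"id": 7, "filename": "en_us_000007.wav"},
]
combined = merge_metadata(existing, new)
print([entry["id"] for entry in combined])
```

With --reset, the equivalent would be to start from an empty `existing` list.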
The tool creates the following directory structure:
output_dir/
├── en_us/
│   ├── en_us_000001.wav
│   ├── en_us_000002.wav
│   ├── en_us_000003.wav
│   └── en_us_metadata.json
├── fr_fr/
│   ├── fr_fr_000001.wav
│   ├── fr_fr_000002.wav
│   ├── fr_fr_000003.wav
│   └── fr_fr_metadata.json
└── hi_in/
    ├── hi_in_000001.wav
    ├── hi_in_000002.wav
    ├── hi_in_000003.wav
    └── hi_in_metadata.json
Each language directory contains:
- Audio files: Real WAV files from the FLEURS dataset, 16 kHz sampling rate
- Metadata file: JSON file with actual transcriptions, speaker information, and audio metadata
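Given this layout, downstream code can load a language's metadata and check it against the audio files on disk. The sketch below assumes the `<lang>/<lang>_metadata.json` naming shown in the tree and a `filename` field in each record, as in the metadata example in this README; the helper names are hypothetical.

```python
import json
from pathlib import Path


def load_metadata(output_dir, lang):
    """Load <output_dir>/<lang>/<lang>_metadata.json and return its list of records."""
    path = Path(output_dir) / lang / f"{lang}_metadata.json"
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def missing_audio(output_dir, lang):
    """Return filenames listed in the metadata but absent from the language directory."""
    lang_dir = Path(output_dir) / lang
    return [
        record["filename"]
        for record in load_metadata(output_dir, lang)
        if not (lang_dir / record["filename"]).exists()
    ]
```

For example, `missing_audio("./output_dir", "en_us")` should return an empty list after a successful download.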
The tool downloads authentic FLEURS dataset content:
- Audio: Real speech recordings from native speakers, converted to PCM 16-bit mono 16000 Hz
- Transcriptions: Actual parallel sentences from the FLoRes benchmark
- Metadata: Complete information including speaker gender, audio duration, and sampling details
- Quality: Professional-grade speech data suitable for research and development
- Caching: Downloaded archives are cached in .cache/ to avoid re-downloading
- Format: All audio files are standardised to PCM 16-bit mono 16000 Hz for consistency
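The stated audio format (PCM 16-bit, mono, 16000 Hz) can be checked on any downloaded file with Python's standard `wave` module. This is an independent verification sketch, not part of the tool; the function names are made up for the example.

```python
import wave


def wav_format(path):
    """Read basic format information from a WAV file header."""
    with wave.open(str(path), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sampling_rate": wav.getframerate(),
            "num_samples": wav.getnframes(),
        }


def is_expected_format(path):
    """True if the file is mono, 16-bit (2 bytes per sample), 16000 Hz."""
    info = wav_format(path)
    return (
        info["channels"] == 1
        and info["sample_width_bytes"] == 2
        and info["sampling_rate"] == 16000
    )
```

Dividing `num_samples` by `sampling_rate` gives the duration in seconds, which can be compared with the `duration_seconds` value in the metadata.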
The metadata JSON file contains detailed information about each audio sample:
[
  {
    "id": 151,
    "filename": "en_us_000151.wav",
    "transcription": "sir richard branson's virgin group had a bid for the bank rejected prior to the bank's nationalisation",
    "raw_transcription": "Sir Richard Branson's Virgin Group had a bid for the bank rejected prior to the bank's nationalisation.",
    "language": "English",
    "gender": 1,
    "num_samples": 120000,
    "sampling_rate": 16000,
    "duration_seconds": 7.5,
    "original_filename": "11559549184357409250.wav",
    "format": "PCM 16-bit mono"
  }
]
uv run fleurs-download --lang en_us --lang de_de --lang it_it --samples 5 ./my_audio_data
uv run fleurs-download --lang hi_in --lang th_th --samples 10 --split validation ./validation_data
uv run fleurs-download --samples 1 ./quick_test
uv run fleurs-download --lang invalid_code --samples 3 ./test
# Invalid language codes: invalid_code
# Available codes: af_za, am_et, ar_eg, as_in, ast_es, az_az, be_by, bg_bg, bn_in, bs_ba...
# Use --list to see all available languages
The tool automatically caches downloaded archives in the .cache/fleurs/ directory to improve performance:
- First download: Downloads and caches the archive (can be large, ~1-2 GB per language)
- Subsequent downloads: Uses cached archive, much faster
- Cache location: .cache/fleurs/ in the project directory
- Cache management: Archives are reused across different sample counts and output directories
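The caching behaviour listed above follows a common "download only on a cache miss" pattern, sketched below. The function `cached_archive`, its parameters, and the archive file name in the usage note are all hypothetical; the tool's real cache layout inside .cache/fleurs/ is not documented here.

```python
from pathlib import Path


def cached_archive(cache_dir, name, fetch):
    """Return the cached path for `name`, calling fetch(path) only on a cache miss.

    `fetch` is any callable that writes the archive to the given path;
    it stands in for the actual download step.
    """
    path = Path(cache_dir) / name
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        fetch(path)  # first request: download and store
    return path  # later requests: reuse the stored file
```

A call such as `cached_archive(".cache/fleurs", "en_us_train.tar.gz", download)` would invoke `download` once and return the same path on every later call, independent of the sample count or output directory.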
To clear the cache:
rm -rf .cache/fleurs/
uv run black .
uv run isort .
- Source: Google FLEURS Dataset
- Audio Format: WAV, 16 kHz sampling rate
- Languages: 102 languages total (all available via API)
- Splits: Train (~1000 samples), Validation (~400 samples), Test (~400 samples)
- Random Sampling: Automatic seed display for reproducibility
This tool is provided as-is. Please refer to the FLEURS dataset license for usage terms of the downloaded data.