One repo. One command. Ready-to-train datasets for multi-talker ASR.
Streamline data prep with reproducible manifests, benchmark-ready cutsets, and automatic dependency handling — all aligned with BUT-FIT DiCoW models.
- Unified CLI for single-mic and multi-mic datasets
- Reproducible & sharable manifests and cutsets (via Lhotse)
- Automatic handling of nested dataset dependencies
- Standardized benchmark exports for single- & multi-channel evaluation
- Optional 30s windowed cuts for Whisper / DiCoW training
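The optional 30 s cuts match Whisper's fixed 30-second input context. The repo produces them via Lhotse, but the underlying windowing arithmetic is simple; here is a stdlib-only sketch (illustrative, not the repo's actual implementation):

```python
# Minimal sketch of fixed-length windowing: split a recording of given
# duration into consecutive 30 s windows; the last window may be shorter.

def window_bounds(total_duration: float, window: float = 30.0):
    """Return (start, end) pairs covering the full recording."""
    bounds = []
    start = 0.0
    while start < total_duration:
        end = min(start + window, total_duration)
        bounds.append((start, end))
        start += window
    return bounds

print(window_bounds(75.0))  # → [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]
```

In practice Lhotse's `CutSet.cut_into_windows(duration=30.0)` does this (and carries the supervisions along), which is what the generated `*_30s` cutsets correspond to.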
EMMA Leaderboard
Official JSALT 2025 benchmark built directly on cutsets from this repo.
Single-mic: librispeech, librimix, librispeechmix, ali_meeting-sdm, ami-sdm, ami-ihm-mix, notsofar1-sdm
Multi-mic: aishell4, ali_meeting-mdm, ami-mdm, notsofar1-mdm
Requirements: Python 3.9+, lhotse, huggingface-hub (only for NOTSOFAR-1), sox (AliMeeting)
```
pip install -r requirements.txt
```

If you are preparing NOTSOFAR-1, you first need to gain access to the HuggingFace dataset through this link. Then set up a HuggingFace token (Tutorial Link) and run the command below:
```
export HF_TOKEN="{YOUR HF TOKEN}"
```

Prepare all datasets:
```
./prepare.sh --root-dir /path/to/workdir
```

If the NOTSOFAR-1 download fails due to an API request limit, rerun the NOTSOFAR-1 preparation until it succeeds: `./prepare.sh -d notsofar1-sdm,notsofar1-mdm`.
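The rerun-until-it-succeeds step can be automated with a small retry loop. A hedged sketch, assuming `prepare.sh` exits non-zero when the download hits the API request limit (`retry` and `flaky` below are illustrative helpers, not part of the repo):

```shell
# Retry a command until it exits successfully.
retry() {
  until "$@"; do
    echo "command failed; retrying..." >&2
    sleep 1  # use a longer delay (e.g. 60s) against real API limits
  done
}

# Demo with a stand-in command that fails twice before succeeding:
n=0
flaky() { n=$((n + 1)); [ "$n" -ge 3 ]; }
retry flaky
echo "succeeded after $n attempts"
```

Against the real repo you would call, e.g., `retry ./prepare.sh -d notsofar1-sdm,notsofar1-mdm --root-dir /path/to/workdir`.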
Prepare selected datasets:
```
./prepare.sh --datasets notsofar1-sdm,ami-sdm --root-dir /path/to/workdir
```

Prepare all single-mic datasets:
```
./prepare.sh --single-mic-only --root-dir /path/to/workdir
```

- Manifests → `manifests/<dataset>/*.jsonl.gz`
- 30s cutsets → e.g. `manifests/librimix/librimix_cutset_*_30s.jsonl.gz`
- Orchestration scripts → `prepare.sh`, `prepare_single_mic.sh`, `prepare_multi_mic.sh`
- Dataset runners → `dataset_scripts/prepare_*.sh`
- Utilities → `src/*.py`
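The manifests and cutsets are gzip-compressed JSON Lines files. They load directly with Lhotse (`CutSet.from_file(...)`), but the container format itself needs only the standard library; a minimal sketch (the field names below are illustrative, not the exact Lhotse schema):

```python
import gzip
import json
from pathlib import Path

# Write a toy file in the same container format (.jsonl.gz):
# one JSON object per line, gzip-compressed.
path = Path("toy_cutset.jsonl.gz")
cuts = [
    {"id": "utt-0001", "start": 0.0, "duration": 30.0},
    {"id": "utt-0002", "start": 30.0, "duration": 30.0},
]
with gzip.open(path, "wt", encoding="utf-8") as f:
    for cut in cuts:
        f.write(json.dumps(cut) + "\n")

# Read it back line by line.
with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded), loaded[0]["id"])  # → 2 utt-0001
```

Because the format is line-oriented, manifests can be streamed, concatenated, and inspected with ordinary tools (e.g. `zcat manifests/<dataset>/*.jsonl.gz | head`).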
If you have further questions or are interested in our other work, contact us: [email protected], [email protected].
- Cite the original datasets + Lhotse
- Respect dataset licenses
If this repo, its cutsets, or the evaluation protocol were useful, please also cite DiCoW:
```bibtex
@article{POLOK2026101841,
  title   = {{DiCoW}: Diarization-conditioned {Whisper} for target speaker automatic speech recognition},
  journal = {Computer Speech \& Language},
  volume  = {95},
  pages   = {101841},
  year    = {2026},
  doi     = {10.1016/j.csl.2025.101841},
  author  = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget}
}
```