One repo. One command. Ready-to-train datasets for multi-talker ASR.
Streamline data prep with reproducible manifests, benchmark-ready cutsets, and automatic dependency handling — all aligned with BUT-FIT DiCoW models.
- Unified CLI for single-mic and multi-mic datasets
- Reproducible & sharable manifests and cutsets (via Lhotse)
- Automatic handling of nested dataset dependencies
- Standardized benchmark exports for single- & multi-channel evaluation
- Optional 30s windowed cuts for Whisper / DiCoW training
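The optional 30 s cuts match Whisper's fixed 30-second input context. The repo produces them via Lhotse, but the underlying windowing arithmetic is simple; here is a stdlib-only sketch (illustrative, not the repo's actual implementation):

```python
# Minimal sketch of fixed-length windowing: split a recording of given
# duration into consecutive 30 s windows; the last window may be shorter.

def window_bounds(total_duration: float, window: float = 30.0):
    """Return (start, end) pairs covering the full recording."""
    bounds = []
    start = 0.0
    while start < total_duration:
        end = min(start + window, total_duration)
        bounds.append((start, end))
        start += window
    return bounds

print(window_bounds(75.0))  # → [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]
```

In practice Lhotse's `CutSet.cut_into_windows(duration=30.0)` does this (and carries the supervisions along), which is what the generated `*_30s` cutsets correspond to.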
EMMA Leaderboard
Official JSALT 2025 benchmark built directly on cutsets from this repo.
Single-mic: librispeech, librimix, librispeechmix, ali_meeting-sdm, ami-sdm, ami-ihm-mix, notsofar1-sdm
Multi-mic: aishell4, ali_meeting-mdm, ami-mdm, notsofar1-mdm
Requirements: Python 3.9+, lhotse, huggingface-hub (only for NOTSOFAR-1), sox (AliMeeting)
```
pip install -r requirements.txt
```

If you are preparing NOTSOFAR-1, you first need to gain access to the HuggingFace dataset through this link. Then set up a HuggingFace token (Tutorial Link) and run the command below:
```
export HF_TOKEN="{YOUR HF TOKEN}"
```

Prepare all datasets:
```
./prepare.sh --root-dir /path/to/workdir
```

If the NOTSOFAR-1 download fails due to an API request limit, rerun the NOTSOFAR-1 preparation until it succeeds: `./prepare.sh -d notsofar1-sdm,notsofar1-mdm`.
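The rerun-until-it-succeeds step can be automated with a small retry loop. A hedged sketch, assuming `prepare.sh` exits non-zero when the download hits the API request limit (`retry` and `flaky` below are illustrative helpers, not part of the repo):

```shell
# Retry a command until it exits successfully.
retry() {
  until "$@"; do
    echo "command failed; retrying..." >&2
    sleep 1  # use a longer delay (e.g. 60s) against real API limits
  done
}

# Demo with a stand-in command that fails twice before succeeding:
n=0
flaky() { n=$((n + 1)); [ "$n" -ge 3 ]; }
retry flaky
echo "succeeded after $n attempts"
```

Against the real repo you would call, e.g., `retry ./prepare.sh -d notsofar1-sdm,notsofar1-mdm --root-dir /path/to/workdir`.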
Prepare selected datasets:
```
./prepare.sh --datasets notsofar1-sdm,ami-sdm --root-dir /path/to/workdir
```

Prepare all single-mic datasets:
```
./prepare.sh --single-mic-only --root-dir /path/to/workdir
```

- Manifests → `manifests/<dataset>/*.jsonl.gz`
- 30s cutsets → e.g. `manifests/librimix/librimix_cutset_*_30s.jsonl.gz`
- Orchestration scripts → `prepare.sh`, `prepare_single_mic.sh`, `prepare_multi_mic.sh`
- Dataset runners → `dataset_scripts/prepare_*.sh`
- Utilities → `src/*.py`
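The manifests and cutsets are gzip-compressed JSON Lines files. They load directly with Lhotse (`CutSet.from_file(...)`), but the container format itself needs only the standard library; a minimal sketch (the field names below are illustrative, not the exact Lhotse schema):

```python
import gzip
import json
from pathlib import Path

# Write a toy file in the same container format (.jsonl.gz):
# one JSON object per line, gzip-compressed.
path = Path("toy_cutset.jsonl.gz")
cuts = [
    {"id": "utt-0001", "start": 0.0, "duration": 30.0},
    {"id": "utt-0002", "start": 30.0, "duration": 30.0},
]
with gzip.open(path, "wt", encoding="utf-8") as f:
    for cut in cuts:
        f.write(json.dumps(cut) + "\n")

# Read it back line by line.
with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded), loaded[0]["id"])  # → 2 utt-0001
```

Because the format is line-oriented, manifests can be streamed, concatenated, and inspected with ordinary tools (e.g. `zcat manifests/<dataset>/*.jsonl.gz | head`).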
If you have further questions or are interested in our other work, contact us: [email protected], [email protected].
- Cite the original datasets + Lhotse
- Respect dataset licenses
If this repo, its cutsets, or the evaluation protocol were useful, please also cite DiCoW:
```bibtex
@article{POLOK2026101841,
  title   = {{DiCoW}: Diarization-conditioned {Whisper} for target speaker automatic speech recognition},
  journal = {Computer Speech \& Language},
  volume  = {95},
  pages   = {101841},
  year    = {2026},
  doi     = {10.1016/j.csl.2025.101841},
  author  = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget}
}
```