The Voice Type Classifier is a classification model that, given an input audio file, outputs a segmentation of who speaks when, labelled by speaker type.
The four classes that the model will output are:
- FEM stands for adult female speech
- MAL stands for adult male speech
- KCHI stands for key-child speech
- OCH stands for other child speech
The model has been specifically trained to work with child-centered long-form recordings. These are recordings that can span multiple hours and have been collected using a portable recorder attached to the vest of a child (usually 0 to 5 years of age).
To use the model, you will need a Unix-based machine (Linux or macOS) with Python 3.13 or higher installed. Windows is not supported at the moment.
You will then need to install the required packages listed in the requirements.txt file (generated from the pyproject.toml) using either pip or the uv package manager.
As a system dependency, make sure ffmpeg is installed so that audio files can be read.
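If you want to check your environment first, the commands below are standard ones; the install lines for ffmpeg assume the usual package names on Debian/Ubuntu (apt) and macOS (Homebrew):

```bash
# Check the Python version (3.13 or higher is required)
python3 --version

# Check that ffmpeg is available
ffmpeg -version

# If ffmpeg is missing, install it, e.g.:
#   sudo apt install ffmpeg   # Debian/Ubuntu
#   brew install ffmpeg       # macOS (Homebrew)
```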
You can now clone the repo and set up the dependencies:
```bash
git clone https://github.com/LAAC-LSCP/VTC.git
cd VTC

# Option 1: with uv
uv sync

# Option 2: with pip in a virtual environment
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Inference is done using a checkpoint of the model, linking the corresponding config file used for training and the list of audio files to run the model on. Your audio files should be in the .wav format, sampled at 16 kHz (16,000 Hz), and contain a single channel (mono).
If not, you can use the scripts/convert.py script to convert your audio files to 16,000 Hz and average the channels down to mono.
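If you prefer, the same conversion can also be done with a plain ffmpeg call; this is only a one-off sketch (input.wav and output.wav are placeholder names), and scripts/convert.py remains the supported route:

```bash
# Resample to 16 kHz and downmix to a single (averaged) channel
ffmpeg -i input.wav -ar 16000 -ac 1 output.wav
```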
```bash
# --wavs:   path to the folder containing the audio files
# --output: output folder
# --device: device to run the model on ('cpu', 'cuda' or 'gpu', 'mps')
uv run scripts/infer.py \
    --wavs audios \
    --output predictions \
    --device cpu
```

The output of the model (with segment merging applied, see this pyannote.audio description) will be in <output_folder>/rttm. The raw outputs, without segment merging applied, are in <output_folder>/raw_rttm.
Additionally, a CSV version of the detected speaker segments is created in <output_folder>/rttm.csv and <output_folder>/raw_rttm.csv.
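As a quick sanity check, you can peek at one of the produced RTTM files. The file name below is a placeholder, and the example line only illustrates the standard RTTM field layout used by pyannote.audio; the actual values will depend on your recording:

```bash
# Peek at the merged segments for one recording (placeholder file name)
head predictions/rttm/recording1.rttm

# Each RTTM line follows the standard 10-field layout, e.g.:
#   SPEAKER recording1 1 12.340 1.250 <NA> <NA> KCHI <NA> <NA>
# i.e. a KCHI segment starting at 12.34 s and lasting 1.25 s.
```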
An example bash script for performing inference is provided in scripts/run.sh. Simply set the correct variables in the script (a hypothetical sketch of them is shown below) and run it.
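The sketch below is only a guess at what those variables might look like; the actual variable names inside scripts/run.sh may differ, so check the script itself:

```bash
# Hypothetical variables; the real names in scripts/run.sh may differ
WAVS=audios          # folder with the 16 kHz mono .wav files
OUTPUT=predictions   # where the RTTM/CSV outputs are written
DEVICE=cuda          # 'cpu', 'cuda'/'gpu', or 'mps'

uv run scripts/infer.py --wavs "$WAVS" --output "$OUTPUT" --device "$DEVICE"
```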
Then launch the script:

```bash
sh scripts/run.sh
```

We tested the inference pipeline on multiple GPUs and CPUs and report the measured speedup factors, which can be used to estimate the total time needed to process your recordings.
(Tables: GPU times and CPU times, listing the measured speedup factor for each device tested.)
On a GPU (a speedup factor of roughly 905x), it takes approximately:

- For a $1\text{ h}$ long audio, the inference will run for $\approx 4$ seconds ($3600 / 905$).
- For a $16\text{ h}$ long-form audio, the inference will run for $\approx 1 \text{ minute}$ and $4 \text{ seconds}$ ($16 \times 3600 / 905$).
On an Intel(R) Xeon(R) Silver 4214R CPU with a batch size of 64 (a speedup factor of roughly 15x), the inference pipeline will be quite slow:

- For a $1\text{ h}$ long audio, the inference will run for $\approx 4$ minutes ($3600 / 15$).
- For a $16\text{ h}$ long-form audio, the inference will run for $\approx 1 \text{ hour}$ and $4 \text{ minutes}$ ($16 \times 3600 / 15$).
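As a rule of thumb, the expected processing time is simply the audio duration divided by the device's speedup factor. The sketch below just redoes the arithmetic above in shell, using the 905x (GPU) and 15x (CPU) factors from these runs:

```bash
# Estimated processing time = audio duration / speedup factor
AUDIO_HOURS=16
echo "GPU: $(( AUDIO_HOURS * 3600 / 905 )) s"   # 63 s, i.e. about a minute
echo "CPU: $(( AUDIO_HOURS * 3600 / 15 )) s"    # 3840 s, i.e. about 1 h 4 min
```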
We evaluate the new model, VTC 2.0, on a held-out set and compare it to the previous models and to human performance (a second human annotator, Human 2).
| Model | KCHI | OCH | MAL | FEM | Average F1-score |
|---|---|---|---|---|---|
| VTC 1.0 | 68.2 | 30.5 | 41.2 | 63.7 | 50.9 |
| VTC 1.5 | 68.4 | 20.6 | 56.7 | 68.9 | 53.6 |
| **VTC 2.0** | **71.8** | **51.4** | **60.3** | **74.8** | **64.6** |
| Human 2 | 79.7 | 60.4 | 67.6 | 71.5 | 69.8 |
Table 1: F1-scores (%) obtained on the standard test set by VTC 1.0, VTC 1.5, VTC 2.0, and a second human annotator. The best model is indicated in bold.
As displayed in Table 1, our model performs better than previous iterations, with performance close to that of a human annotator. VTC 2.0 even surpasses human performance on the FEM class.
- OVL: overlapping speech between speakers.
- SIL: sections with silence/noise.
To cite this work, please use the following BibTeX entry:
```bibtex
@misc{charlot2025babyhubertmultilingualselfsupervisedlearning,
title={BabyHuBERT: Multilingual Self-Supervised Learning for Segmenting Speakers in Child-Centered Long-Form Recordings},
author={Théo Charlot and Tarek Kunze and Maxime Poli and Alejandrina Cristia and Emmanuel Dupoux and Marvin Lavechin},
year={2025},
eprint={2509.15001},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2509.15001},
}
```

The Voice Type Classifier has benefited from numerous contributions over time; the following publications document its evolution, listed in reverse chronological order.
GitHub repository: github.com/LAAC-LSCP/VTC-IS-25
```bibtex
@inproceedings{kunze25_interspeech,
title = {{Challenges in Automated Processing of Speech from Child Wearables: The Case of Voice Type Classifier}},
author = {Tarek Kunze and Marianne Métais and Hadrien Titeux and Lucas Elbert and Joseph Coffey and Emmanuel Dupoux and Alejandrina Cristia and Marvin Lavechin},
year = {2025},
booktitle = {{Interspeech 2025}},
pages = {2845--2849},
doi = {10.21437/Interspeech.2025-1962},
issn = {2958-1796},
}
```

GitHub repository: github.com/MarvinLvn/voice-type-classifier
```bibtex
@inproceedings{lavechin20_interspeech,
title = {An Open-Source Voice Type Classifier for Child-Centered Daylong Recordings},
author = {Marvin Lavechin and Ruben Bousbib and Hervé Bredin and Emmanuel Dupoux and Alejandrina Cristia},
year = {2020},
booktitle = {Interspeech 2020},
pages = {3072--3076},
doi = {10.21437/Interspeech.2020-1690},
issn = {2958-1796},
}
```

This work uses the segma library, which is heavily inspired by pyannote.audio.
This work was performed using HPC resources from GENCI-IDRIS (Grant 2024-AD011015450 and 2025-AD011016414) and was developed as part of the ExELang project funded by the European Union (ERC, ExELang, Grant No 101001095).

