Voice Type Classifier (VTC) 2.0

The Voice Type Classifier is a classification model that, given an input audio file, outputs a precise segmentation of the speakers.

The four classes that the model will output are:

  • FEM stands for adult female speech
  • MAL stands for adult male speech
  • KCHI stands for key-child speech
  • OCH stands for other child speech

The model has been specifically trained to work with child-centered long-form recordings. These are recordings that can span multiple hours and have been collected using a portable recorder attached to the vest of a child (usually 0 to 5 years of age).

Table of contents

  1. Usage
  2. Model Performance
  3. Citation
  4. Acknowledgement

Usage

To use the model, you will need a Unix-based machine (Linux or macOS) with Python 3.13 or higher installed. Windows is not supported at the moment.

You will then need to install the required packages listed in the requirements.txt file (generated from pyproject.toml), using either pip or the uv package manager. You also need ffmpeg installed as a system dependency to be able to read audio files.

You can now clone the repo and set up the dependencies:

git clone https://github.com/LAAC-LSCP/VTC.git
cd VTC

Installing dependencies with uv (recommended)

uv sync # or pip install -r requirements.txt

Installing dependencies with pip and Python 3.13 or higher (not recommended)

python -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt
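
Whichever installer you used, a quick sanity check can catch missing system dependencies before running inference. The snippet below is only a sketch and not part of the repository: it assumes the requirements pull in PyTorch (which the 'cuda' and 'mps' device options suggest) and that ffmpeg is on your PATH.

# check_env.py -- hypothetical helper, not part of the repository.
import shutil
import sys

# The README requires Python 3.13+ and ffmpeg as a system dependency.
assert sys.version_info >= (3, 13), "Python 3.13 or higher is required"
assert shutil.which("ffmpeg") is not None, "ffmpeg must be installed and on PATH"

# Assumption: PyTorch is among the installed requirements.
import torch

print("CUDA available:", torch.cuda.is_available())
print("MPS available:", torch.backends.mps.is_available())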

Inference

Inference is done using a checkpoint of the model, the corresponding config file used for training, and the list of audio files to run the model on. Your audio files should be in .wav format, sampled at 16 kHz (16 000 Hz), and contain a single channel (mono). If not, you can use scripts/convert.py to resample your audio to 16 kHz and average the channels.
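
If you prefer to handle the conversion yourself, the following sketch does the same thing by calling ffmpeg directly. It is not part of the repository, and the folder names audios_raw and audios are placeholders:

# convert_to_16k_mono.py -- hypothetical helper, equivalent in spirit to scripts/convert.py.
import subprocess
from pathlib import Path

IN_DIR = Path("audios_raw")   # placeholder: original recordings
OUT_DIR = Path("audios")      # placeholder: folder passed to --wavs below
OUT_DIR.mkdir(exist_ok=True)

for src in sorted(IN_DIR.iterdir()):
    if src.suffix.lower() not in {".wav", ".mp3", ".flac", ".m4a"}:
        continue
    dst = OUT_DIR / (src.stem + ".wav")
    # -ar 16000: resample to 16 kHz; -ac 1: downmix to a single (mono) channel
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-ar", "16000", "-ac", "1", str(dst)],
        check=True,
    )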

# --wavs: path to the folder containing the audio files
# --output: output folder
# --device: device to run the model on ('cpu', 'cuda'/'gpu', or 'mps')
uv run scripts/infer.py \
    --wavs audios \
    --output predictions \
    --device cpu

The output of the model (with segment merging applied, see the pyannote.audio documentation) will be in <output_folder>/rttm. The raw outputs, without segment merging, are in <output_folder>/raw_rttm. Additionally, a CSV version of the detected speaker segments is created in <output_folder>/rttm.csv and <output_folder>/raw_rttm.csv.
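
RTTM is a plain-text format with ten space-separated fields per segment, so the predictions are easy to post-process. The snippet below is only an illustration, assuming one .rttm file per audio inside the <output_folder>/rttm folder produced by the command above; it sums the detected speech duration per class:

# summarize_rttm.py -- illustrative post-processing, not part of the repository.
from collections import defaultdict
from pathlib import Path

RTTM_DIR = Path("predictions/rttm")  # the --output folder used above, plus /rttm

totals = defaultdict(float)
for rttm_file in sorted(RTTM_DIR.glob("*.rttm")):
    for line in rttm_file.read_text().splitlines():
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue
        # Standard RTTM fields (1-based): 4 = onset, 5 = duration, 8 = speaker label
        duration, label = float(fields[4]), fields[7]
        totals[label] += duration

for label in ("KCHI", "OCH", "MAL", "FEM"):
    print(f"{label}: {totals[label]:.1f} s of detected speech")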

Helper script

An example bash script for running inference is provided in scripts/run.sh. Simply set the correct variables in the script and run it:

sh scripts/run.sh

Model Performance

Runtime

We tested the inference pipeline on multiple GPUs and CPUs and report the measured speedup factors, which can be used to estimate the total time needed to process $x$ hours of audio.

Table 1: GPU times

Batch size   Hardware           Speedup factor
64           Quadro RTX 8000    1/152
128          Quadro RTX 8000    1/286
256          Quadro RTX 8000    1/531
64           A40                1/450
128          A40                1/358
256          A40                1/650
64           H100               1/182
128          H100               1/466
256          H100               1/905

Table 2: CPU times

Batch size   Hardware                        Speedup factor
64           Intel(R) Xeon(R) Silver 4214R   1/16
128          Intel(R) Xeon(R) Silver 4214R   1/15
256          Intel(R) Xeon(R) Silver 4214R   1/16
64           AMD EPYC 7453 28-Core           1/20
128          AMD EPYC 7453 28-Core           1/21
256          AMD EPYC 7453 28-Core           1/22
64           AMD EPYC 9334 32-Core           1/25
128          AMD EPYC 9334 32-Core           1/26
256          AMD EPYC 9334 32-Core           1/29

It takes approximately $1/905$ of the audio duration to run the model with a batch size of 256 on an H100 GPU.

  • For a $1\text{ h}$ audio, inference runs for $\approx 4$ seconds ($3600 / 905$).
  • For a $16\text{ h}$ long-form audio, inference runs for $\approx 1$ minute and $4$ seconds ($16 \times 3600 / 905$).

On an Intel(R) Xeon(R) Silver 4214R CPU with a batch size of 128, the inference pipeline is much slower:

  • For a $1\text{ h}$ audio, inference runs for $\approx 4$ minutes ($3600 / 15$).
  • For a $16\text{ h}$ long-form audio, inference runs for $\approx 1$ hour and $4$ minutes ($16 \times 3600 / 15$).
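
The same back-of-the-envelope estimate can be scripted. The helper below is only an illustration, not part of the repository, with a few speedup factors copied from Tables 1 and 2:

# estimate_runtime.py -- illustration of the estimates above, not part of the repository.
# Speedup factor = audio duration / processing time, copied from Tables 1 and 2.
SPEEDUP = {
    ("H100", 256): 905,
    ("Quadro RTX 8000", 256): 531,
    ("Intel(R) Xeon(R) Silver 4214R", 128): 15,
}

def estimated_seconds(audio_hours: float, hardware: str, batch_size: int) -> float:
    """Estimated processing time, in seconds, for audio_hours hours of audio."""
    return audio_hours * 3600 / SPEEDUP[(hardware, batch_size)]

print(estimated_seconds(16, "H100", 256))                                # ~64 seconds
print(estimated_seconds(16, "Intel(R) Xeon(R) Silver 4214R", 128) / 60)  # ~64 minutes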

Model Performance on the heldout set

We evaluate the new model, VTC 2.0, on a heldout set and compare it to previous models and to human performance (Human 2).

Model     KCHI   OCH    MAL    FEM    Average F1-score
VTC 1.0   68.2   30.5   41.2   63.7   50.9
VTC 1.5   68.4   20.6   56.7   68.9   53.6
VTC 2.0   71.8   51.4   60.3   74.8   64.6
Human 2   79.7   60.4   67.6   71.5   69.8

Table 3: F1-scores (%) obtained on the heldout test set by VTC 1.0, VTC 1.5, VTC 2.0, and a second human annotator. The best model is indicated in bold.

As shown in Table 3, our model outperforms previous iterations, with performance close to that of the human annotator. VTC 2.0 even surpasses human performance on the FEM class.
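
For reference, the per-class numbers reported above are F1-scores, i.e. the harmonic mean of precision and recall for each class:

$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$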

Confusion Matrices on the heldout set

In addition to the four target classes, the confusion matrices include two extra labels:

  • OVL: overlapping speech from several speakers.
  • SIL: sections containing silence or noise.


Citation

To cite this work, please use the following BibTeX entry.

@misc{charlot2025babyhubertmultilingualselfsupervisedlearning,
    title={BabyHuBERT: Multilingual Self-Supervised Learning for Segmenting Speakers in Child-Centered Long-Form Recordings}, 
    author={Théo Charlot and Tarek Kunze and Maxime Poli and Alejandrina Cristia and Emmanuel Dupoux and Marvin Lavechin},
    year={2025},
    eprint={2509.15001},
    archivePrefix={arXiv},
    primaryClass={eess.AS},
    url={https://arxiv.org/abs/2509.15001}, 
}

Acknowledgement

The Voice Type Classifier has benefited from numerous contributions over time; the following publications document its evolution, listed in reverse chronological order.

1. VTC 1.5 (Whisper-VTC)

GitHub repository: github.com/LAAC-LSCP/VTC-IS-25

@inproceedings{kunze25_interspeech,
    title     = {{Challenges in Automated Processing of Speech from Child Wearables:  The Case of Voice Type Classifier}},
    author    = {Tarek Kunze and Marianne Métais and Hadrien Titeux and Lucas Elbert and Joseph Coffey and Emmanuel Dupoux and Alejandrina Cristia and Marvin Lavechin},
    year      = {2025},
    booktitle = {{Interspeech 2025}},
    pages     = {2845--2849},
    doi       = {10.21437/Interspeech.2025-1962},
    issn      = {2958-1796},
}

2. VTC 1.0 (PyanNet-VTC)

GitHub repository: github.com/MarvinLvn/voice-type-classifier

@inproceedings{lavechin20_interspeech,
    title     = {An Open-Source Voice Type Classifier for Child-Centered Daylong Recordings},
    author    = {Marvin Lavechin and Ruben Bousbib and Hervé Bredin and Emmanuel Dupoux and Alejandrina Cristia},
    year      = {2020},
    booktitle = {Interspeech 2020},
    pages     = {3072--3076},
    doi       = {10.21437/Interspeech.2020-1690},
    issn      = {2958-1796},
}

This work uses the segma library, which is heavily inspired by pyannote.audio.

This work was performed using HPC resources from GENCI-IDRIS (Grant 2024-AD011015450 and 2025-AD011016414) and was developed as part of the ExELang project funded by the European Union (ERC, ExELang, Grant No 101001095).
