Voice Type Classifier (VTC) 2.0

The Voice Type Classifier is a classification model that, given an input audio file, outputs a precise segmentation of the speakers.

The four classes that the model will output are:

  • FEM stands for adult female speech
  • MAL stands for adult male speech
  • KCHI stands for key-child speech
  • OCH stands for other child speech

The model has been specifically trained to work with child-centered long-form recordings. These are recordings that can span multiple hours and have been collected using a portable recorder attached to the vest of a child (usually 0 to 5 years of age).

Table of contents

  1. Usage
  2. Model Performance
  3. Citation
  4. Acknowledgement

Usage

To use the model, you will need a Unix-based machine (Linux or macOS) with Python 3.13 or higher installed. Windows is not supported at the moment.

You will then need to install the required packages listed in the requirements.txt file (generated from pyproject.toml), using either pip or the uv package manager. You also need ffmpeg installed as a system dependency to be able to read audio files.

You can now clone the repo and set up the dependencies:

git clone https://github.com/LAAC-LSCP/VTC.git
cd VTC

Installing dependencies with uv (recommended)

uv sync # or pip install -r requirements.txt

Installing dependencies with pip and Python 3.13 or higher (not recommended)

python -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt
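
Whichever installer you used, a quick sanity check can catch missing system dependencies before running inference. The snippet below is only a sketch and not part of the repository: it assumes the requirements pull in PyTorch (which the 'cuda' and 'mps' device options suggest) and that ffmpeg is on your PATH.

# check_env.py -- hypothetical helper, not part of the repository.
import shutil
import sys

# The README requires Python 3.13+ and ffmpeg as a system dependency.
assert sys.version_info >= (3, 13), "Python 3.13 or higher is required"
assert shutil.which("ffmpeg") is not None, "ffmpeg must be installed and on PATH"

# Assumption: PyTorch is among the installed requirements.
import torch

print("CUDA available:", torch.cuda.is_available())
print("MPS available:", torch.backends.mps.is_available())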

Inference

Inference is done using a checkpoint of the model, the corresponding config file used for training, and the list of audio files to run the model on. Your audio files should be in .wav format, sampled at 16 kHz (16 000 Hz), and contain a single channel (mono). If not, you can use scripts/convert.py to resample your audio to 16 kHz and average the channels.
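
If you prefer to handle the conversion yourself, the following sketch does the same thing by calling ffmpeg directly. It is not part of the repository, and the folder names audios_raw and audios are placeholders:

# convert_to_16k_mono.py -- hypothetical helper, equivalent in spirit to scripts/convert.py.
import subprocess
from pathlib import Path

IN_DIR = Path("audios_raw")   # placeholder: original recordings
OUT_DIR = Path("audios")      # placeholder: folder passed to --wavs below
OUT_DIR.mkdir(exist_ok=True)

for src in sorted(IN_DIR.iterdir()):
    if src.suffix.lower() not in {".wav", ".mp3", ".flac", ".m4a"}:
        continue
    dst = OUT_DIR / (src.stem + ".wav")
    # -ar 16000: resample to 16 kHz; -ac 1: downmix to a single (mono) channel
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-ar", "16000", "-ac", "1", str(dst)],
        check=True,
    )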

# --wavs: path to the folder containing the audio files
# --output: output folder
# --device: device to run the model on ('cpu', 'cuda'/'gpu', or 'mps')
uv run scripts/infer.py \
    --wavs audios \
    --output predictions \
    --device cpu

The output of the model (with segment merging applied, see the pyannote.audio documentation) will be in <output_folder>/rttm. The raw outputs, without segment merging, are in <output_folder>/raw_rttm. Additionally, a CSV version of the detected speaker segments is created in <output_folder>/rttm.csv and <output_folder>/raw_rttm.csv.
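
RTTM is a plain-text format with ten space-separated fields per segment, so the predictions are easy to post-process. The snippet below is only an illustration, assuming one .rttm file per audio inside the <output_folder>/rttm folder produced by the command above; it sums the detected speech duration per class:

# summarize_rttm.py -- illustrative post-processing, not part of the repository.
from collections import defaultdict
from pathlib import Path

RTTM_DIR = Path("predictions/rttm")  # the --output folder used above, plus /rttm

totals = defaultdict(float)
for rttm_file in sorted(RTTM_DIR.glob("*.rttm")):
    for line in rttm_file.read_text().splitlines():
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue
        # Standard RTTM fields (1-based): 4 = onset, 5 = duration, 8 = speaker label
        duration, label = float(fields[4]), fields[7]
        totals[label] += duration

for label in ("KCHI", "OCH", "MAL", "FEM"):
    print(f"{label}: {totals[label]:.1f} s of detected speech")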

Helper script

An example bash script for running inference is provided in scripts/run.sh. Simply set the correct variables in the script and run it:

sh scripts/run.sh

Model Performance

Runtime

We tested the inference pipeline on multiple GPUs and CPUs and report the measured speedup factors, which can be used to estimate the total time needed to process $x$ hours of audio.

Table 1: GPU times

Batch size   Hardware           Speedup factor
64           Quadro RTX 8000    1/152
128          Quadro RTX 8000    1/286
256          Quadro RTX 8000    1/531
64           A40                1/450
128          A40                1/358
256          A40                1/650
64           H100               1/182
128          H100               1/466
256          H100               1/905

Table 2: CPU times

Batch size   Hardware                        Speedup factor
64           Intel(R) Xeon(R) Silver 4214R   1/16
128          Intel(R) Xeon(R) Silver 4214R   1/15
256          Intel(R) Xeon(R) Silver 4214R   1/16
64           AMD EPYC 7453 28-Core           1/20
128          AMD EPYC 7453 28-Core           1/21
256          AMD EPYC 7453 28-Core           1/22
64           AMD EPYC 9334 32-Core           1/25
128          AMD EPYC 9334 32-Core           1/26
256          AMD EPYC 9334 32-Core           1/29

It takes approximately $1/905$ of the audio duration to run the model with a batch size of 256 on an H100 GPU.

  • For a $1\text{ h}$ audio, inference runs for $\approx 4$ seconds ($3600 / 905$).
  • For a $16\text{ h}$ long-form audio, inference runs for $\approx 1$ minute and $4$ seconds ($16 \times 3600 / 905$).

On an Intel(R) Xeon(R) Silver 4214R CPU with a batch size of 128, the inference pipeline is much slower:

  • For a $1\text{ h}$ audio, inference runs for $\approx 4$ minutes ($3600 / 15$).
  • For a $16\text{ h}$ long-form audio, inference runs for $\approx 1$ hour and $4$ minutes ($16 \times 3600 / 15$).
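
The same back-of-the-envelope estimate can be scripted. The helper below is only an illustration, not part of the repository, with a few speedup factors copied from Tables 1 and 2:

# estimate_runtime.py -- illustration of the estimates above, not part of the repository.
# Speedup factor = audio duration / processing time, copied from Tables 1 and 2.
SPEEDUP = {
    ("H100", 256): 905,
    ("Quadro RTX 8000", 256): 531,
    ("Intel(R) Xeon(R) Silver 4214R", 128): 15,
}

def estimated_seconds(audio_hours: float, hardware: str, batch_size: int) -> float:
    """Estimated processing time, in seconds, for audio_hours hours of audio."""
    return audio_hours * 3600 / SPEEDUP[(hardware, batch_size)]

print(estimated_seconds(16, "H100", 256))                                # ~64 seconds
print(estimated_seconds(16, "Intel(R) Xeon(R) Silver 4214R", 128) / 60)  # ~64 minutes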

Model Performance on the heldout set

We evaluate the new model, VTC 2.0, on a heldout set and compare it to previous models and to human performance (Human 2).

Model     KCHI   OCH    MAL    FEM    Average F1-score
VTC 1.0   68.2   30.5   41.2   63.7   50.9
VTC 1.5   68.4   20.6   56.7   68.9   53.6
VTC 2.0   71.8   51.4   60.3   74.8   64.6
Human 2   79.7   60.4   67.6   71.5   69.8

Table 3: F1-scores (%) obtained on the heldout test set by VTC 1.0, VTC 1.5, VTC 2.0, and a second human annotator. The best model is indicated in bold.

As shown in Table 3, our model outperforms previous iterations, with performance close to that of the human annotator. VTC 2.0 even surpasses human performance on the FEM class.
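
For reference, the per-class numbers reported above are F1-scores, i.e. the harmonic mean of precision and recall for each class:

$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$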

Confusion Matrices on the heldout set

In addition to the four target classes, the confusion matrices include two extra labels:

  • OVL: overlapping speech from several speakers.
  • SIL: sections containing silence or noise.


Citation

To cite this work, please use the following BibTeX entry.

@misc{charlot2025babyhubertmultilingualselfsupervisedlearning,
    title={BabyHuBERT: Multilingual Self-Supervised Learning for Segmenting Speakers in Child-Centered Long-Form Recordings}, 
    author={Théo Charlot and Tarek Kunze and Maxime Poli and Alejandrina Cristia and Emmanuel Dupoux and Marvin Lavechin},
    year={2025},
    eprint={2509.15001},
    archivePrefix={arXiv},
    primaryClass={eess.AS},
    url={https://arxiv.org/abs/2509.15001}, 
}

Acknowledgement

The Voice Type Classifier has benefited from numerous contributions over time; the following publications document its evolution, listed in reverse chronological order.

1. VTC 1.5 (Whisper-VTC)

GitHub repository: github.com/LAAC-LSCP/VTC-IS-25

@inproceedings{kunze25_interspeech,
    title     = {{Challenges in Automated Processing of Speech from Child Wearables:  The Case of Voice Type Classifier}},
    author    = {Tarek Kunze and Marianne Métais and Hadrien Titeux and Lucas Elbert and Joseph Coffey and Emmanuel Dupoux and Alejandrina Cristia and Marvin Lavechin},
    year      = {2025},
    booktitle = {{Interspeech 2025}},
    pages     = {2845--2849},
    doi       = {10.21437/Interspeech.2025-1962},
    issn      = {2958-1796},
}

2. VTC 1.0 (PyanNet-VTC)

GitHub repository: github.com/MarvinLvn/voice-type-classifier

@inproceedings{lavechin20_interspeech,
    title     = {An Open-Source Voice Type Classifier for Child-Centered Daylong Recordings},
    author    = {Marvin Lavechin and Ruben Bousbib and Hervé Bredin and Emmanuel Dupoux and Alejandrina Cristia},
    year      = {2020},
    booktitle = {Interspeech 2020},
    pages     = {3072--3076},
    doi       = {10.21437/Interspeech.2020-1690},
    issn      = {2958-1796},
}

This work uses the segma library, which is heavily inspired by pyannote.audio.

This work was performed using HPC resources from GENCI-IDRIS (Grant 2024-AD011015450 and 2025-AD011016414) and was developed as part of the ExELang project funded by the European Union (ERC, ExELang, Grant No 101001095).
