Quality Metrics

Alexander Veysov edited this page Dec 7, 2021 · 24 revisions

Silero VAD quality metrics

Info

Test Dataset

Test Dataset Description:

  • 30+ languages;
  • 2,200 utterances with an average duration of 7 seconds;
  • In-the-wild dataset, i.e. vastly different audio domains (calls, user recordings, various audios sourced from the Internet);
  • Various SNR (from crisp studio recordings to DIY recordings with background speech or noise) and noise levels;
  • Whole utterances are annotated with 1/0 labels (i.e. is any speech present), not short chunks annotated at millisecond level;
  • 55% of the test set has speech;

Speech and music detection are inherently ambiguous at the level of short chunks (e.g. 30, 100 or 250 ms), so we chose the simplest definition possible.

Probability

Modern Voice Activity Detectors output speech probability (a float between 0 and 1) of an audio chunk of a desired length using:

  • Some pre-trained model;
  • Some function of its state or some internal buffer;

Threshold

The threshold is a user-selected value that determines whether an audio chunk is treated as speech. If the speech probability of a chunk is higher than the threshold, we assume the chunk contains speech. Depending on the desired result, the threshold should be tuned for a specific dataset or domain.

Method

We consider that a VAD algorithm has predicted speech for a whole audio if its chunk-level predictions contain at least one uninterrupted sequence of probabilities above a given threshold that is longer than 250 ms. Short silences of up to 250 ms are allowed within this sequence.
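As an illustration, the utterance-level rule above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the evaluation code behind these metrics: the function name `utterance_has_speech` and its parameters are hypothetical, and the sketch assumes that bridged silences count toward the duration of the sequence, which the description does not specify.

```python
from typing import Sequence


def utterance_has_speech(
    probs: Sequence[float],
    threshold: float,
    chunk_ms: float,
    min_speech_ms: float = 250.0,
    max_gap_ms: float = 250.0,
) -> bool:
    """Return True if the chunk probabilities contain a speech run longer
    than min_speech_ms, tolerating silences of up to max_gap_ms inside it.

    Assumption: a tolerated silence is counted as part of the run length.
    """
    run_ms = 0.0  # duration of the current speech run, bridged gaps included
    gap_ms = 0.0  # silence accumulated since the last speech chunk
    for p in probs:
        if p > threshold:
            if run_ms > 0:
                run_ms += gap_ms  # absorb the short silence we just bridged
            run_ms += chunk_ms
            gap_ms = 0.0
            if run_ms > min_speech_ms:
                return True
        elif run_ms > 0:
            gap_ms += chunk_ms
            if gap_ms > max_gap_ms:
                # The silence is too long: the run is interrupted.
                run_ms = 0.0
                gap_ms = 0.0
    return False
```

With 32 ms chunks, eight consecutive chunks above the threshold (256 ms) are enough to mark the utterance as speech, while seven (224 ms) are not.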

The test method can be described as follows:

  • Get raw model predictions (a sequence of speech probabilities between 0 and 1) for each audio in the test set;
  • Apply different thresholds to the raw predictions to decide whether there is speech in the whole audio;
  • Calculate recall, precision, accuracy and zero-class recall for each threshold;
  • Draw the Precision-Recall curve;
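The metric step of the list above could look roughly like this in Python. It is a hedged sketch: `binary_metrics`, `sweep_thresholds` and the `decide` callback are hypothetical names introduced for illustration, and a ratio with an empty denominator is reported as 0.0 by assumption.

```python
from typing import Callable, Dict, List, Sequence


def binary_metrics(labels: Sequence[int], preds: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, accuracy and zero-class recall for 1/0 labels."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)

    def ratio(num: int, den: int) -> float:
        # Assumption: report 0.0 when the denominator is empty.
        return num / den if den else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "accuracy": ratio(tp + tn, len(pairs)),
        "zero_class_recall": ratio(tn, tn + fp),
    }


def sweep_thresholds(
    labels: Sequence[int],
    utterance_probs: Sequence[Sequence[float]],
    thresholds: Sequence[float],
    decide: Callable[[Sequence[float], float], bool],
) -> List[Dict[str, float]]:
    """Compute the metrics once per threshold.

    `decide(probs, threshold)` turns the chunk probabilities of one
    utterance into a single speech / no-speech decision.
    """
    rows = []
    for t in thresholds:
        preds = [int(decide(probs, t)) for probs in utterance_probs]
        row = {"threshold": t}
        row.update(binary_metrics(labels, preds))
        rows.append(row)
    return rows
```

Plotting the `precision` and `recall` values of the returned rows against each other gives the Precision-Recall curve.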

Metrics

Silero VAD vs Other Available Solutions

Parameters: 16000 Hz sampling rate, chunks of 512 samples (32 ms, nominally 30 ms).
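The chunk durations quoted on this page are rounded. The exact duration follows from the chunk size and the sampling rate, as this small Python sketch shows (the helper name `chunk_duration_ms` is hypothetical):

```python
def chunk_duration_ms(num_samples: int, sample_rate: int) -> float:
    """Duration in milliseconds of a chunk of num_samples at sample_rate Hz."""
    return num_samples * 1000 / sample_rate


# Chunk sizes mentioned on this page, all at 16000 Hz.
for samples in (512, 1536, 4000):
    print(samples, chunk_duration_ms(samples, 16000))
# 512 32.0
# 1536 96.0
# 4000 250.0
```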

WebRTC VAD algorithm is extremely fast and pretty good at separating noise from silence, but pretty poor at separating speech from noise.

Picovoice VAD is good overall, but we were able to surpass it in quality (as of the end of 2021).

[Figure: Vs_competitors]

Silero VAD Vs Old Silero VAD

Parameters: 16000 Hz sampling rate, chunks of 1536 samples (96 ms, nominally 100 ms) for the new VAD models and 4000 samples (250 ms) for the old ones.

As you can see, there was a huge jump in the model's quality. The Silero Big model is not publicly available; please contact us if you are interested in it.

[Figure: Vs_old]

Sample Rate Comparison

[Figure: Diff_sr]

Chunk Size Comparison

[Figure: Diff_chunk_size]
