TEN VAD rapidly detects speech-to-non-speech transitions, whereas Silero VAD suffers from a delay of several hundred milliseconds, resulting in increased end-to-end latency in human-agent interaction systems. In addition, as demonstrated in the 6.5s-7.0s audio segment, Silero VAD fails to identify short silent durations between adjacent speech segments.

Our window size is 30 ms, and it most likely takes several consecutive windows to detect the end of speech. But even a delay of 50-100 ms is not a big deal, since this is as accurate as a VAD can get anyway.
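To make the arithmetic concrete, here is a rough sketch of how a window-based detector ends up with an end-of-speech delay. It is not Silero VAD's actual code: the function names, the 0.5 threshold, and the "N consecutive silent windows" rule are assumptions chosen purely for illustration.

```python
import math


def end_of_speech_delay_ms(window_ms: float, min_silence_ms: float) -> float:
    """Delay before end of speech is confirmed, assuming the detector
    waits for enough whole windows to cover min_silence_ms of silence."""
    windows_needed = math.ceil(min_silence_ms / window_ms)
    return windows_needed * window_ms


def detect_speech_end(probs, threshold=0.5, windows_needed=3):
    """Return the index of the window where end of speech is confirmed,
    or None if it never is. `probs` holds one speech probability per window
    (hypothetical input, not tied to any specific library)."""
    in_speech = False
    silent_run = 0
    for i, p in enumerate(probs):
        if p >= threshold:
            # Speech window: start or continue a speech segment.
            in_speech = True
            silent_run = 0
        elif in_speech:
            # Silent window after speech: count toward confirmation.
            silent_run += 1
            if silent_run >= windows_needed:
                return i
    return None


if __name__ == "__main__":
    print(end_of_speech_delay_ms(30, 90))   # 90
    print(end_of_speech_delay_ms(30, 100))  # 120
    print(detect_speech_end([0.9, 0.9, 0.1, 0.1, 0.1, 0.1]))  # 4
```

With 30 ms windows, requiring roughly 100 ms of silence means waiting four windows, i.e. 120 ms. In this sketch, requiring fewer windows would cut the delay but make the detector more likely to split speech on short pauses.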

We benchmarked TEN VAD, and it does not perform very well on real-life audio; it behaves more like WebRTC VAD, hence the need to highlight this non-issue.

Answer selected by snakers4
This discussion was converted from issue #691 on October 02, 2025 05:21.